Simulating a web crawler with wget and a queue (without deduplication)

2025-02-07 Technical tutorial

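/*
 * Simulates a web crawler using the wget command and a queue.
 * Tested against my own site, wzsts.host3v.com; there are still some errors.
 * After running, the program fetches index.html.
 * Only links ending in .html or .htm are fetched; links ending in .com
 * are not handled (the number of files involved could be large).
 *
 * One test run revealed the following problems:
 *   1. Files should be stored in a tree that mirrors the site's directory layout.
 *   2. The links recorded in the text file should be categorized.
 *
 * The program must be run as root (su) because of the mv command.
 * It only works on Linux.
 * This project lays the groundwork for a real Linux crawler.
 */
#include <cstdlib>   // system
#include <cstring>   // strstr
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
using namespace std;

string s("index.html");  // seed page
queue<string> q;         // pages waiting to be processed

// (a section is omitted here; it defines fun(), which is called below)

int main() {
    // Fetch the seed page; wget saves it as index.html.
    system("wget wzsts.host3v.com");
    ofstream out("out.txt");  // log of every entry taken from the queue
    string mv("mv ");
    string html(" html");
    q.push(s);
    while (!q.empty()) {
        // Copy the front entry and pop it immediately, so that every
        // iteration consumes one item and front() never runs on an empty queue.
        string cur = q.front();
        q.pop();
        out << cur << "\n";
        // Only .html / .htm pages are processed and downloaded.
        if (strstr(cur.c_str(), ".html") || strstr(cur.c_str(), ".htm")) {
            fun(cur.c_str());
            // Download the current entry.
            string t("wget ");
            t = t + cur;
            cout << t << endl;
            system(t.c_str());
        }
        // Move the file to html with mv.
        string ss = mv + cur + html;
        cout << ss << endl;
        system(ss.c_str());
    }
    out.close();
    return 0;
}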
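
The title notes that this version has no deduplication, so a page linked from several places is queued more than once, and two pages that link to each other would keep the queue from ever emptying. A minimal sketch of how a visited set could sit next to the queue is given below. It is not the author's code: the names crawl_order and LinkExtractor are hypothetical, and the link-extraction callback is only a stand-in for the omitted fun() step, whose real behaviour is not shown in the listing.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical callback type: given a page name, returns the links found on it.
// It stands in for the link-extraction step, which the listing above omits.
using LinkExtractor =
    std::function<std::vector<std::string>(const std::string&)>;

// Breadth-first traversal from `seed`. Each page enters the queue at most
// once, because a page is queued only if inserting it into `seen` succeeds.
// Returns the pages in the order they were taken from the queue.
std::vector<std::string> crawl_order(const std::string& seed,
                                     const LinkExtractor& links_of) {
    std::queue<std::string> frontier;
    std::unordered_set<std::string> seen;
    std::vector<std::string> order;

    frontier.push(seed);
    seen.insert(seed);
    while (!frontier.empty()) {
        std::string cur = frontier.front();
        frontier.pop();
        order.push_back(cur);
        for (const std::string& next : links_of(cur)) {
            // insert() returns {iterator, bool}; the bool is true only
            // when `next` was not already in the set.
            if (seen.insert(next).second) {
                frontier.push(next);
            }
        }
    }
    return order;
}
```

In a real crawler the callback would download the page (for example with wget) and parse its links; keeping that behind a callback lets the queue logic be checked without any network access.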