Python Web Scraping: Parsing Page Results with BeautifulSoup

2024-12-18 Technical Tutorial

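BeautifulSoup makes it easy to parse the pages a Python crawler fetches. The example below scrapes job listings from the Boss site (zhipin.com), collecting the job title, salary, location, company name, and company funding stage. It also shows the typical way BeautifulSoup is used.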
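Before the full spider, the core BeautifulSoup calls it relies on (find_all, find, get_text, and attribute access) can be tried on a small inline document. This is a minimal sketch: the markup and href values below are made up for illustration and are not taken from the real page, and the built-in html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a category menu; not the real page.
SAMPLE_HTML = """
<div class="menu-sub">
  <ul>
    <li>
      <h5>Backend</h5>
      <a href="/example-java/">Java</a>
      <a href="/example-python/">Python</a>
    </li>
  </ul>
</div>
"""


def extract_links(html):
    """Return (category, link text, href) tuples for every link in the menu."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # find_all returns every matching element; class_ filters by CSS class
    for menu in soup.find_all(class_="menu-sub"):
        for li in menu.find_all("li"):
            # find returns only the first match (or None if there is none)
            category = li.find("h5").get_text(strip=True)
            for a in li.find_all("a"):
                # Attributes are read with dictionary-style access
                rows.append((category, a.get_text(strip=True), a["href"]))
    return rows


links = extract_links(SAMPLE_HTML)
for row in links:
    print(row)
```

The same nesting pattern (a class lookup, then list items, then links) is what the spider applies to the live home page.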

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Local helper module from the author's project (not shown in this post)
from middlewares import get_random_proxy, get_random_agent


class Boss_Spider(object):
    def __init__(self, page=3):
        self.proxies = []
        self.verify_pro = []
        self.page = page
        self.headers = {}

    # Step 1: collect every job-category link from the home page
    def Parse_pre(self):
        base_url = 'https://www.zhipin.com/'
        headers = get_random_agent()
        proxy = get_random_proxy()
        time.sleep(1)
        resp = requests.get(base_url, headers=headers, proxies=proxy)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, 'lxml')
            for job_menu in soup.find_all(class_='menu-sub'):
                for li in job_menu.find_all('li'):
                    job_type = li.find('h5').get_text()
                    for job_list in li.find_all('a'):
                        job_sub = job_list.get_text()
                        job_uri = job_list['href']
                        for i in range(0, 11):
                            # urljoin avoids a doubled slash when href starts with '/'
                            job_url = urljoin(base_url, job_uri) + '?page=%d&ka=page-%d' % (i, i)
                            meta = {
                                'job_type': job_type,
                                'job_sub': job_sub,
                            }
                            # Parse_index performs the request itself, so the page is fetched only once
                            self.Parse_index(meta=meta, url=job_url)

    # Step 2: scrape the listings on one result page
    def Parse_index(self, meta, url):
        headers = get_random_agent()
        proxy = get_random_proxy()
        time.sleep(1)
        resp = requests.get(url, headers=headers, proxies=proxy)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, 'lxml')
            job_list = soup.find(class_='job-list')
            if job_list is None:
                # The layout changed or the request was blocked; nothing to parse
                return
            for li in job_list.find_all('li'):
                print('###########')
                position = li.find(class_='job-title').get_text()
                salary = li.find(class_='red').get_text()
                # The first <p> holds location, experience, and education in one string
                add = li.find('p').get_text()
                need = li.find('p').find('em').get_text()
                company_name = li.find(class_='company-text').find('a').get_text()
                # get_text() prints the tag's text rather than its raw HTML
                tag = li.find(class_='company-text').find('p').get_text()
                print(position, "$$$", salary, "$$$", add, "$$$", need, "$$$", company_name, "$$$", tag)


if __name__ == '__main__':
    b = Boss_Spider()
    b.Parse_pre()
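One thing to watch in the spider: find returns None when nothing matches, so a chained call such as li.find(class_='red').get_text() raises AttributeError as soon as one listing lacks that element. A small defensive helper is one possible way around this. The sketch below is an illustration only; text_or_default is a hypothetical name, not part of BeautifulSoup or the original code, and the sample markup is invented.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, default="", **find_kwargs):
    """Return the stripped text of the first match under parent, or default.

    find_kwargs are passed straight to Tag.find, e.g. class_="red" or name="p".
    """
    node = parent.find(**find_kwargs) if parent is not None else None
    return node.get_text(strip=True) if node is not None else default


# Hypothetical listing item that has a title and salary but no company block.
SAMPLE_ITEM = """
<li>
  <div class="job-title">Backend Developer</div>
  <span class="red">15-30K</span>
</li>
"""

item = BeautifulSoup(SAMPLE_ITEM, "html.parser").find("li")
position = text_or_default(item, class_="job-title")
salary = text_or_default(item, class_="red")
company = text_or_default(item, default="N/A", class_="company-text")
print(position, "$$$", salary, "$$$", company)
```

With a helper like this, a single malformed listing yields a placeholder value instead of aborting the whole crawl.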

Running the script prints output like the following:
后端开发 $$$ 15-30K $$$ 北京朝阳区朝外3-5年本科 $$$ $$$ 米花互动 $$$ 游戏不需要融资20-99人
###########
后端开发工程师 $$$ 35-55K $$$ 北京朝阳区望京经验不限本科 $$$ $$$ 云账户 $$$ 移动互联网C轮100-499人
###########
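In the output above, location, experience, and education run together (北京朝阳区朝外3-5年本科) because get_text() concatenates every text node inside the paragraph. Assuming the fields are separated by empty em elements (the empty fourth column hints at this, but the page markup is not shown in this post), stripped_strings or the separator argument of get_text can keep them apart. The markup below is a hypothetical reconstruction.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three fields in one <p>, split by empty <em> separators.
SAMPLE_P = (
    '<p>北京朝阳区朝外<em class="vline"></em>'
    '3-5年<em class="vline"></em>本科</p>'
)

p = BeautifulSoup(SAMPLE_P, "html.parser").find("p")

# Default behavior: all text nodes are joined with no separator
joined = p.get_text()

# stripped_strings yields each non-empty text node on its own
fields = list(p.stripped_strings)

# Alternatively, ask get_text to insert a separator between text nodes
separated = p.get_text(separator=" | ", strip=True)

print(joined)
print(fields)
print(separated)
```

If the real page uses exactly three text nodes, the list can be unpacked into location, experience, and education variables. Checking its length first is safer, since listings may omit a field.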