Python web crawlers commonly rely on four parsing helpers: re (regular expressions), lxml etree XPath, Scrapy XPath, and BeautifulSoup. (etree XPath and Scrapy XPath differ considerably in usage, which is why they are not grouped as one.) This post covers a little-known pitfall in BeautifulSoup, shown in the examples below.

Example 1 (the sample looks unusual because the page text is Khmer; please bear with it):

```python
from bs4 import BeautifulSoup

content = """
<html>
 <body>
  <div class="td-post-content td-pb-padding-side">
   <p>
    <img alt="" class="alignnone size-full wp-image-122426" data-recalc-dims="1" height="352" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching.jpg?resize=630%2C352&amp;ssl=1" width="630"/>
   </p>
   <p>
    <img alt="" class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
   </p>
   <p>
    ចំណែកឯប្រេងដូងវិញ មានផ្ទុកអាស៊ីតខ្លាញ់អូមេហ្គា៣ ដែលល្អបំផុតសម្រាប់បំផ្លាញ់មីក្រុបដែលមានវត្តមាននៅក្នុងតំបន់រន្ធគូថ ហេតុនេះហើយទើបការឆ្លងមេរោគ និងរមាស់ត្រូវបានទប់ស្កាត់។
   </p>
   <p>
    <img alt="" class="alignnone size-full wp-image-122427" data-recalc-dims="1" height="473" src="https://i1.wp.com/img.postnews.com.kh/2017/01/Anal-Itching1.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
   </p>
   <p>
    <img alt="" class="alignnone size-full wp-image-122428" data-recalc-dims="1" height="473" src="https://i2.wp.com/img.postnews.com.kh/2017/01/Anal-Itching2.jpg?resize=630%2C473&amp;ssl=1" width="630"/>
    <br/>
    <em>
     <br/>
     ចំណាំ៖
    </em>
    ប្រសិនបើអ្នករមាស់ខ្លាំង មានការឈឺចាប់ ហើយមានឈាមហូរទៀតនោះ ត្រូវប្រញាប់ទៅជួបជាមួយគ្រូពេទ្យភ្លាម៕
   </p>
  </div>
 </body>
</html>"""

soup = BeautifulSoup(content, "html.parser")  # explicit parser avoids the "no parser specified" warning

img_lst = []
inner_src_list = soup.find_all('img', src=True)
for i, src in enumerate(inner_src_list):
    # restore the HTML-escaped ampersand that the parser stripped
    url = src["src"].replace("&ssl", "&amp;ssl")
    print(url)

print(soup.prettify())
# content = soup.prettify()  # re-parsing the prettified markup prints the same src values

img_tags = soup.find_all('img')
for img in img_tags:
    print(img['src'])
```

The console output (three screenshots in the original post) shows every raw `img['src']` value printed with `&ssl=1` rather than `&amp;ssl=1`.

How can that be? Where did the `amp;` in the text go?

Explanation: when BeautifulSoup extracts `src`, it automatically unescapes the HTML entity `&amp;` back to `&`. (When parsing web pages, the raw markup you see is not always what the parser hands back. And this is not unique to BeautifulSoup; lxml etree XPath and Scrapy XPath behave the same way.)

Example 2 (same HTML text as above):

```python
soup = BeautifulSoup(content, "html.parser")
img_lst = []

inner_src_list = soup.find_all('img', src=True)            # compare with the call below
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&ssl", "&amp;ssl")
    print(url)

inner_src_list = soup.find_all('img', attr={'src': True})  # compare with the call above
for i, src in enumerate(inner_src_list):
    url = src["src"].replace("&ssl", "&amp;ssl")
    print(url)
```

The output is not reproduced here; the behaviour is simply this: the first loop prints the URLs normally, while the second prints nothing. Why?

Explanation: in the first `find_all`, the keyword argument `src=True` is a filter meaning "any `<img>` tag that has a `src` attribute". In the second `find_all`, the keyword is `attr`, which is not the `attrs` parameter; BeautifulSoup treats an unknown keyword as a filter on an attribute literally named `attr`, and since no `<img>` tag carries an `attr` attribute, nothing matches and the result is empty. Spelled correctly, `find_all('img', attrs={'src': True})` behaves the same as `src=True`.

If anything above is incorrect, corrections are welcome. Thanks!
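To reproduce both behaviours quickly, here is a minimal sketch (assuming bs4 4.x with the standard-library html.parser; the tiny HTML string below is a stand-in for the Khmer sample above, and the expected values are noted in the comments):

```python
from bs4 import BeautifulSoup

# Tiny stand-in document: one <img> whose src contains an HTML-escaped ampersand.
html = '<p><img src="https://example.com/pic.jpg?resize=630%2C352&amp;ssl=1"/></p>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
# The parser has already unescaped &amp; to & in the attribute value.
print(img["src"])                                       # ...&ssl=1
# Re-escape it if the downstream consumer needs the original entity form.
print(img["src"].replace("&ssl", "&amp;ssl"))           # ...&amp;ssl=1

# Keyword filter: "has a src attribute" -> matches the tag.
print(len(soup.find_all("img", src=True)))              # 1
# An attrs dict with a True value behaves the same way.
print(len(soup.find_all("img", attrs={"src": True})))   # 1
# A misspelled keyword (attr instead of attrs) is treated as a filter on an
# attribute literally named "attr"; no tag has one, so nothing matches.
print(len(soup.find_all("img", attr={"src": True})))    # 0
```

The same three calls can be pointed at the `content` string from Example 1; the counts should simply become 4, 4, and 0.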