Traversing the Document Tree in a Python Crawler

2024-11-13 Technical Tutorial

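In this article we'll look at how to traverse the document tree in a Python crawler using BeautifulSoup. I hope you come away with something useful, so let's dig in!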

Traversing the document tree

1. Direct child nodes: the .contents and .children attributes

.contents

A Tag's .contents attribute returns the tag's direct child nodes as a list.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...
"""

# Create the BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .contents returns the direct children as a list
print(soup.head.contents)
print(soup.head.contents[0])

Output

[<title>The Dormouse's story</title>]
<title>The Dormouse's story</title>
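The head tag above happens to contain only one tag, so it is easy to miss that .contents also includes text nodes. The following is a minimal sketch on a made-up snippet (not the article's sample document), using the built-in "html.parser" instead of lxml so that nothing beyond bs4 needs to be installed. The variable names are for illustration only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: a paragraph mixing plain text and one nested tag.
snippet = "<p>Hello <b>bold</b> world</p>"
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p

# .contents lists every direct child, text nodes included.
kinds = [type(node).__name__ for node in p.contents]
print(kinds)  # ['NavigableString', 'Tag', 'NavigableString']

# Keep only the child tags when the text nodes are not wanted.
child_tags = [node for node in p.contents if isinstance(node, Tag)]
print([tag.name for tag in child_tags])  # ['b']
```

Filtering with isinstance is one simple option; which nodes to keep depends on what the crawler needs.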

.children

.children does not return a list. It returns an iterator, and we can loop over it to get all of the direct child nodes.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...
"""

# Create the BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .children is an iterator object, not a list
print(soup.head.children)

# Loop over it to get every direct child node
for child in soup.head.children:
    print(child)

Output

<list_iterator object at 0x008FF950>
<title>The Dormouse's story</title>
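Because .children is an iterator, it can only be consumed once. A small sketch of that behaviour, again on a made-up snippet with the built-in "html.parser" (both are assumptions for illustration, not part of the original example):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with two sibling tags.
snippet = "<div><span>a</span><span>b</span></div>"
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

children = div.children
first_pass = [child.name for child in children]
second_pass = [child.name for child in children]  # already exhausted

print(first_pass)   # ['span', 'span']
print(second_pass)  # []

# Asking for .children again gives a fresh iterator over the same nodes
# that .contents holds as a list.
same_nodes = list(div.children) == div.contents
print(same_nodes)   # True
```

If the children are needed more than once, using .contents or wrapping .children in list() avoids the surprise.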

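2. All descendant nodes: the .descendants attribute

The .contents and .children attributes covered above contain only a Tag's direct children. The .descendants attribute walks recursively through all of a Tag's descendants: its children, their children, and so on. As with .children, we have to iterate over it to get at its contents.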

#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...
"""

# Create the BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# .descendants is a generator object
print(soup.head.descendants)

# Loop over it to get every descendant node
for child in soup.head.descendants:
    print(child)

Output

<generator object descendants at 0x00519AB0>
<title>The Dormouse's story</title>
The Dormouse's story
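To make the difference between direct children and descendants concrete, here is a rough sketch that compares the two on the same tag. The snippet and the use of "html.parser" are assumptions chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical snippet: one child tag that itself has nested content.
snippet = "<div><p>Hi <b>there</b></p></div>"
soup = BeautifulSoup(snippet, "html.parser")
div = soup.div

direct = list(div.children)     # only the <p> tag
nested = list(div.descendants)  # <p>, then everything inside it

# Label tags by name and text nodes by their text, in traversal order.
labels = [n.name if isinstance(n, Tag) else str(n) for n in nested]

print(len(direct))  # 1
print(labels)       # ['p', 'Hi ', 'b', 'there']
```

The order of labels shows that .descendants visits nodes depth-first, in the order they appear in the document.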

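3. Node content: the .string attribute

If a Tag has exactly one child and that child is a NavigableString, .string returns that child. If a Tag's only child is another Tag, then .string on the outer Tag returns the same result as .string on that single child.

Put simply: if a tag contains no other tags, .string returns the text inside it. If a tag contains exactly one tag and nothing else, .string also returns the text inside that inner tag. For example: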

#!/usr/bin/python3
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
<p class="story">...
"""

# Create the BeautifulSoup object, specifying the lxml parser
soup = BeautifulSoup(html, "lxml")

# head has a single child (title), so both lines print the same text
print(soup.head.string)
print(soup.head.title.string)

Output

The Dormouse's story
The Dormouse's story
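The example above only covers the cases where .string finds something. When a tag has more than one child, .string is None, and get_text() or .strings can be used instead. A short sketch on a made-up snippet, parsed with the built-in "html.parser" (both are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one tag with a single nested tag, one with mixed content.
snippet = "<h1><span>nested</span></h1><p>mixed <b>content</b></p>"
soup = BeautifulSoup(snippet, "html.parser")

# h1 has exactly one child tag, so .string passes through to it.
print(soup.h1.string)  # nested

# p has two children (a text node and <b>), so .string is None.
print(soup.p.string)   # None

# get_text() joins all the text inside the tag; .strings yields each piece.
full_text = soup.p.get_text()
pieces = list(soup.p.strings)
print(full_text)       # mixed content
print(pieces)          # ['mixed ', 'content']
```

Checking for None before using .string is a cheap way to avoid errors on tags with mixed content.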

That covers the basic ways to traverse the document tree in a Python crawler: .contents, .children, .descendants, and .string. Thanks for reading!