BeautifulSoup library (Chinese edition)
Purpose
Extract data from HTML and XML documents
Modify, navigate, and search a document
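Navigation and modification from the list above can be illustrated with a minimal sketch. The markup is a hypothetical sample, not part of these notes, and "html.parser" (Python's built-in parser) is chosen here only so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = '<div id="box"><p class="intro">Hello</p><p>World</p></div>'

# "html.parser" ships with Python; other parsers (lxml, html5lib) are optional.
soup = BeautifulSoup(html, "html.parser")

# Navigate: step from a tag up to its parent and across to its next sibling.
first_p = soup.p
parent_id = first_p.parent["id"]
second_p = first_p.next_sibling

# Modify: replace the tag's text and overwrite one of its attributes.
first_p.string = "Hi"
first_p["class"] = "greeting"

print(parent_id)
print(second_p.get_text())
print(soup)
```

Changes made through the tag objects are reflected in the tree itself, so printing soup afterwards shows the edited markup.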
Create html_doc
>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a href="" class="sister" id="link1">Elsie</a>,
... <a href="" class="sister" id="link2">Lacie</a> and
... <a href="" class="sister" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
# Use the bs4 library
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, "html.parser")  # name the parser explicitly to avoid a warning
>>> print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="" id="link3">
    Tillie
   </a>
   ; and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
Extract the fields you need
>>> soup.title  # the <title> tag
<title>The Dormouse's story</title>
>>> soup.title.name
'title'
>>> soup.title.string  # the text inside <title>
"The Dormouse's story"
>>> soup.a  # the first <a> tag in the document
<a class="sister" href="" id="link1">Elsie</a>
>>> soup.p  # the first <p> tag in the document
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p['class']  # class is multi-valued, so a list is returned
['title']
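Attribute and text access can be sketched a little further. The snippet below is only an illustration: the markup and the "/elsie" href are hypothetical placeholders, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the href value is a placeholder.
html = '<p class="story">Name: <a href="/elsie" class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a

# .attrs exposes every attribute of the tag as a dict.
attrs = link.attrs

# .get() returns None for a missing attribute, whereas link["title"] would raise KeyError.
missing = link.get("title")

# .string is None when a tag has more than one child; get_text() joins all nested text.
paragraph_string = soup.p.string
paragraph_text = soup.p.get_text()

print(attrs)
print(missing)
print(paragraph_string)
print(paragraph_text)
```

Using get_text() is usually the safer choice when a tag may contain nested tags, since .string only works for a tag with a single child.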
Find <a> tags
>>> soup.find_all('a')
[<a class="sister" href="" id="link1">Elsie</a>, <a class="sister" href="" id="link2">Lacie</a>, <a class="sister" href="" id="link3">Tillie</a>]
>>> for link in soup.find_all('a'):
...     print(link.get('href'))  # print each link's href attribute
...
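The search above can be narrowed with filters. The following is a minimal sketch under stated assumptions: the markup, the href paths, and the "nav" class are hypothetical, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; hrefs and the "nav" class are placeholders.
html = """
<p class="story">
<a href="/elsie" class="sister" id="link1">Elsie</a>
<a href="/lacie" class="sister" id="link2">Lacie</a>
<a href="/about" class="nav" id="about">About</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect every href into a list instead of printing inside a loop.
hrefs = [a.get("href") for a in soup.find_all("a")]

# class_ (with a trailing underscore, since "class" is a Python keyword) filters by CSS class.
sisters = [a.get_text() for a in soup.find_all("a", class_="sister")]

# find() returns only the first match, or None when nothing matches.
second = soup.find(id="link2")

# limit stops the search after the given number of matches.
first_only = soup.find_all("a", limit=1)

print(hrefs)
print(sisters)
print(second.get_text())
print(len(first_only))
```

Because find() can return None, code that runs on real pages should check the result before reading attributes from it.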