Extract data from HTML

I have an HTML document with the following structure:

<!DOCTYPE html>
<html>
<body>

<p>One</p>
<p>Two</p>
<p>Three</p>

</body>
</html>

Can you recommend a Python module that would let me write something like this:

var = ModuleName.html.body.p2
print(var)
Two

BeautifulSoup gets quite close to what you are asking for:

from bs4 import BeautifulSoup

# data is a string holding the HTML document from the question
soup = BeautifulSoup(data, "html.parser")

print(soup.html.body("p")[1].text)  # prints Two

In other words, attribute access with a dot is a shortcut for find(), and calling a tag with parentheses is a shortcut for find_all().
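
To make the two shortcuts concrete, here is a small sketch that puts the shorthand next to the explicit find() and find_all() calls. The variable names are only illustrative, and the HTML string is the document from the question:

```python
from bs4 import BeautifulSoup

data = "<!DOCTYPE html><html><body><p>One</p><p>Two</p><p>Three</p></body></html>"
soup = BeautifulSoup(data, "html.parser")

# Dot access is shorthand for find(): it returns the first matching tag.
first_short = soup.body.p
first_long = soup.find("body").find("p")

# Calling a tag is shorthand for find_all(): it returns every matching tag.
all_short = soup.body("p")
all_long = soup.body.find_all("p")

print(first_short.text)             # text of the first <p>
print([p.text for p in all_short])  # text of every <p>
print(all_short[1].text)            # text of the second <p>
```

Since find_all() returns a list-like result, the second paragraph is at index 1, which is why the answer above uses [1].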

I would recommend using BeautifulSoup to parse your HTML and extracting the content you want with CSS selectors.

You can find an example very similar to what you want to do in the documentation: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Edit: Here is a snippet of code, since the documentation has a typo and omits the ":" in the selector string.

from bs4 import BeautifulSoup

data = "<!DOCTYPE html> <html> <body><p>One</p><p>Two</p><p>Three</p></body></html>"

soup = BeautifulSoup(data, 'html.parser')
# select() returns a list of matching tags
print(soup.body.select("p:nth-of-type(2)"))  # prints [<p>Two</p>]
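
If you want the bare string Two rather than a list of tags, one option is select_one() combined with get_text(). This is a minimal sketch under the same assumptions as the snippet above; the names second and text are only illustrative:

```python
from bs4 import BeautifulSoup

data = "<!DOCTYPE html> <html> <body><p>One</p><p>Two</p><p>Three</p></body></html>"
soup = BeautifulSoup(data, "html.parser")

# select_one() returns the first tag matching the selector, or None if nothing matches.
second = soup.select_one("body > p:nth-of-type(2)")

# Guard against a missing element before reading its text.
text = second.get_text() if second is not None else None
print(text)  # prints Two
```

The None check matters when the document may have fewer than two paragraphs, because calling get_text() on None would raise an AttributeError.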
