
Extracting Text from Parsed HTML with Python

I'm new to Python and I have been trying to use regular expressions to search through HTML that has been parsed with BeautifulSoup. I haven't had any success, and I think the reason is that I don't completely understand how to set up the regular expressions properly. I've looked at older questions about similar problems but I still haven't figured it out. If somebody could extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" from the snippet below, along with a detailed explanation of how the regular expression works, it would be really helpful.

<td class="name">
  <a href="/torrent/32726/0/">
   Slackware Linux 13.0 [x86 DVD ISO]
  </a>
 </td>

Edit: What I meant to say is, I am trying to extract "/torrent/32726/0/" and "Slackware Linux 13.0 [x86 DVD ISO]" using BeautifulSoup's functions to search the parse tree. I've been trying various things after searching and reading the documentation, but I'm still not sure how to go about it.

BeautifulSoup can also extract node values from your HTML directly, without regular expressions.

from BeautifulSoup import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body>'
        '<table><tr>'
        '<td class="name"><a href="/torrent/32726/0/">Slackware Linux 13.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32727/0/">Slackware Linux 14.0 [x86 DVD ISO]</a></td>'
        '<td class="name"><a href="/torrent/32728/0/">Slackware Linux 15.0 [x86 DVD ISO]</a></td>'
        '</tr></table>'
        '</body>'
        '</html>')

soup = BeautifulSoup(html)
# Grab the <a> tag inside every <td class="name"> cell
links = [td.find('a') for td in soup.findAll('td', {"class": "name"})]
for link in links:
    print link.string

Output:

Slackware Linux 13.0 [x86 DVD ISO]  
Slackware Linux 14.0 [x86 DVD ISO]  
Slackware Linux 15.0 [x86 DVD ISO]  
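The question also asks for the href value; the same Tag objects expose their attributes by key, so the continuation below (reusing links from the snippet above) is one way to print both pieces, a minimal sketch rather than the only approach:

for link in links:
    print link['href']    # e.g. /torrent/32726/0/
    print link.string     # e.g. Slackware Linux 13.0 [x86 DVD ISO]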

You could use lxml.html to parse the HTML document:

from lxml import html

# parse() returns an ElementTree; getroot() gives the root HtmlElement,
# which supports cssselect()
doc = html.parse('http://example.com').getroot()

for a in doc.cssselect('td a'):
    print a.get('href')
    print a.text_content()

You will have to look at how the document is structured to find the best way of determining the links you want (there might be other tables with links in them that you do not need, etc.): you might, for instance, first want to find the right table element. There are also options besides CSS selectors (XPath, for example) for searching the document or an element, as sketched below.
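For example, an XPath expression can restrict the search to cells with class="name", which matches the markup in the question (a minimal sketch; http://example.com stands in for the real page, as in the snippet above):

from lxml import html

doc = html.parse('http://example.com').getroot()

# Only anchors that sit inside a <td class="name"> cell
for a in doc.xpath('//td[@class="name"]/a'):
    print a.get('href')              # /torrent/32726/0/
    print a.text_content().strip()   # Slackware Linux 13.0 [x86 DVD ISO]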

If you need to, you can turn the links into absolute links with the .make_links_absolute() method (do it on the document after parsing, and all the URLs will be absolute, which is very convenient).
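A minimal sketch of that, again using the placeholder URL from above; when the document was parsed straight from a URL the base can usually be inferred, but passing it explicitly is unambiguous:

from lxml import html

doc = html.parse('http://example.com').getroot()

# Rewrite every link in the tree to an absolute URL based on the given base
doc.make_links_absolute('http://example.com')

for a in doc.cssselect('td a'):
    print a.get('href')   # e.g. http://example.com/torrent/32726/0/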
