Parse HTML with Python and lxml.html
I'm creating a Python scraper at scraperwiki.com. I need to parse part of an HTML page that contains the following code:
<div class="div_class">
<h3>I'm a title. Don't touch me</h3>
<ul>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
</ul>
</div>
I want to parse only the "I'm a title. Parse me" titles. Here is how I'm doing it:
import scraperwiki
import lxml.html
import re
import datetime
#.......................
raw_string = lxml.html.fromstring(scraperwiki.scrape(url_to_scrape))
raw_html = raw_string.cssselect("div.div_class ul > li")
for item in raw_html:
    print(item.text_content())
It does work, but it captures all the data inside each ul. I don't want that; I want to find only "I'm a title. Parse me" in each ul and that's it.
How can I do it?
The beauty of lxml is that you can use both CSS selectors and XPath to find any element on the page.
In your case, since you have nested <ul> lists, it's better to use XPath for navigation:
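# find every <li> that is a direct child of the <ul> directly under the div with class div_class
raw_html = raw_string.xpath("//div[@class='div_class']/ul/li")
for item in raw_html:
    # .text is only the text before the first child element, so the nested <ul> is skipped
    print(item.text.strip())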
prints:
I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
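To see why this works, here is a minimal self-contained sketch that can be run without scraperwiki. The inline SAMPLE_HTML string and the helper name extract_titles are made up for illustration; only lxml.html.fromstring, xpath, .text and text_content() come from lxml itself. It contrasts .text (the text before the first child element) with text_content() (all descendant text):

```python
import lxml.html

# Hypothetical sample that mirrors the structure shown in the question.
SAMPLE_HTML = """
<div class="div_class">
  <h3>I'm a title. Don't touch me</h3>
  <ul>
    <li>
      I'm a title. Parse me
      <ul>
        <li>nested item</li>
        <li>nested item</li>
      </ul>
    </li>
    <li>
      I'm a title. Parse me too
      <ul>
        <li>nested item</li>
      </ul>
    </li>
  </ul>
</div>
"""


def extract_titles(html):
    """Return the leading text of each top-level <li> (illustrative helper)."""
    root = lxml.html.fromstring(html)
    items = root.xpath("//div[@class='div_class']/ul/li")
    # .text can be None when an element starts with a child tag, hence the `or ""`.
    return [(item.text or "").strip() for item in items]


titles = extract_titles(SAMPLE_HTML)
print(titles)

# For comparison: text_content() also pulls in the nested list items.
first_li = lxml.html.fromstring(SAMPLE_HTML).xpath("//div[@class='div_class']/ul/li")[0]
full_text = first_li.text_content()
```

The `or ""` guard is a defensive choice: if an <li> begins directly with a tag, .text is None and calling .strip() on it would raise an AttributeError.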
Here is a brief explanation of XPath in lxml: http://lxml.de/tutorial.html#using-xpath-to-find-text
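As a possible alternative, XPath's text() node test can select the text nodes directly instead of the <li> elements. This is only a sketch under the same assumptions as the question's markup; the sample string and the name direct_titles are hypothetical. Note that text() also returns whitespace-only nodes (for example, the whitespace after a nested </ul>), so they are filtered out:

```python
import lxml.html

# Hypothetical sample with the same nesting as the page in the question.
html = """
<div class="div_class">
  <ul>
    <li>
      I'm a title. Parse me
      <ul>
        <li>fdfdsfd</li>
      </ul>
    </li>
    <li>
      I'm a title. Parse me
      <ul>
        <li>fdfdsfd</li>
      </ul>
    </li>
  </ul>
</div>
"""

root = lxml.html.fromstring(html)

# text() selects only text nodes that are direct children of the top-level <li>,
# so text inside the nested <ul> is never matched.
text_nodes = root.xpath("//div[@class='div_class']/ul/li/text()")

# Drop whitespace-only nodes and trim the rest.
direct_titles = [t.strip() for t in text_nodes if t.strip()]
print(direct_titles)
```

One caveat: @class='div_class' matches only when the attribute is exactly that string; a div with several classes would need a different predicate.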