
Parse HTML with Python and lxml.html

I'm creating a Python scraper at scraperwiki.com. I need to parse a part of an HTML page that contains the following code:

<div class="div_class">
    <h3>I'm a title. Don't touch me</h3>
    <ul>
        <li>
        I'm a title. Parse me
            <ul>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
            </ul>
        </li>
        <li>
        I'm a title. Parse me
        <ul>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
            </ul>
        </li>
        <li>
        I'm a title. Parse me
        <ul>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
            </ul>
        </li>
        <li>
        I'm a title. Parse me
        <ul>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
            </ul>
        </li>
    </ul>
</div>

I want to parse only the "I'm a title. Parse me" titles. Here is how I'm doing it:

import scraperwiki
import lxml.html
import re
import datetime
#.......................
raw_string = lxml.html.fromstring(scraperwiki.scrape(url_to_scrape))
raw_html = raw_string.cssselect("div.div_class ul > li")
for item in raw_html:
    print(item.text_content())

It does work, but it captures all the data inside each ul. I don't want that. I want to get only "I'm a title. Parse me" from each top-level li and nothing else.

How can I do it?

The beauty of lxml is that you can use both CSS selectors and XPath to find any element on the page.

In your case, since you have nested <ul> lists, it's better to use XPath for navigation, because it lets you select only the <li> elements that are direct children of the outer <ul>:

# find every <li> in the <ul> under div with class div_class
raw_html = raw_string.xpath("//div[@class='div_class']/ul/li")
for item in raw_html:
    print(item.text.strip())

This prints:

I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
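
To make the difference between item.text and item.text_content() concrete, here is a minimal self-contained sketch. It assumes a shortened inline copy of the markup from the question (the SAMPLE_HTML string is a stand-in for the scraped page, not the real one), so it runs without scraperwiki. The names titles and full_texts are only illustrative.

```python
import lxml.html

# Hypothetical inline sample standing in for the scraped page.
SAMPLE_HTML = """
<div class="div_class">
    <h3>I'm a title. Don't touch me</h3>
    <ul>
        <li>
        I'm a title. Parse me
            <ul>
                <li>fdfdsfd</li>
                <li>fdfdsfd</li>
            </ul>
        </li>
        <li>
        I'm a title. Parse me
            <ul>
                <li>fdfdsfd</li>
            </ul>
        </li>
    </ul>
</div>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# Only <li> elements that are direct children of the outer <ul>.
items = root.xpath("//div[@class='div_class']/ul/li")

# .text is only the text that appears before the first child element.
titles = [(item.text or "").strip() for item in items]

# text_content() also includes the text of every descendant element.
full_texts = [" ".join(item.text_content().split()) for item in items]

print(titles)
print(full_texts)
```

Here item.text stops at the nested <ul>, which is why it yields only the title, while text_content() walks into the nested list as well. The `or ""` guard is a small precaution for an <li> that starts directly with a child element, where .text would be None.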

Here is a brief explanation of XPath in lxml: http://lxml.de/tutorial.html#using-xpath-to-find-text
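
As a variation along the lines of that tutorial section, the text nodes can also be selected directly with the XPath text() function instead of reading .text from each element. The following is a sketch under the same assumption as before: SAMPLE_HTML is a made-up, shortened copy of the markup, not the real page. Because the whitespace after each nested <ul> is also a text node of the outer <li>, blank strings are filtered out.

```python
import lxml.html

# Hypothetical inline sample standing in for the scraped page.
SAMPLE_HTML = """
<div class="div_class">
    <h3>I'm a title. Don't touch me</h3>
    <ul>
        <li>
        I'm a title. Parse me
            <ul>
                <li>fdfdsfd</li>
            </ul>
        </li>
        <li>
        I'm a title. Parse me
            <ul>
                <li>fdfdsfd</li>
            </ul>
        </li>
    </ul>
</div>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# text() selects the text nodes that are direct children of each outer <li>.
text_nodes = root.xpath("//div[@class='div_class']/ul/li/text()")

# Drop whitespace-only nodes (for example, the text after the nested <ul>).
titles = [node.strip() for node in text_nodes if node.strip()]

print(titles)
```

Whether this reads better than looping over the elements is mostly a matter of taste. The element-based version is handier if other attributes of each <li> are needed later.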
