简体   繁体   English

使用LXML解析HTML数据

[英]Parsing Html data using LXML

<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul>

I want only the Q's . 我只想要Q的。 I tried this 我试过了

doc=lh.fromstring(resp.read())  
for id in doc.cssselect('div.mod-content' ):
    print id.text_content()

This gives me the q's but it also gives me other details on the page with class mod-content. 这给了我q,但也给了我页面上其他类为mod-content的细节。 How do i specifically get only the q's. 我怎么只得到q的。

I am using lxml. 我正在使用lxml。

<div id="peoplemodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">People</h3>
    </div>
    <div class="mod-content">
        <ul class="item-details" id="peopledetails">
            <li class="people-details">
                                <dl>
                    <dt>Assignee:</dt>
                    <dd id="Assign-Val">
                                <a class="user-hover" rel="605794069" id="issue_summary_assignee_605794069" href="--------------"> AAAAAAAAAAAAA a>
                    </dd>
                </dl>
                                                <dl>
                    <dt>Reporter:</dt>
                    <dd id="Report-Val">
                                <a class="user-hover" rel="700843051" id="issue_summary_reporter_700843051" href="-------------------------">BBBBBBBBBBBBBB</a>
                    </dd>
                </dl>
                                <dl><dt>&nbsp;</dt><dd>&nbsp;</dd></dl>
                                <dl>
                    <dt title="Multiple Assignees">Multiple Assignees:</dt>
                    <dd id="customfield_10020-val">    <div class="shorten" id="customfield_10020-field">
                                    <span class="tinylink">        <a class="user-hover" rel="604810609" id="multiuser_cf_604810609" href------------------">FFFFFFFFFFFFFF</a></span>,                                                 <span class="tinylink">        <a class="user-hover" rel="600548483" id="multiuser_cf_600548483" href="------------------------------------">EEEEEEEEEEEEEEEEE</a></span>                        </div>
</dd>
                </dl>
                            </li>
        </ul>
                        <div id="watchers-val">
                                                <a href="----------------------------------------" id="watching-toggle" rel="858270" title="Start watching this story"><span class="icon icon-watch-off"></span><span class="action-text">Watch</span></a>


                            (<span id="watcher-data">1</span>)
                    </div>
            </div>
</div>

First off: if you are parsing HTML there is a high chance humans will have messed up with it and it won't validate correctly. 首先,如果您正在解析HTML,那么人们很可能会搞乱它,并且它不会正确验证。 For example this is the case for the example you posted (there are a couple of </div> missing...). 例如,您发布的示例就是这种情况(缺少</div> ……)。 Consider passing to beautifulsoup instead, which is specifically designed to accommodate for these kind of errors. 考虑改用beautifulsoup ,它是专门为解决此类错误而设计的。

That said, if your question is just about how to extract the "textual part of the HTML", or in other words how to convert HTML → plain text [as opposed to "extracting only the text contained in specific HTML containers], this is a minimal working example: 就是说,如果您的问题只是关于如何提取“ HTML的文本部分”,或者换句话说如何转换HTML→纯文本(而不是“仅提取特定HTML容器中包含的文本”),这就是一个最小的工作示例:

from lxml import etree

content = '''<div id="descriptionmodule" class="module toggle-wrap">
    <div class="mod-header">
        <h3 class="toggle-title">Description</h3>
    </div>
    <div id="issue-description" class="mod-content">
        <p>qqqqqqqqqqqqq,<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq<br/>
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</p>

<p>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.</p>

<ul class="alternate" type="square">
    <li>qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq</li>
</ul></div></div>'''

tree = etree.fromstring(content)

for bit in tree.xpath('//text()'):
    if bit.strip():  # you can insert any kind of test here
        print bit

It outputs: 它输出:

Description
qqqqqqqqqqqqq,

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq.
qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq

HTH! HTH!

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM