How to parse HTML table against a list of variables using lxml?

I am trying to parse an HTML table using lxml. While rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()') fetches all the results, I want to extract a cell's contents only when the cell starts with one of the variables in my config file. For instance, if a <td> starts with 'Street 1', I want to grab the contents of the <span> inside that <td>. This way I can build a tuple of tuples (which takes care of the None values) that I can then store in the database.

lxml_parse.py

import lxml.html as lh

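# Parse the file; the with block closes it once parsing is done
with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Text of every highlighted value span, in document order
rows = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')
print(rows)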

test.htm

<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>

Output:

$ python lxml_parse.py 
['2100 5th Ave', 'Ste 202', 'NYC', 'NY', 'USA', '10022']

Parsing against a list of variables is what I am having problems with:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

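with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Does not work: inside the XPath string, var is read as an element name, not as the Python variable
myresultset = ((var, outhtml.xpath('//tr/td[child::*[text()=var]]/span[@class="boldred"]/text()')) for var in desiredvars)
print(myresultset)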

Aiming to produce this dictionary:

{'City:': 'NYC', 
 'Zip:': '10022', 
 'Street 1:': '2100 5th Ave', 
 'Country:': 'USA', 
 'State:': 'NY', 
 'Street 2:': 'Ste 202'}

You can use this code. It is then easy to query the dictionary to get the values you desire:

import lxml.html as lh

test = '''<tr>
    <td></td>
    <td colspan="2">
        Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span>
    </td>
    <td colspan="2">
        Street 2:<br />
        <span class="boldred">Ste 202</span>
    </td>
</tr>
<tr>
    <td></td>
    <td>
        City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span>
    </td>
    <td>
        State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN>
    </td>
    <td>
        Country:<span class="required"> *</span><br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">USA</SPAN>
    </td>
    <td>
        Zip:<br />
        <span class="boldred">10022</span>
    </td>
</tr>'''

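outhtml = lh.fromstring(test)
# Labels: non-blank text nodes sitting directly inside each <td>
ks = [k.strip() for k in outhtml.xpath('//tr/td/text()') if k.strip()]
# Values: text of each highlighted span, in the same document order
vs = outhtml.xpath('//tr/td/span[@class="boldred"]/text()')

# Pair labels with values by position
result = dict(zip(ks, vs))

print(result)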
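
One caveat with the zip() approach is that it pairs labels and values purely by position, so a single empty value span shifts every later pair. A minimal sketch of a per-cell variant follows, assuming each cell holds one label and at most one boldred span; the shortened SAMPLE markup, its <table> wrapper, and the helper name cells_to_dict are illustrative assumptions, not part of the original answer.

```python
import lxml.html as lh

# Shortened stand-in for test.htm (assumption); wrapped in <table> so the
# fragment parses predictably. The "Street 2" value is deliberately empty.
SAMPLE = """
<table>
<tr>
    <td></td>
    <td>Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span></td>
    <td>Street 2:<br />
        <span class="boldred"></span></td>
</tr>
<tr>
    <td>City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span></td>
</tr>
</table>
"""


def cells_to_dict(root):
    """Map each cell's label to its boldred value, one <td> at a time."""
    result = {}
    # Only visit cells that actually contain a value span
    for td in root.xpath('//tr/td[span[@class="boldred"]]'):
        # The first non-blank text node directly inside the cell is the label
        labels = [t.strip() for t in td.xpath('text()') if t.strip()]
        if not labels:
            continue
        values = td.xpath('span[@class="boldred"]/text()')
        # An empty span yields no text node, so record None instead
        result[labels[0]] = values[0].strip() if values else None
    return result


result = cells_to_dict(lh.fromstring(SAMPLE))
print(result)
```

Because the label and the value are read from the same <td>, an empty field ends up as None rather than borrowing the value of the next cell.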

lxml_tempsofsol.py:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# Build one XPath per label; [0] takes the first matching value
myresultset = ((var, outhtml.xpath('//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)[0]) for var in desiredvars)

for each in myresultset:
    print(each)

Output:

$ python lxml_tempsofsol.py
('Street 1', '2100 5th Ave')
('Street 2', 'Ste 202')
('City', 'NYC')
('State', 'NY')
('Zip', '10022')
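
Note that indexing with [0] raises IndexError as soon as one label has no match. Since the question asks for a tuple of tuples that accounts for None values, here is a minimal sketch of one way to handle that case. The shortened SAMPLE markup, the helper first_or_none, and the 'Fax' label are hypothetical, chosen only to show a missing field.

```python
import lxml.html as lh

# Shortened stand-in for test.htm (assumption), wrapped in <table>
SAMPLE = """
<table>
<tr>
    <td></td>
    <td>Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span></td>
    <td>City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span></td>
</tr>
</table>
"""


def first_or_none(matches):
    """Return the first XPath match, or None when nothing matched."""
    return matches[0] if matches else None


root = lh.fromstring(SAMPLE)

# 'Fax' is a hypothetical label that does not occur in the sample
desiredvars = ['Street 1', 'City', 'Fax']

# Same string-formatted XPath as above, but guarded against empty results
rows = tuple(
    (var, first_or_none(root.xpath(
        '//tr/td[contains(text(), "%s")]/span[@class="boldred"]/text()' % var)))
    for var in desiredvars
)
print(rows)
```

The resulting tuple of tuples keeps one entry per requested label, with None marking labels that were not found.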

I searched for the same thing, found your question, and saw no "right" answer, so I'll add a couple of points:

  • To refer to variables in XPath, use the $var syntax.
  • In lxml, variables are passed as keyword arguments to xpath().
  • Using child::* is wrong because the text you are searching for sits directly within <td/>; text() already selects text child nodes.
  • You need the contains() XPath function because of the whitespace surrounding the label.

Taking those into account, your corrected code looks like this:

import lxml.html as lh

desiredvars = ['Street 1','Street 2','City', 'State', 'Zip']

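with open('test.htm', 'r') as doc:
    outhtml = lh.parse(doc)

# $var in the expression is bound to the Python value through the var= keyword argument;
# each result is a list of matching strings (empty when nothing matches)
myresultset = [(var, outhtml.xpath('//tr/td[contains(text(), $var)]/span[@class="boldred"]/text()', var=var)) for var in desiredvars]
print(myresultset)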
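
The question asks for cells that start with a given label and for a label-to-value mapping, so the following sketch combines the XPath variable binding with starts-with() and normalize-space() in place of contains(). It is one possible variation, not the answerer's code: the shortened SAMPLE markup, its <table> wrapper, and the variable name $label are assumptions made for illustration.

```python
import lxml.html as lh

# Shortened stand-in for test.htm (assumption), wrapped in <table>
SAMPLE = """
<table>
<tr>
    <td></td>
    <td colspan="2">Street 1:<span class="required"> *</span><br />
        <span class="boldred">2100 5th Ave</span></td>
    <td colspan="2">Street 2:<br />
        <span class="boldred">Ste 202</span></td>
</tr>
<tr>
    <td></td>
    <td>City:<span class="required"> *</span><br />
        <span class="boldred">NYC</span></td>
    <td>State:<br />
        <SPAN CLASS="boldred2"></SPAN><br/><SPAN CLASS="boldred">NY</SPAN></td>
    <td>Zip:<br />
        <span class="boldred">10022</span></td>
</tr>
</table>
"""

# normalize-space() trims the whitespace around the cell's first text node,
# so starts-with() can compare it against the bound $label value
EXPR = ('//tr/td[starts-with(normalize-space(text()), $label)]'
        '/span[@class="boldred"]/text()')

root = lh.fromstring(SAMPLE)
desiredvars = ['Street 1', 'Street 2', 'City', 'State', 'Zip']

result = {}
for var in desiredvars:
    # The keyword argument name must match the $label variable in EXPR
    matches = root.xpath(EXPR, label=var)
    result[var] = matches[0] if matches else None

print(result)
```

Binding the label as a variable also means that labels containing quote characters do not need escaping, which would be a concern when the value is formatted into the XPath string directly.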
