Parsing HTML table with LXML in Python

I need to parse an HTML table with the following structure:

<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
 <tbody>
   <tr width="620">
     <th width="620">Smth1</th>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth2</td>
     ...
   </tr>
   <tr bgcolor="E4E4E4" width="620">
     <td width="620">Smth3</td>
     ...
   </tr>
   <tr bgcolor="ffffff" width="620">
     <td width="620">Smth4</td>
     ...
   </tr>
 </tbody>
</table>

Python code:

r = requests.post(url,data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
# xpath1 was copied from Firebug
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

But I get this on the third line:

IndexError: list index out of range

The goal is to build a Python dict from this table. The number of rows can vary.

UPD. I changed how I fetch the HTML to rule out problems with the requests library. Now it is parsed straight from a URL:

html = lxml.html.parse(test_url)

This shows that the HTML itself is fine:

lxml.html.open_in_browser(html)

But still the same problem:

rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

Here is the xpath1:

'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'

UPD2. Experimenting showed that the XPath stops matching (it returns an empty list rather than crashing) at this point:

xpath1 = '/html/body/table/tbody'
print(html.xpath(xpath1))
# prints []

If xpath1 is shorter, it seems to work: for xpath1 = '/html/body/table' it returns [<Element table at 0x2cbadb0>].

You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work:

xpath1 = "tbody/tr"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])

Note that this produces a list of one-item lists, like this:

[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]

To get a flat list of the values instead, you can use this code:

xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)

This all assumes that r.text is exactly the markup you posted above.
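Since the question asks for a dict, here is a minimal sketch of one way to get there. It assumes the first row holds the column headers and every later row is one record. The sample markup, the column names, and the helper name table_to_records are all made up for illustration, because the real cells are elided in the question.

```python
import lxml.html

# Hypothetical two-column table; the real page's columns are not shown in the question.
SAMPLE = """
<table class="table1">
  <tr><th>Name</th><th>Value</th></tr>
  <tr><td>a</td><td>1</td></tr>
  <tr><td>b</td><td>2</td></tr>
</table>
"""

def table_to_records(markup):
    """Return one dict per data row, keyed by the header row's cell text."""
    table = lxml.html.fromstring(markup)
    # ".//tr" matches rows whether or not a <tbody> wrapper is present.
    rows = [
        [cell.text_content().strip() for cell in tr.xpath("./th | ./td")]
        for tr in table.xpath(".//tr")
    ]
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in body]

records = table_to_records(SAMPLE)
print(records)
```

Using text_content() instead of .text also picks up text nested inside child tags such as <b> or <a>, which .text alone would miss. The number of rows does not matter, since each data row simply becomes one more dict.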

Your .xpath(xpath1) expression did not match any elements, so indexing the empty result with [0] raises the IndexError. Check that expression for errors.
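One likely cause, given that the path was copied from Firebug: browsers add a <tbody> element to the DOM even when the page source has none, while lxml parses the source as written and adds nothing. The sketch below assumes that is what is happening here. The PAGE string is a made-up stand-in for the real page, which the question does not show in full.

```python
import lxml.html

# Hypothetical page source: note there is no <tbody> in the raw markup.
PAGE = """
<html><body>
<table class="table1">
  <tr><th>Smth1</th></tr>
  <tr><td>Smth2</td></tr>
</table>
</body></html>
"""

doc = lxml.html.document_fromstring(PAGE)

# A browser-derived path that insists on tbody finds nothing in lxml's tree.
with_tbody = doc.xpath("/html/body/table/tbody")

# Select the table by its class and take all descendant rows instead,
# which works with or without a tbody wrapper.
tables = doc.xpath('//table[@class="table1"]')
rows = tables[0].xpath(".//tr") if tables else []
data = [[cell.text_content().strip() for cell in row] for row in rows]

print(with_tbody)
print(data)
```

Checking that the result list is non-empty before indexing it avoids the IndexError, and dropping the tbody steps from the long Firebug path (or replacing them with //) is usually enough to make it match.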
