
Python: parse specific data from an HTML table using lxml and XPath

First of all, I am new to Python and Stack Overflow, so please be kind.

This is the source code of the HTML page I want to extract data from.

Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page).

  <html>
        <table class="clCommonGrid" cellspacing="0">
                <thead>
                    <tr>
                        <td colspan="3">Kommande matcher</td>
                    </tr>
                    <tr>
                        <th style="width:1%;">Tid</th>
                        <th style="width:69%;">Match</th>
                        <th style="width:30%;">Arena</th>
                    </tr>
                </thead>

                <tbody class="clGrid">

            <tr class="clTrOdd">
                <td nowrap="nowrap" class="no-line-through">
                    <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>



                </td>
                <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
                <td><a href="?scr=venue&amp;faid=847">Guldheden Södra 1 Konstgräs</a> </td>
            </tr>

            <tr class="clTrEven">
                <td nowrap="nowrap" class="no-line-through">
                    <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>



                </td>
                <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
                <td><a href="?scr=venue&amp;faid=941">Romevi 1 Gräs</a> </td>
            </tr>

            <tr class="clTrOdd">
            <td nowrap="nowrap" class="no-line-through">
                <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>



            </td>
            <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
            <td><a href="?scr=venue&amp;faid=912">Kode IP 1 Gräs</a> </td>
        </tr>

        <tr class="clTrEven">
            <td nowrap="nowrap" class="no-line-through">
                <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>



            </td>
            <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
            <td><a href="?scr=venue&amp;faid=218">Flodala IP 1</a> </td>
        </tr>


                </tbody>
        </table>
    </html>

I need to extract the time (19:30) and the team names (Guldhedens IK - IF Warta), meaning the first and the second table cell (not the third), from the first table row, then 13:00 and Romelanda UF - IK Virgo from the second table row, and so on for every row in the table.

As you can see, every table row has a date right before the time, so here comes the tricky part. I only want to get the time and the team names mentioned above from those table rows where the date is equal to the date on which I run this code.

I have not managed much so far. I can only get the date and time from the first row, using this code:

import lxml.html
html = lxml.html.parse("http://gbgfotboll.se/information/?scr=table&ftid=51168")
test = html.xpath("//*[@id='content-primary']/table[3]/tbody/tr[1]/td[1]/span/span//text()")

print(test)

which gives me the result ['2014-09-26', ' 19:30']. After this I'm lost on how to iterate through the table rows and pick out the specific cells where the date matches the date on which I run the code.

I hope you can answer as much as you can.

If I understood you correctly, try something like this:

import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
# Assumes the table has exactly 12 rows; XPath positions are 1-based.
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    # The first cell yields two text nodes: [0] is the date, [1] is the time.
    print(html.xpath(xpath1)[1], html.xpath(xpath2)[0])

I know this is fragile and there are better solutions, but it works. ;)
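The [1] and [0] indices in the snippet above work because the comment inside the inner span splits its text into two separate text nodes. Here is a minimal sketch of that behaviour on an inline copy of one cell from the question. The wrapping div and the CELL name are only for illustration, and nothing is fetched from the site:

```python
import lxml.html

# Inline copy of one time cell's markup, wrapped in a div for parsing.
CELL = (
    '<div><span class="matchTid">'
    '<span>2014-09-26<!-- br ok --> 19:30</span>'
    '</span></div>'
)

cell = lxml.html.fromstring(CELL)

# The comment separates the inner span's text into two text nodes.
nodes = cell.xpath(".//span[@class='matchTid']/span/text()")

# text_content() joins the text nodes and skips the comment,
# so a whitespace split recovers the date and the time.
day, clock = cell.text_content().split()
print(nodes, day, clock)
```

Note that the second node keeps its leading space, so it is worth calling strip() on it before printing or comparing.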

Edit:
A better way, using BeautifulSoup:

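from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(respond.text, "html.parser")
l = soup.find_all('table')
t = l[2].find_all('tr')  # change this to [0] to parse the first table
for i in t:
    try:
        # The last five characters of the span text are the time (HH:MM).
        print(i.find('span').get_text()[-5:], i.find('a').get_text())
    except AttributeError:
        # Header rows have no span or a element, so find() returns None.
        pass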

Edit 2: the page is not responding right now, but this should work:

from bs4 import BeautifulSoup
import requests

respond = requests.get("http://gbgfotboll.se/information/?scr=table&ftid=51168")
soup = BeautifulSoup(respond.text, "html.parser")
l = soup.find_all('table')
t = l[2].find_all('tr')
time = ""  # date part of the previously printed row
for i in t:
    try:
        dateTime = i.find('span').get_text()
        teamName = i.find('a').get_text()
        if time == dateTime[:-5]:
            # Same date as the previous row: print only the time.
            print(dateTime[-5:], teamName)
        else:
            print(dateTime, teamName)
            time = dateTime[:-5]
    except AttributeError:
        # Header rows have no span or a element, so find() returns None.
        pass
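The slices above assume the time is always the last five characters of the cell text. As a hedged alternative, the text can be split on whitespace and the rows filtered by date in plain Python. In this sketch, split_date_time and matches_on are hypothetical helper names, and the rows are copied from the table in the question rather than scraped:

```python
def split_date_time(cell_text):
    """Split 'YYYY-MM-DD HH:MM' cell text into (date, time) strings."""
    date_part, time_part = cell_text.split()
    return date_part, time_part


def matches_on(rows, wanted_date):
    """Return (time, teams) for every row whose date equals wanted_date."""
    result = []
    for cell_text, teams in rows:
        date_part, time_part = split_date_time(cell_text)
        if date_part == wanted_date:
            result.append((time_part, teams.strip()))
    return result


# Sample rows copied from the table in the question (not fetched live).
rows = [
    ("2014-09-26 19:30", "Guldhedens IK - IF Warta"),
    ("2014-09-26 13:00", "Romelanda UF - IK Virgo"),
    ("2014-09-27 14:00", "Floda BoIF - Partille IF FK "),
]
selected = matches_on(rows, "2014-09-27")
print(selected)
```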

lxml:

import lxml.html
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
dateTemp = ""  # date of the previously printed row
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == dateTemp:
        # Same date as the previous row: print only the time.
        print(time, teamName)
    else:
        print(date, time, teamName)
        dateTemp = date  # remember the date for the next comparison
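The intent of the loop above is to print the date only when it differs from the previous row's date. That can also be expressed with itertools.groupby from the standard library, which groups consecutive items sharing a key. This is a sketch on sample tuples copied from the question, and format_grouped is a hypothetical helper name:

```python
from itertools import groupby

# (date, time, teams) tuples copied from the table in the question.
matches = [
    ("2014-09-26", "19:30", "Guldhedens IK - IF Warta"),
    ("2014-09-26", "13:00", "Romelanda UF - IK Virgo"),
    ("2014-09-27", "13:00", "Kode IF - IK Kongahälla"),
    ("2014-09-27", "14:00", "Floda BoIF - Partille IF FK"),
]


def format_grouped(matches):
    """Build output lines, showing the date only on the first row of each day."""
    lines = []
    # groupby only merges consecutive rows, which mirrors the
    # "compare with the previous date" logic of the loop.
    for day, group in groupby(matches, key=lambda m: m[0]):
        for index, (_, clock, teams) in enumerate(group):
            prefix = day + " " if index == 0 else ""
            lines.append(prefix + clock + " " + teams)
    return lines


lines = format_grouped(matches)
for line in lines:
    print(line)
```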

So thanks to @CodeNinja's help, I just tweaked it a little to get exactly what I wanted. I imported time to get the date on which I run the code. Anyway, here is the code for what I wanted. Thank you for the help!

import lxml.html
import time
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
# Today's date in the same YYYY-MM-DD form the table uses.
currentDate = time.strftime("%Y-%m-%d")
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    # Named matchTime so the variable does not shadow the time module.
    matchTime = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == currentDate:
        print(matchTime, teamName)
So here is the FINAL version of how to do it the correct way. It parses all the table rows without using "range" in the for loop. I got this answer from my other post here: Iterate through all the rows in a table using python lxml xpath

import lxml.html
from lxml.etree import XPath
url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

# Select only the rows whose first cell contains a text node equal to the date.
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
# The second text node of the cell is the time; the first is the date.
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print(time, team)
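The version above interpolates the date into the XPath string with %. lxml's compiled XPath objects also accept XPath variables such as $date, passed as keyword arguments when the expression is called, which avoids rebuilding the expression for each date. Here is a sketch on an abridged inline copy of the table. SAMPLE and matches_on are names made up for this example, and the table is located by its class because the fragment has no content-primary container:

```python
import lxml.html
from lxml.etree import XPath

# Abridged inline copy of the table from the question (not fetched live).
SAMPLE = """
<table class="clCommonGrid">
  <tbody class="clGrid">
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
    </tr>
    <tr class="clTrEven">
      <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
    </tr>
    <tr class="clTrOdd">
      <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span></td>
      <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
    </tr>
  </tbody>
</table>
"""

# $date is an XPath variable, bound when the compiled expression is called.
rows_xpath = XPath(
    "//table[@class='clCommonGrid']/tbody/tr[td[1]/span/span/text()[1] = $date]"
)
time_xpath = XPath("td[1]/span/span/text()[2]")
team_xpath = XPath("td[2]/a/text()")


def matches_on(root, date):
    """Return (time, teams) pairs for the rows whose date equals `date`."""
    return [
        (time_xpath(row)[0].strip(), team_xpath(row)[0].strip())
        for row in rows_xpath(root, date=date)
    ]


root = lxml.html.fromstring(SAMPLE)
matches = matches_on(root, "2014-09-26")
for clock, teams in matches:
    print(clock, teams)
```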
