
Extracting information from a table on a website using python, LXML & XPATH

After a lot of hard work, I managed to extract some information that I needed from a table on this website:

http://gbgfotboll.se/serier/?scr=table&ftid=57108

From the table "Kommande Matcher" (the second table) I managed to extract the date and the team names.

But now I am totally stuck trying to extract from the first table:

  • The first column, "Lag"

  • The second column, "S"

  • The 6th column, "GM-IM"

  • The last column, "P"

Any ideas? Thanks!

I've just done it:

from io import BytesIO
import urllib2 as net
from lxml import etree
import lxml.html    

request = net.Request("http://gbgfotboll.se/serier/?scr=table&ftid=57108")
response = net.urlopen(request)
data = response.read()

collected = [] #list-tuple of [(col1, col2...), (col1, col2...)]
dom = lxml.html.parse(BytesIO(data))
#all table rows    
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/table[1]/tbody/tr')

for row in rows:
    columns = row.findall("td")
    collected.append((
        columns[0].find("a").text.encode("utf8"), # Lag
        columns[1].text, # S
        columns[5].text, # GM-IM
        columns[7].text, # P - last column
    ))

for i in collected: print i
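The answer above is Python 2 (urllib2, print statement). As a self-contained Python 3 sketch of the same row-extraction logic, here is the identical XPath and column indexing run against an invented HTML fragment instead of the live page (the team name and numbers below are made up for illustration; only the column positions follow the answer):

```python
from io import BytesIO
from lxml import etree
import lxml.html

# Invented fragment mimicking the standings table's structure:
# column 1 = Lag, column 2 = S, column 6 = GM-IM, column 8 = P
html = b"""
<div id="content-primary">
  <table><tbody>
    <tr><td><a>Team A</a></td><td>10</td><td>6</td><td>2</td><td>2</td>
        <td>25-8</td><td>17</td><td>20</td></tr>
  </tbody></table>
</div>
"""

dom = lxml.html.parse(BytesIO(html))
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/table[1]/tbody/tr')

collected = []
for row in rows:
    cols = row.findall("td")
    collected.append((
        cols[0].find("a").text,  # Lag (team name inside the <a>)
        cols[1].text,            # S
        cols[5].text,            # GM-IM
        cols[7].text,            # P - last column
    ))

print(collected)  # [('Team A', '10', '25-8', '20')]
```

Against the real page you would replace the inline fragment with the fetched document, e.g. `lxml.html.parse("http://gbgfotboll.se/serier/?scr=table&ftid=57108")`.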


You could pass the URL to lxml.html.parse() directly rather than going through urllib2. Also, you could grab the target table by its class attribute, like this:

# new version
from lxml import etree
import lxml.html    

collected = [] #list-tuple of [(col1, col2...), (col1, col2...)]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=table&ftid=57108")
#all table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval("""//div[@id="content-primary"]/table[
    contains(concat(" ", @class, " "), " clTblStandings ")]/tbody/tr""")

for row in rows:
    columns = row.findall("td")
    collected.append((
        columns[0].find("a").text.encode("utf8"), # Lag
        columns[1].text, # S
        columns[5].text, # GM-IM
        columns[7].text, # P - last column
    ))

for i in collected: print i
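The `contains(concat(" ", @class, " "), " clTblStandings ")` idiom matches the class as a whole token even when the element carries several space-separated classes, which a plain `@class="clTblStandings"` equality test would miss. A minimal self-contained sketch (the two-table HTML fragment is invented for illustration):

```python
from io import BytesIO
from lxml import etree
import lxml.html

# Invented fragment: two tables, only the first carries clTblStandings
# among other classes, as real pages often do.
html = b"""
<div id="content-primary">
  <table class="foo clTblStandings"><tbody><tr><td>standings</td></tr></tbody></table>
  <table class="clTblFixtures"><tbody><tr><td>fixtures</td></tr></tbody></table>
</div>
"""

dom = lxml.html.parse(BytesIO(html))
xpatheval = etree.XPathDocumentEvaluator(dom)

# Token-safe class match: selects only the standings table
rows = xpatheval('//div[@id="content-primary"]/table['
                 'contains(concat(" ", @class, " "), " clTblStandings ")]/tbody/tr')
print(rows[0].findtext("td"))  # standings

# A naive exact-equality match finds nothing, because the
# attribute value is "foo clTblStandings", not "clTblStandings"
naive = xpatheval('//table[@class="clTblStandings"]')
print(len(naive))  # 0
```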
