Extracting information from a table on a website using Python, lxml & XPath
After a lot of effort, I managed to extract some of the information I needed from a table on this website:
http://gbgfotboll.se/serier/?scr=table&ftid=57108
I extracted the date and team names from the "Kommande Matcher" table (the second table).
But now I am completely stuck trying to extract the following from the first table:
the first column, "Lag"
the second column, "S"
the 6th column, "GM-IM"
the last column, "P"
Any ideas? Thanks.
I just got it working:
from io import BytesIO
import urllib2 as net
from lxml import etree
import lxml.html

request = net.Request("http://gbgfotboll.se/serier/?scr=table&ftid=57108")
response = net.urlopen(request)
data = response.read()

collected = []  # list of tuples: [(col1, col2, ...), (col1, col2, ...)]
dom = lxml.html.parse(BytesIO(data))

# all table rows
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval('//div[@id="content-primary"]/table[1]/tbody/tr')
for row in rows:
    columns = row.findall("td")
    collected.append((
        columns[0].find("a").text.encode("utf8"),  # Lag
        columns[1].text,                           # S
        columns[5].text,                           # GM-IM
        columns[7].text,                           # P - last column
    ))

for i in collected:
    print i
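The same row-walking logic can be checked offline against an inline HTML fragment, which is handy when the live page is unavailable or has changed. Note this is a Python 3 sketch, and the table markup below is invented sample data standing in for the real standings table:

```python
# Sketch: the same XPath / per-row extraction applied to an inline HTML
# fragment, so it runs without hitting gbgfotboll.se.
# The team names and numbers below are made up for demonstration.
import lxml.html

HTML = """
<div id="content-primary">
  <table class="clTblStandings">
    <tbody>
      <tr>
        <td><a href="#">Team A</a></td><td>5</td><td>3</td><td>1</td>
        <td>1</td><td>10-4</td><td>6</td><td>10</td>
      </tr>
      <tr>
        <td><a href="#">Team B</a></td><td>5</td><td>2</td><td>2</td>
        <td>1</td><td>8-6</td><td>2</td><td>8</td>
      </tr>
    </tbody>
  </table>
</div>
"""

dom = lxml.html.fromstring(HTML)
collected = []
for row in dom.xpath('//div[@id="content-primary"]/table[1]/tbody/tr'):
    columns = row.findall("td")
    collected.append((
        columns[0].find("a").text,  # Lag
        columns[1].text,            # S
        columns[5].text,            # GM-IM
        columns[7].text,            # P - last column
    ))

print(collected)
```

This keeps the column indices identical to the live-page version, so the slicing logic can be exercised in isolation.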
You can pass the URL directly to lxml.html.parse() instead of going through urllib2. Also, you can select the target table by its class attribute, like this:
# new version
from lxml import etree
import lxml.html

collected = []  # list of tuples: [(col1, col2, ...), (col1, col2, ...)]
dom = lxml.html.parse("http://gbgfotboll.se/serier/?scr=table&ftid=57108")

# all rows of the standings table, selected by class attribute
xpatheval = etree.XPathDocumentEvaluator(dom)
rows = xpatheval("""//div[@id="content-primary"]/table[
    contains(concat(" ", @class, " "), " clTblStandings ")]/tbody/tr""")
for row in rows:
    columns = row.findall("td")
    collected.append((
        columns[0].find("a").text.encode("utf8"),  # Lag
        columns[1].text,                           # S
        columns[5].text,                           # GM-IM
        columns[7].text,                           # P - last column
    ))

for i in collected:
    print i
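A note on the `contains(concat(" ", @class, " "), " clTblStandings ")` predicate: XPath 1.0 has no class-token matching, and a bare `contains(@class, "clTblStandings")` would also match any element whose class attribute merely contains that string as a substring. The concat trick pads the attribute with spaces so only a whole space-separated token matches. A small offline check (the markup here, including the decoy class name, is invented purely to demonstrate the difference):

```python
# Why the concat(" ", @class, " ") idiom: it matches a whole class token,
# while a bare contains(@class, ...) also matches substrings.
import lxml.html

HTML = """
<div>
  <table class="clTblStandings compact"><tbody><tr><td>real</td></tr></tbody></table>
  <table class="clTblStandingsArchive"><tbody><tr><td>decoy</td></tr></tbody></table>
</div>
"""
dom = lxml.html.fromstring(HTML)

# naive substring test: matches the decoy table as well
loose = dom.xpath('//table[contains(@class, "clTblStandings")]')

# token-safe test: matches only the intended table
strict = dom.xpath(
    '//table[contains(concat(" ", @class, " "), " clTblStandings ")]')

print(len(loose), len(strict))
```

If the page had ever carried another table with a class like the decoy above, the naive predicate would silently collect the wrong rows.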