Scraping data with Python and lxml
I'm trying to retrieve a specific string by scraping, but my query seems to return nothing. I'm using Python and lxml, and the XPath does not return the string inside the a tag.
Here is the HTML I'm trying to retrieve the string from:
    <fieldset>
      <legend align="center">
        <a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a>
      </legend>
    </fieldset>
Here is what I've tried:
    def get_league(self):
        request = requests.get(self.url)
        tree = html.fromstring(request.content)
        league = tree.xpath("//legend[@class='center']//a")
        return league
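Use XPath to select the text explicitly. Note that the legend in your HTML carries align="center", not class="center", so the predicate has to test @align:
    //legend[@align='center']/a/text()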
The XPath Helper plugin for Chrome helps a lot when writing lxml queries.
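To check the difference between the two queries without hitting the network, here is a minimal sketch that parses the HTML from the question as an inline string. The names SAMPLE, by_class, by_align and league are only illustrative, and it assumes lxml is installed.

```python
from lxml import html

# Markup copied from the question, inlined so no HTTP request is needed.
SAMPLE = """
<fieldset>
<legend align="center">
<a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a>
</legend>
</fieldset>
"""

tree = html.fromstring(SAMPLE)

# The query from the question filters on @class, which the legend does not have.
by_class = tree.xpath("//legend[@class='center']//a")

# Filtering on @align and selecting text() returns a list of strings.
by_align = tree.xpath("//legend[@align='center']/a/text()")

# xpath() always returns a list here, so guard against an empty result.
league = by_align[0].strip() if by_align else None

print(by_class)  # []
print(league)    # CS:GO Champions League
```

Without text(), the query returns a list of element objects rather than strings, so you would read the .text attribute of each element instead.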
Try this. It isn't lxml, but you can use it for scraping. First I'll define a helper function of my own, which makes the scraping easier:
    def getBetweenHTML(strSource, strStart, strEnd):
        start = strSource.find(strStart) + len(strStart)
        end = strSource.find(strEnd, start)
        return strSource[start:end]
Afterwards, I'm going to do this:
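    def get_league(self):
        import urllib2  # Python 2 only; Python 3 moved this to urllib.request
        page = urllib2.urlopen(self.url).read()
        # Return the text between the opening <a ...> tag and the closing </a>
        return getBetweenHTML(page, '<a href="/counterstrike/events/302-cs-go-champions-league">', "</a>")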
This worked for me; it's just an alternative. If it's not what you're looking for, tell me and I'll rewrite it for lxml.
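One caveat with the string-slicing approach: str.find returns -1 when a marker is missing, so the helper as written would quietly return an unrelated slice instead of failing. Below is a rough sketch of a more defensive variant, run against the HTML from the question. The name get_between and the choice to return None are my own assumptions, not part of the answer above.

```python
def get_between(source, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = source.find(start_marker)
    if start == -1:
        return None  # opening marker not found
    start += len(start_marker)
    end = source.find(end_marker, start)
    if end == -1:
        return None  # closing marker not found after the opening one
    return source[start:end]


# Markup copied from the question, inlined so no HTTP request is needed.
SAMPLE = """
<fieldset>
<legend align="center">
<a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a>
</legend>
</fieldset>
"""

link = '<a href="/counterstrike/events/302-cs-go-champions-league">'

print(get_between(SAMPLE, link, "</a>"))                 # CS:GO Champions League
print(get_between(SAMPLE, '<a href="/other">', "</a>"))  # None
```

Keep in mind that the opening marker hard-codes the href of one specific event, so this only matches that exact link, whereas the XPath query matches whatever link sits inside the legend.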