
Scraping data with Python and lxml

I'm trying to retrieve a specific string by scraping, but my query seems to return nothing. I'm using Python and lxml, and I can't get the string inside the a tag.

Here is the HTML I'm trying to retrieve it from:

<fieldset>
    <legend align="center">
        <a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a>
    </legend>
</fieldset>

Here is what I've tried:

def get_league(self):
    request = requests.get(self.url)
    tree = html.fromstring(request.content)
    league = tree.xpath("//legend[@class='center']//a")
    return league

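Use XPath to select the text explicitly. Note that the legend element in your HTML has align="center", not class="center", so the predicate has to test @align: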

//legend[@align='center']/a/text()

The XPath Helper extension for Chrome helps a lot when writing XPath queries for lxml.
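
As a minimal sketch of the XPath approach, the snippet below parses the HTML from the question offline instead of fetching it with requests. The PAGE constant and the league_name helper are illustrative names, not part of the original code. It also shows that xpath() returns a list, so the first item has to be picked out explicitly.

```python
from lxml import html

# The HTML from the question, inlined so the example runs without a network call.
PAGE = """
<fieldset>
    <legend align="center">
        <a href="/counterstrike/events/302-cs-go-champions-league">CS:GO Champions League</a>
    </legend>
</fieldset>
"""


def league_name(source):
    """Return the link text inside <legend align="center">, or None if absent."""
    tree = html.fromstring(source)
    # Match on @align (the attribute actually present), not @class.
    matches = tree.xpath("//legend[@align='center']/a/text()")
    # For a text() query, xpath() returns a list of strings; take the first one.
    return matches[0].strip() if matches else None


print(league_name(PAGE))
```

Returning None for an empty result list avoids an IndexError when the page layout changes and the legend is no longer there.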

Try this. It's not lxml, but you can use it for scraping. First I'll define a small helper function of my own, which makes the scraping easier:

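def getBetweenHTML(strSource, strStart, strEnd):
    # Index just past the end of the opening marker
    start = strSource.find(strStart) + len(strStart)
    # Index of the first closing marker after that point
    end = strSource.find(strEnd, start)
    # Everything between the two markers
    return strSource[start:end]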

Then I call it like this:

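def get_league(self):
    # urllib2 exists only on Python 2; urllib.request is the Python 3 equivalent
    from urllib.request import urlopen
    # Decode the response bytes (UTF-8 is assumed here) so str markers can be searched
    page = urlopen(self.url).read().decode("utf-8")
    # Return the extracted text; the original snippet computed it but discarded it
    return getBetweenHTML(page, '<a href="/counterstrike/events/302-cs-go-champions-league">', "</a>")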

This worked for me; it's just an alternative. If it's not what you're looking for, tell me and I'll rewrite it for lxml.
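
To try the substring approach without a network call, the sketch below applies the same find-based idea to the HTML from the question. The extract_between name, the SNIPPET constant, and the None return for a missing marker are assumptions added here; str.find returns -1 when a marker is absent, which the original helper does not check.

```python
def extract_between(source, start_marker, end_marker):
    """Return the text between two markers, or None if either is missing."""
    start = source.find(start_marker)
    if start == -1:
        return None
    # Move past the opening marker itself.
    start += len(start_marker)
    end = source.find(end_marker, start)
    if end == -1:
        return None
    return source[start:end]


# The link from the question, inlined so the example runs offline.
SNIPPET = (
    '<legend align="center">'
    '<a href="/counterstrike/events/302-cs-go-champions-league">'
    "CS:GO Champions League</a></legend>"
)
OPEN_TAG = '<a href="/counterstrike/events/302-cs-go-champions-league">'

print(extract_between(SNIPPET, OPEN_TAG, "</a>"))
print(extract_between(SNIPPET, '<a href="/missing">', "</a>"))
```

Because the opening marker hardcodes the full href, this only matches this one event link; the XPath query is the more general option when the link target varies.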
