Getting empty list when scraping web page content using xpath

When I try to retrieve some data using XPath from the URL in the following code, I get an empty list:

from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    page = requests.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))
>>> []

What I expect is a string value like this one:

>>> ['\r\n        5.0%    ']

This is because the XPath element that you are searching for is only populated after the page's JavaScript has run.
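A quick offline illustration (using a made-up HTML fragment, not the actual page source) of why `text()` comes back empty when the element is served without its text:

```python
from lxml import html

# Hypothetical fragment mimicking the served page: the div exists in the
# raw HTML, but its text is only filled in later by JavaScript, so there
# is no text node for the XPath to match.
raw = '<div id="graphDD1"></div>'
tree = html.fromstring(raw)
print(tree.xpath('//*[@id="graphDD1"]/text()'))  # []

# Once the text is present (as it would be after the JS runs), the same
# XPath returns it.
filled = html.fromstring('<div id="graphDD1">5.0%</div>')
print(filled.xpath('//*[@id="graphDD1"]/text()'))  # ['5.0%']
```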

You will need to find out the cookie which is generated after the JavaScript has been called so that you can make the same call to the URL.

  1. Go to the 'Network' tab of the Dev Console
  2. Find the difference in the request header after abg_lite.js has run (mine was cookie: __cf_bm=TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I=)
  3. Add the cookie to your request
from lxml import html
import requests

if __name__ == '__main__':
    url = 'https://www.leagueofgraphs.com/champions/stats/aatrox'

    # Create a session to add cookies and headers to
    s = requests.Session()

    # After finding the correct cookie, update your session's cookie jar.
    # Set the cookie under its own name (__cf_bm) and substitute your own
    # value here; the parentheses join the two string literals into one.
    s.cookies['__cf_bm'] = (
        'TtnYbPlIA0J_GOhNj2muKa1pi8pU38iqA3Yglaua7q8-1636535361-0-'
        'AQcpStbhEdH3oPnKSuPIRLHVBXaqVwo+zf6d3YI/rhmk/RvN5B7OaIcfwtvVyR0IolwcoCk4ClrSvbBP4DVJ70I='
    )

    # Update headers to spoof a regular browser; this may not be necessary
    # but is good practice to bypass any basic bot detection
    s.headers.update({
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/89.0.4389.90 Safari/537.36'
    })

    page = s.get(url)
    tree = html.fromstring(page.content)

    # XPath to get the XP
    print(tree.xpath('//*[@id="graphDD1"]/text()'))

The following output is achieved:

['\r\n        5.0%    ']
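Since the matched text carries the surrounding whitespace from the page markup, a small follow-up step (using the result shown above) reduces it to just the percentage value:

```python
# Result as returned by tree.xpath(...) above, including the leading
# newline and padding from the page markup
result = ['\r\n        5.0%    ']

# strip() removes the surrounding whitespace, leaving the value itself
value = result[0].strip()
print(value)  # 5.0%
```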
