
Why am I receiving an empty list in Python when using XPath with lxml?

I want to scrape from this page: https://www.leagueofgraphs.com/summoner/na/samrick41#championsData-soloqueue to get a specific winrate value for a role.

import requests
from lxml import html

url = 'https://www.leagueofgraphs.com/summoner/na/samrick41#championsData-soloqueue'
headers = {my headers here}
page = requests.get(url, headers=headers)
contents = page.content

tree = html.fromstring(contents)

print (tree.xpath('//*[@id="profileRoles"]/div[2]/div[2]/table/tbody/tr[2]/td[3]/a/progressbar/div[2]/text()'))

[]

I get an empty list in response. I think I need to remove "tbody", because without it I at least get an element up to the "progressbar" node, though I'm not sure why. But from there, why can't I get the percent value with the last "div[2]"? I'm sure there are other ways to get the value I'm looking for, but I feel like this should work, so I must be misunderstanding something. If anyone can enlighten me, thanks.

You're getting the right response, but the HTML you want is actually generated by JavaScript after the page loads. You can see this when you disable JavaScript in the browser: you won't find any children of progressbar in the HTML.

In Chrome you can easily disable JavaScript by inspecting the page: the right-hand side of the developer tools has three dots --> More tools --> Settings --> scroll down to Debugger and tick "Disable JavaScript". In fact, I always do this before attempting any scraping; if there's any functionality in the website, the DOM is often being manipulated by JavaScript.
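The same check can be done from Python by printing what lxml actually parsed. The sketch below uses a small hand-written snippet as a stand-in for the served page (the markup is an assumption, not the site's real HTML). It also illustrates the "tbody" puzzle from the question: browsers insert a tbody element into the DOM shown in dev tools, while lxml keeps the table as it was sent.

```python
from lxml import html

# Hypothetical stand-in for the served (pre-JavaScript) markup; the real page may differ.
SAMPLE = """
<div id="profileRoles">
  <table>
    <tr><th>Role</th><th>Played</th><th>Winrate</th></tr>
    <tr>
      <td>Mid</td>
      <td>25</td>
      <td><a href="#"><progressbar data-value="0.52"></progressbar></a></td>
    </tr>
  </table>
</div>
"""

tree = html.fromstring(SAMPLE)

# A path copied from dev tools contains tbody, which is absent from the served HTML.
with_tbody = tree.xpath('//*[@id="profileRoles"]/table/tbody/tr[2]/td[3]/a/progressbar')
without_tbody = tree.xpath('//*[@id="profileRoles"]/table/tr[2]/td[3]/a/progressbar')

bar = without_tbody[0]
children = list(bar)  # child elements of progressbar as served
inner_text = tree.xpath('//progressbar/div[2]/text()')  # the tail of the question's path
raw = html.tostring(bar, encoding="unicode")

print(len(with_tbody), len(without_tbody), len(children), inner_text)
print(raw)
```

Running the same `html.tostring` call on the element from the real response shows exactly what the server sent, without touching the browser settings.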

With JavaScript disabled, you don't get the nice neat image with the numbers. Having said that, the information you want is actually in the progressbar element's data-value attribute.

import requests
from lxml import html

url = 'https://www.leagueofgraphs.com/summoner/na/samrick41#championsData-soloqueue'

headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'en-US,en;q=0.9',
}
page = requests.get(url, headers=headers)
contents = page.content

tree = html.fromstring(contents)

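# Each progressbar in the third column carries the winrate in its data-value attribute
for a in tree.xpath('//td[3]/a/progressbar'):
    winrate = a.get('data-value')
    # data-value is a fraction stored as a string: convert to float, then to a percentage
    print('Winrate: ',round(float(winrate)*100,1),'%')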

Output

Winrate:  52.0 %
Winrate:  45.5 %
Winrate:  37.5 %
Winrate:  100.0 %
Winrate:  0.0 %
Winrate:  0.0 %
Winrate:  0.0 %
...

I'll admit I've been lazy, as I'm not sure what your precise data needs are, but this will get you a bit further.
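If only one role's winrate is needed, one option is to filter the rows by the role label before reading data-value. This is a rough sketch on made-up markup: the column layout, the role labels, and the helper name `winrate_for_role` are all assumptions, so the XPath will likely need adjusting against the real page.

```python
from lxml import html

# Hypothetical markup: the real table's columns and role labels may differ.
SAMPLE = """
<table>
  <tr><th>Role</th><th>Played</th><th>Winrate</th></tr>
  <tr><td>Mid</td><td>25</td><td><a href="#"><progressbar data-value="0.52"></progressbar></a></td></tr>
  <tr><td>Support</td><td>11</td><td><a href="#"><progressbar data-value="0.455"></progressbar></a></td></tr>
</table>
"""


def winrate_for_role(tree, role):
    """Return the raw data-value string for the row whose first cell equals role, or None."""
    # $role is an XPath variable; lxml fills it in from the keyword argument.
    values = tree.xpath(
        '//tr[normalize-space(td[1]) = $role]/td[3]/a/progressbar/@data-value',
        role=role,
    )
    return values[0] if values else None


tree = html.fromstring(SAMPLE)
print(winrate_for_role(tree, "Support"))
print(winrate_for_role(tree, "Jungle"))
```

Passing the role as an XPath variable avoids building the expression with string formatting, and returning None for a missing row keeps the caller from indexing into an empty list.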

The values come out as decimal fractions rather than percentages, so they need converting. The selector gives us a string, so we first have to convert it to a float in order to manipulate it; we then multiply by 100 and use the round() function to round to one decimal place.
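As a small worked example of that conversion, here is a sketch with a hypothetical helper, `to_percent`, and made-up input strings. It also guards against a missing attribute, since `Element.get()` returns None when data-value is absent and `float(None)` would raise a TypeError.

```python
def to_percent(raw, ndigits=1):
    """Convert a fraction string such as "0.52" to a percentage rounded to ndigits."""
    if raw is None:  # Element.get() returns None when the attribute is absent
        return None
    return round(float(raw) * 100, ndigits)


# Illustrative inputs only; real values come from the data-value attribute.
for raw in ["0.52", "0.375", "1", None]:
    print(raw, "->", to_percent(raw))
```

Note that round() rounds to the nearest value at the given precision rather than always rounding up.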
