
How do I find the right element for my web-scraping?

I'm trying to get as much information as possible about the top 25 most valuable players on Transfermarkt. I've managed to get some information (with help from colleagues and Stack Overflow), and now I'm trying to get each player's position, which I find quite hard because it seems to be structured differently from the other elements. I'm a beginner at this, so any source material or direct help with code is welcome. Link to the website I'm scraping: Transfermarkt

I've tried reaching the elements through different paths, but I can't seem to get it. I've read about bs4 on crummy.com and looked at other Transfermarkt examples here on Stack Overflow, but my limited coding knowledge is holding me back. I'm testing different types of elements outside my main code to see whether I get the right result.

My test code looks like this, and the print outputs nothing:

import requests
from bs4 import BeautifulSoup
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
r = requests.get(
    "https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop", headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')

for position in soup.find_all("td",class_="inline_table"):
    print(position)

Working from VarKas's answer, but realigning it with your original attempt: if you look for 'table' elements with the class 'inline-table' (a 'table' rather than a 'td', and with a hyphen rather than an underscore), you get the "mini tables" that hold a player's name and position in rows 1 and 2 respectively:

for table in soup.find_all('table', attrs={'class': 'inline-table'}):
    rows = table.find_all('tr')  # unlike .contents, this skips any whitespace text nodes
    print(rows[0].get_text(strip=True))  # Name
    print(rows[1].get_text(strip=True))  # Position
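
To see why the original loop printed nothing, here is a minimal offline sketch. The HTML snippet is a hypothetical, simplified stand-in for one player cell (the real page has more attributes and may differ), and "Player One" is placeholder data. It compares the selector from the question with the corrected one:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for one player cell (an assumption,
# not copied from the live page).
SAMPLE_HTML = """
<table class="items">
  <tr>
    <td>
      <table class="inline-table">
        <tr><td><a href="#"><img src="x.png" alt=""></a></td><td><a href="#">Player One</a></td></tr>
        <tr><td>Centre-Forward</td></tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector from the question: wrong tag ("td") and wrong class ("inline_table").
wrong = soup.find_all("td", class_="inline_table")

# Corrected selector: the class sits on a <table> and is spelled with a hyphen.
right = soup.find_all("table", class_="inline-table")

rows = right[0].find_all("tr")
name = rows[0].get_text(strip=True)      # text of row 1
position = rows[1].get_text(strip=True)  # text of row 2

print(len(wrong), len(right))
print(name, "|", position)
```

Checking the tag name and the exact class spelling in the browser's element inspector is usually the quickest way to catch this kind of mismatch.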

In addition, if you want to look up more than the top 25, you can step through all the pages of the table (there are 20) by adding '?page=' followed by the page number to the URL:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}

pages = range(1, 21)  # pages 1 to 20 inclusive; range() excludes the stop value

for page in pages:

    r = requests.get(
        "https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop?page=%d" % page, headers=headers)

    soup = BeautifulSoup(r.text, 'html.parser')

    for table in soup.find_all('table', attrs={'class': 'inline-table'}):
        rows = table.find_all('tr')  # unlike .contents, this skips any whitespace text nodes
        print(rows[0].get_text(strip=True))  # Name
        print(rows[1].get_text(strip=True))  # Position
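
If you would rather keep the results than print them, the same loop can be split into small functions. This is only a sketch: `page_urls`, `parse_players` and `scrape` are hypothetical helper names, the one-second pause is an arbitrary courtesy delay, and the row layout (name in row 1, position in row 2) is assumed from the snippet above.

```python
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0"
}


def page_urls(base_url, last_page):
    """Build one URL per page, from page 1 up to and including last_page."""
    return [f"{base_url}?page={n}" for n in range(1, last_page + 1)]


def parse_players(html):
    """Return (name, position) tuples from every 'inline-table' in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for table in soup.find_all("table", class_="inline-table"):
        rows = table.find_all("tr")
        if len(rows) >= 2:  # skip tables that lack the assumed two rows
            players.append(
                (rows[0].get_text(strip=True), rows[1].get_text(strip=True))
            )
    return players


def scrape(last_page=20, delay=1.0):
    """Fetch every page and collect all players (performs network requests)."""
    players = []
    for url in page_urls(BASE_URL, last_page):
        r = requests.get(url, headers=HEADERS, timeout=30)
        r.raise_for_status()  # fail loudly instead of parsing an error page
        players.extend(parse_players(r.text))
        time.sleep(delay)  # pause between requests
    return players
```

Calling `scrape()` would then return one list covering all 20 pages, and `parse_players` can be tried on saved HTML without touching the network.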

Use this code, which finds each player's 'inline-table' and reads the name from the second link inside it:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
r = requests.get(
    "https://www.transfermarkt.co.uk/spieler-statistik/wertvollstespieler/marktwertetop", headers=headers)

soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find_all("table", {"class": "inline-table"})
# table[0] ---> Mbappé data
# table[1] ---> Raheem Sterling data
# table[2] ---> Neymar data
print(table[0].find_all('a')[1].get_text())  # Mbappé's name (second link in his table)
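
Note that `table[0]` and `find_all('a')[1]` raise an `IndexError` when the page comes back without the expected markup, for example if the request was blocked. A more defensive variant might look like the sketch below; `second_link_text` and `player_names` are hypothetical helper names, and the assumption that the name is the second link comes from the code above.

```python
from bs4 import BeautifulSoup


def second_link_text(table):
    """Return the text of the second <a> in the table, or None if it is missing."""
    links = table.find_all("a")
    if len(links) < 2:
        return None
    return links[1].get_text(strip=True)


def player_names(html):
    """Return one entry per 'inline-table'; None marks a table without a name link."""
    soup = BeautifulSoup(html, "html.parser")
    tables = soup.find_all("table", {"class": "inline-table"})
    return [second_link_text(t) for t in tables]
```

An empty list from `player_names(r.text)` is then a hint to inspect `r.status_code` and the returned HTML before changing the selector.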


 