BeautifulSoup Only Retrieves 1st element
I'm currently trying to scrape a website.
Here is part of my code:
import pandas as pd
from bs4 import BeautifulSoup
import requests
#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
for a in soup:
    print (soup.find("td", {"data-stat" : "avg_age"}).text)
Basically, I have the whole source code inside "soup". However, when I look up elements such as "td", {"data-stat": "avg_age"}, I only get the result of the first row {"data-row":"0"}, repeated, as output:
29.1
29.1
29.1
29.1
29.1
So here are my questions:
-> Why is my code stuck on the first row when there is no preselection in my "soup" variable?
-> Is there a way to write a loop that checks all the wanted elements for a different row each time? From "data-row":"0" to "data-row":"19", for instance.
Thanks for your support and have a great day!
It's stuck in the first row for a couple of reasons:
You are using .find(), which only returns the first element it "finds" in the html soup object.
soup.find("td", {"data-stat": "avg_age"}).text will always return the same thing, because it searches the whole soup and never uses the loop variable a. Look at your loop. Essentially this would be the same logic as you have there:
for x in [1, 2, 3, 4]:
    print(1)
As it iterates through that list, it's just going to print 1 each time, and you will get 1 four times in your console.
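To make the difference concrete, here is a minimal offline sketch. The HTML string is a hypothetical miniature of the stats table (only the data-stat attribute names and the first three ages come from the question), so treat it as an illustration rather than the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the stats table; the real page has far more cells.
html = """
<table>
  <tr><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td></tr>
  <tr><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td></tr>
  <tr><th data-stat="team">Auxerre</th><td data-stat="avg_age">29.4</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# .find() stops at the first match, so repeating the call never advances.
first = soup.find("td", {"data-stat": "avg_age"})
print(first.text)  # 29.1

# .find_all() returns every match in document order.
ages = [td.text for td in soup.find_all("td", {"data-stat": "avg_age"})]
print(ages)  # ['29.1', '26.8', '29.4']
```

Note also that iterating over a BeautifulSoup object directly (for a in soup) walks its top-level nodes, which is why the original loop printed the same value once per top-level node.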
You need to get all the rows in soup with soup.find_all('tr'). Then, as you iterate over the rows, call .find() on each row and get the text only if that row has a <td> tag with the attribute data-stat="avg_age".
import pandas as pd
from bs4 import BeautifulSoup
import requests
#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())
rows = soup.find_all('tr')
for a in rows:
    # Search inside this row only, and skip rows that have no avg_age cell
    cell = a.find("td", {"data-stat" : "avg_age"})
    if cell:
        print (cell.text)
Output:
29.1
26.8
29.4
26.8
27.8
26.2
27.2
25.8
26.0
26.9
24.8
25.5
26.9
25.9
27.6
24.5
26.3
28.8
25.6
26.7
26.1
28.2
26.9
26.6
26.0
27.7
28.0
26.8
29.9
25.5
27.1
27.1
27.1
27.2
27.0
27.0
25.1
25.8
25.9
25.8
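The output above lists 40 values although the league has 20 teams, which suggests the page contains more than one table with an avg_age column. As a rough sketch of the "different row each time" loop from the question, the example below limits the search to one table and collects several data-stat cells per row. The markup, the table ids "squads" and "opponents", the "possession" stat name and the "vs Ajaccio" row are all hypothetical stand-ins, not the real page's identifiers.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables that both carry an avg_age column.
html = """
<table id="squads">
  <tbody>
    <tr><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td><td data-stat="possession">34.5</td></tr>
    <tr><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td><td data-stat="possession">55.0</td></tr>
  </tbody>
</table>
<table id="opponents">
  <tbody>
    <tr><th data-stat="team">vs Ajaccio</th><td data-stat="avg_age">26.1</td><td data-stat="possession">65.5</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Limit the search to a single table (the id is an assumption).
table = soup.find("table", id="squads")

wanted = ["team", "avg_age", "possession"]
records = []
for row in table.find_all("tr"):
    # Map each data-stat name to its cell text, for this row only.
    cells = {
        cell["data-stat"]: cell.get_text(strip=True)
        for cell in row.find_all(["th", "td"], attrs={"data-stat": True})
    }
    if "avg_age" in cells:  # skip rows that do not carry the stat
        records.append({name: cells.get(name) for name in wanted})

for record in records:
    print(record)
```

Each dictionary in records then holds the wanted values for exactly one row, so there is no need to address rows by a data-row index.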
Just as a note, pandas' .read_html() can use bs4 under the hood to parse <table> tags (by default it tries lxml first and falls back to bs4 with html5lib). Use that. It's far easier.
import pandas as pd
df = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
Output:
print(df)
           Équipe  # JC   Âge  Poss  MJ  ...  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1
 0        Ajaccio    18  29.1  34.5   2  ...  0.59  0.14   0.73    0.20       0.34
 1         Angers    18  26.8  55.0   2  ...  1.00  0.49   1.49    1.00       1.49
 2        Auxerre    15  29.4  39.5   2  ...  0.43  0.43   0.85    0.43       0.85
 3          Brest    18  26.8  42.5   2  ...  0.63  0.23   0.86    0.23       0.47
 4  Clermont Foot    18  27.8  48.5   2  ...  0.17  0.07   0.24    0.17       0.24
 5           Lens    16  26.2  63.0   2  ...  1.48  0.94   2.41    1.08       2.02
 6          Lille    18  27.2  65.0   2  ...  2.02  1.65   3.66    2.02       3.66
 7        Lorient    14  25.8  36.0   1  ...  0.37  0.26   0.63    0.37       0.63
 8           Lyon    15  26.0  68.0   1  ...  1.52  0.49   2.00    0.73       1.22
 9      Marseille    17  26.9  55.0   2  ...  1.10  0.89   1.99    1.10       1.99
10         Monaco    19  24.8  40.5   2  ...  2.75  1.21   3.96    2.36       3.57
11    Montpellier    19  25.5  47.5   2  ...  0.93  0.66   1.59    0.93       1.59
12         Nantes    16  26.9  40.5   2  ...  1.37  0.60   1.97    1.37       1.97
13           Nice    18  25.9  54.0   2  ...  0.49  0.40   0.88    0.49       0.88
14      Paris S-G    18  27.6  60.0   2  ...  3.05  1.76   4.81    2.27       4.03
15          Reims    18  24.5  43.0   2  ...  0.54  0.42   0.96    0.54       0.96
16         Rennes    17  26.3  65.0   2  ...  1.86  1.15   3.01    1.86       3.01
17     Strasbourg    18  28.8  49.5   2  ...  0.60  0.57   1.17    0.60       1.17
18       Toulouse    18  25.6  57.0   2  ...  0.58  0.58   1.15    0.58       1.15
19         Troyes    16  26.7  39.0   2  ...  0.91  0.23   1.14    0.52       0.75
[20 rows x 29 columns]
To print just the Âge column: print(df['Âge'])
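For readers wondering what header=1 and the trailing [0] do, here is a small offline sketch. It assumes the table has two header rows, with a grouping row above the real column names; the HTML string and the "Temps de jeu" group label are hypothetical. Like any read_html call, it needs an HTML parser installed (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Hypothetical two-row header: a grouping row above the real column names.
html = """
<table>
  <thead>
    <tr><th></th><th colspan="2">Temps de jeu</th></tr>
    <tr><th>Équipe</th><th>Âge</th><th>MJ</th></tr>
  </thead>
  <tbody>
    <tr><td>Ajaccio</td><td>29.1</td><td>2</td></tr>
    <tr><td>Angers</td><td>26.8</td><td>2</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
# header=1 uses the second header row (index 1) as the column names.
tables = pd.read_html(StringIO(html), header=1)
df = tables[0]  # the first table in the document

print(df.columns.tolist())
print(df["Âge"].tolist())
```

If the page holds several tables, the other list entries (tables[1] and so on) give access to them in document order.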