BeautifulSoup only retrieves the first element

Question

I'm currently trying to scrape a website.

Here is part of my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())

for a in soup:
    print (soup.find("td", {"data-stat" : "avg_age"}).text)

Basically, I have the whole source code inside "soup". However, when I select an element with "td", {"data-stat": "avg_age"}, I only get the value from the first row ({"data-row": "0"}) repeated in the output:

29.1
29.1
29.1
29.1
29.1

So here are my questions:

-> Why is my code stuck on the first row when there is no preselection in my "soup" variable?

-> Is there a way to write a loop that checks all the wanted elements for a different row each time? From "data-row": "0" to "data-row": "19", for instance.

Thanks for your support and have a great day!

Answer 1

It's stuck on the first row for a couple of reasons:

You are using .find(), which only returns the first element it finds in the soup object.
You never iterate through anything. soup.find("td", {"data-stat": "avg_age"}).text will always return the same thing, because the loop variable a is never used inside the loop. Look at your loop.

Essentially, this is the same logic as what you have there:

for x in [1, 2, 3, 4]:
    print(1)

As it iterates through that list, it just prints 1 each time, so you get 1 four times in your console.
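To make the difference concrete, here is a minimal sketch run against a small inline table instead of the live page. The two-row HTML string is a hypothetical stand-in, not the actual markup of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html = """
<table>
  <tr data-row="0"><td data-stat="avg_age">29.1</td></tr>
  <tr data-row="1"><td data-stat="avg_age">26.8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first matching tag, however often it is called.
first = soup.find("td", {"data-stat": "avg_age"}).text

# .find_all() returns every matching tag, so there is something to iterate over.
ages = [td.text for td in soup.find_all("td", {"data-stat": "avg_age"})]

print(first)
print(ages)
```

Calling .find() inside a loop does not advance anything; only iterating over the result of .find_all() moves from one match to the next.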

You need to get all the rows in soup with soup.find_all('tr'). Then, as you iterate, only when a row contains a <td> element with the attribute data-stat="avg_age" do you .find() it and get the text.

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())


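# Walk every table row and print the average age where the row has one
rows = soup.find_all('tr')
for a in rows:
    cell = a.find("td", {"data-stat": "avg_age"})
    # Header rows have no such <td>, so .find() returns None for them
    if cell:
        print(cell.text)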

Output:

29.1
26.8
29.4
26.8
27.8
26.2
27.2
25.8
26.0
26.9
24.8
25.5
26.9
25.9
27.6
24.5
26.3
28.8
25.6
26.7
26.1
28.2
26.9
26.6
26.0
27.7
28.0
26.8
29.9
25.5
27.1
27.1
27.1
27.2
27.0
27.0
25.1
25.8
25.9
25.8
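For the second part of the question (checking several wanted elements for each row in turn), the same row loop can collect more than one cell per row. The sketch below is an illustration under assumptions: the inline HTML, the "team" and "possession" data-stat names, and the wanted list are hypothetical and may not match the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's attribute names may differ.
html = """
<table>
  <tr><th data-stat="team">Équipe</th><th data-stat="avg_age">Âge</th><th data-stat="possession">Poss</th></tr>
  <tr><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td><td data-stat="possession">34.5</td></tr>
  <tr><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td><td data-stat="possession">55.0</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Names of the data-stat cells to read from each row (assumed for this sketch).
wanted = ["avg_age", "possession"]

records = []
for row in soup.find_all("tr"):
    # Search inside the current row only, not the whole document.
    cells = {stat: row.find("td", {"data-stat": stat}) for stat in wanted}
    # Header rows have no matching <td>, so skip rows with a missing cell.
    if any(cell is None for cell in cells.values()):
        continue
    record = {stat: float(cell.text) for stat, cell in cells.items()}
    record["team"] = row.find("th", {"data-stat": "team"}).text
    records.append(record)

for record in records:
    print(record)
```

The key point is that each lookup is made on row rather than on soup, so every iteration sees a different row.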

As a side note, pandas' .read_html() parses <table> tags for you (using lxml or bs4 under the hood). Use that. It's far easier.

import pandas as pd

df = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]

Output:

print(df)
           Équipe  # JC   Âge  Poss  MJ  ...  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1
0         Ajaccio    18  29.1  34.5   2  ...  0.59  0.14   0.73    0.20       0.34
1          Angers    18  26.8  55.0   2  ...  1.00  0.49   1.49    1.00       1.49
2         Auxerre    15  29.4  39.5   2  ...  0.43  0.43   0.85    0.43       0.85
3           Brest    18  26.8  42.5   2  ...  0.63  0.23   0.86    0.23       0.47
4   Clermont Foot    18  27.8  48.5   2  ...  0.17  0.07   0.24    0.17       0.24
5            Lens    16  26.2  63.0   2  ...  1.48  0.94   2.41    1.08       2.02
6           Lille    18  27.2  65.0   2  ...  2.02  1.65   3.66    2.02       3.66
7         Lorient    14  25.8  36.0   1  ...  0.37  0.26   0.63    0.37       0.63
8            Lyon    15  26.0  68.0   1  ...  1.52  0.49   2.00    0.73       1.22
9       Marseille    17  26.9  55.0   2  ...  1.10  0.89   1.99    1.10       1.99
10         Monaco    19  24.8  40.5   2  ...  2.75  1.21   3.96    2.36       3.57
11    Montpellier    19  25.5  47.5   2  ...  0.93  0.66   1.59    0.93       1.59
12         Nantes    16  26.9  40.5   2  ...  1.37  0.60   1.97    1.37       1.97
13           Nice    18  25.9  54.0   2  ...  0.49  0.40   0.88    0.49       0.88
14      Paris S-G    18  27.6  60.0   2  ...  3.05  1.76   4.81    2.27       4.03
15          Reims    18  24.5  43.0   2  ...  0.54  0.42   0.96    0.54       0.96
16         Rennes    17  26.3  65.0   2  ...  1.86  1.15   3.01    1.86       3.01
17     Strasbourg    18  28.8  49.5   2  ...  0.60  0.57   1.17    0.60       1.17
18       Toulouse    18  25.6  57.0   2  ...  0.58  0.58   1.15    0.58       1.15
19         Troyes    16  26.7  39.0   2  ...  0.91  0.23   1.14    0.52       0.75

[20 rows x 29 columns]

To print just the age column: print(df['Âge'])
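To try the pandas approach without hitting the site, here is a small sketch on an inline table with two header rows, which is why header=1 is passed. The HTML string and the "Group" label are made up for illustration, and the sketch assumes a parser backend for read_html (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table with a grouping header row above the real column names.
html = """
<table>
  <thead>
    <tr><th></th><th colspan="2">Group</th></tr>
    <tr><th>Équipe</th><th>Âge</th><th>Poss</th></tr>
  </thead>
  <tbody>
    <tr><td>Ajaccio</td><td>29.1</td><td>34.5</td></tr>
    <tr><td>Angers</td><td>26.8</td><td>55.0</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
# header=1 takes the column names from the second header row.
df = pd.read_html(StringIO(html), header=1)[0]

# Select a single column by its header text.
print(df["Âge"])
```

Wrapping the string in StringIO avoids the deprecation warning that recent pandas versions raise when literal HTML is passed directly.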
