
BeautifulSoup Only Retrieves 1st element

I'm currently trying to scrape a website.

Here is part of my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())

for a in soup:
    print (soup.find("td", {"data-stat" : "avg_age"}).text)

Basically, I have the whole source code inside "soup". However, when I look up elements such as "td", {"data-stat": "avg_age"}, I only get the result of the first row ({"data-row": "0"}) repeated as output:

29.1
29.1
29.1
29.1
29.1

So here are my questions:

-> Why is my code stuck on the first row when there is no preselection in my "soup" variable?

-> Is there a way to write a loop that checks the wanted elements for a different row each time? From "data-row": "0" to "data-row": "19", for instance.

Thanks for your support and have a great day!

It's stuck on the first row for a couple of reasons:

  1. You are using .find(), which only returns the first element it finds in the soup object.
  2. You never iterate through anything. The loop variable a is never used in the loop body, so soup.find("td", {"data-stat": "avg_age"}).text always returns the same thing. Look at your loop.
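
To make the difference concrete, here is a minimal sketch on a small, made-up HTML snippet (the two-row table is an assumption for illustration, not the real page). It shows that repeated .find() calls keep returning the first match, while .find_all() returns every match:

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table, invented for this illustration
html = """
<table>
  <tr data-row="0"><td data-stat="avg_age">29.1</td></tr>
  <tr data-row="1"><td data-stat="avg_age">26.8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# .find() returns only the first matching tag, no matter how often it is called
first = soup.find("td", {"data-stat": "avg_age"})
repeated = [soup.find("td", {"data-stat": "avg_age"}).text for _ in range(3)]

# .find_all() returns every matching tag, in document order
all_ages = [td.text for td in soup.find_all("td", {"data-stat": "avg_age"})]

print(first.text)
print(repeated)
print(all_ages)
```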

Essentially this would be the same logic as you have there:

for x in [1, 2, 3, 4]:
    print(1)

As it iterates through that list, it just prints 1 each time, so you get 1 four times in your console.

You need to get all the rows in soup with soup.find_all('tr'). Then, as you iterate over the rows, check whether the row contains a <td> tag with the attribute data-stat="avg_age"; only then do you want to .find() it and get the text.

import pandas as pd
from bs4 import BeautifulSoup
import requests


#Get the source code
r = requests.get('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard')
c = r.content
soup = BeautifulSoup(c, "html.parser")
print(soup.prettify())


rows = soup.find_all('tr')
for row in rows:
    # Look the cell up once per row; rows without such a <td> give None
    cell = row.find("td", {"data-stat" : "avg_age"})
    if cell is not None:
        print(cell.text)

Output:

29.1
26.8
29.4
26.8
27.8
26.2
27.2
25.8
26.0
26.9
24.8
25.5
26.9
25.9
27.6
24.5
26.3
28.8
25.6
26.7
26.1
28.2
26.9
26.6
26.0
27.7
28.0
26.8
29.9
25.5
27.1
27.1
27.1
27.2
27.0
27.0
25.1
25.8
25.9
25.8
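
The list above has 40 values, twice the 20 rows mentioned in the question, which suggests the page contains more than one table with avg_age cells. If only one table is wanted, one option is to find that table first and loop over its rows only. The following is a sketch on invented markup: the table ids "stats_for" and "stats_against" and the data-stat="team" cells are assumptions, so check the real page source for the actual names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables that both contain avg_age cells.
# The ids and the "team" cells are invented for this sketch.
html = """
<table id="stats_for">
  <tr><th>Team</th><th>Age</th></tr>
  <tr data-row="0"><th data-stat="team">Ajaccio</th><td data-stat="avg_age">29.1</td></tr>
  <tr data-row="1"><th data-stat="team">Angers</th><td data-stat="avg_age">26.8</td></tr>
</table>
<table id="stats_against">
  <tr data-row="0"><th data-stat="team">vs Ajaccio</th><td data-stat="avg_age">26.1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to one table, then walk its rows
table = soup.find("table", id="stats_for")
ages = {}
for row in table.find_all("tr"):
    cell = row.find("td", {"data-stat": "avg_age"})
    if cell is None:
        continue  # header row: no avg_age cell
    team = row.find("th", {"data-stat": "team"}).text
    ages[team] = float(cell.text)

print(ages)
```

Storing the values in a dict keyed by team, rather than printing them, keeps each age tied to its row.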

As a side note, pandas' .read_html() parses <table> tags for you (it relies on a parser such as lxml or bs4 under the hood). Use that. It's far easier.

import pandas as pd

df = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]

Output:

print(df)
           Équipe  # JC   Âge  Poss  MJ  ...  xG.1  xA.1  xG+xA  npxG.1  npxG+xA.1
0         Ajaccio    18  29.1  34.5   2  ...  0.59  0.14   0.73    0.20       0.34
1          Angers    18  26.8  55.0   2  ...  1.00  0.49   1.49    1.00       1.49
2         Auxerre    15  29.4  39.5   2  ...  0.43  0.43   0.85    0.43       0.85
3           Brest    18  26.8  42.5   2  ...  0.63  0.23   0.86    0.23       0.47
4   Clermont Foot    18  27.8  48.5   2  ...  0.17  0.07   0.24    0.17       0.24
5            Lens    16  26.2  63.0   2  ...  1.48  0.94   2.41    1.08       2.02
6           Lille    18  27.2  65.0   2  ...  2.02  1.65   3.66    2.02       3.66
7         Lorient    14  25.8  36.0   1  ...  0.37  0.26   0.63    0.37       0.63
8            Lyon    15  26.0  68.0   1  ...  1.52  0.49   2.00    0.73       1.22
9       Marseille    17  26.9  55.0   2  ...  1.10  0.89   1.99    1.10       1.99
10         Monaco    19  24.8  40.5   2  ...  2.75  1.21   3.96    2.36       3.57
11    Montpellier    19  25.5  47.5   2  ...  0.93  0.66   1.59    0.93       1.59
12         Nantes    16  26.9  40.5   2  ...  1.37  0.60   1.97    1.37       1.97
13           Nice    18  25.9  54.0   2  ...  0.49  0.40   0.88    0.49       0.88
14      Paris S-G    18  27.6  60.0   2  ...  3.05  1.76   4.81    2.27       4.03
15          Reims    18  24.5  43.0   2  ...  0.54  0.42   0.96    0.54       0.96
16         Rennes    17  26.3  65.0   2  ...  1.86  1.15   3.01    1.86       3.01
17     Strasbourg    18  28.8  49.5   2  ...  0.60  0.57   1.17    0.60       1.17
18       Toulouse    18  25.6  57.0   2  ...  0.58  0.58   1.15    0.58       1.15
19         Troyes    16  26.7  39.0   2  ...  0.91  0.23   1.14    0.52       0.75

[20 rows x 29 columns]

To print just the Âge column: print(df['Âge'])
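
To try the same idea without hitting the website, here is a small sketch that runs .read_html() on an inline HTML string. The two-row table is made up for illustration, and the example assumes an HTML parser that pandas can use (lxml, or bs4 together with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical table, invented for this sketch
html = """
<table>
  <tr><th>Équipe</th><th>Âge</th></tr>
  <tr><td>Ajaccio</td><td>29.1</td></tr>
  <tr><td>Angers</td><td>26.8</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html), header=0)
df = tables[0]

# Select one column by its header text and turn it into a plain list
ages = df["Âge"].tolist()
print(ages)
```

Because .read_html() returns one DataFrame per table, the [0] index selects the first table on the page.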
