
How to get more than one item with an identical HTML tag in BeautifulSoup

I am new to BeautifulSoup and not very familiar with HTML, but I am learning by giving myself small projects. For this one, I want to get the football match info from this site, in the form TeamA Date/time TeamB.

Here is my code:

import requests
from bs4 import BeautifulSoup

url = 'https://www.lequipe.fr/Football/ligue-1/page-calendrier-resultats/21e-journee'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

all_result = soup.find('div', class_="grid")

all_pairs = all_result.find_all('div', class_='grid__item')

for i, result in enumerate(all_pairs, start=1):
    team_name = result.find('span', class_='TeamScore__nameshort')
    calendrier = result.find('div', class_='TeamScore__data')

    print(i)
    print(team_name.text.strip())
    print(calendrier.text.strip())
    print()

My problems are:

  1. It only grabs the first team. For Nice vs. Rennes, for example, it only gets "Nice". The HTML tags for TeamA and TeamB look the same to me. I tried find_all, but it did not work either.

  2. For some reason, the dates and times it gets are wrong. It shows completely different dates and times, and I don't know why.

Thank you for your help.

You can use select, which returns a list of every matching element:

element = soup.select('div.grid__item')
firstElement = element[0].get_text()

Another example, getting an attribute from the following HTML:

<div class="nextpage">
    <a class="next-story" href="somepage.html">Some Page</a>
    <a class="next-story" href="somepage2.html">Some Page 2</a>
    <a class="next-story" href="somepage3.html">Some Page 3</a>
</div>

Code would be:

link = soup.select('div.nextpage a.next-story')
href = link[0].get('href')

Printing href then gives 'somepage.html'.
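Since select returns every match, the same idea extends to all three links. Here is a minimal, self-contained sketch that parses the snippet above from a string; the variable names hrefs and titles are only illustrative.

```python
from bs4 import BeautifulSoup

# The HTML snippet from the answer, inlined so the example runs offline
html = """
<div class="nextpage">
    <a class="next-story" href="somepage.html">Some Page</a>
    <a class="next-story" href="somepage2.html">Some Page 2</a>
    <a class="next-story" href="somepage3.html">Some Page 3</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of all elements matching the CSS selector
links = soup.select("div.nextpage a.next-story")

# Collect the href attribute and the visible text of every link
hrefs = [link.get("href") for link in links]
titles = [link.get_text(strip=True) for link in links]

print(hrefs)
print(titles)
```

Indexing with links[0] gives only the first element, so looping (or a list comprehension, as here) is the way to reach the others.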

find_all is indeed the function you are after.

Try this:

import requests
from bs4 import BeautifulSoup

url = 'https://www.lequipe.fr/Football/ligue-1/page-calendrier-resultats/21e-journee'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

all_result = soup.find('div', class_="grid")

all_pairs = all_result.find_all('div', class_='grid__item')

for i, result in enumerate(all_pairs, start=1):
    # find_all returns every matching span, not just the first one
    team_names = result.find_all('span', class_='TeamScore__nameshort')
    first_team_name = team_names[0]
    second_team_name = team_names[1]
    calendrier = result.find('div', class_='TeamScore__data')

    print(i)
    print('{} vs {}'.format(first_team_name.text.strip(), second_team_name.text.strip()))
    print(calendrier.text.strip())
    print()

which should output:

1
Nice vs Rennes
24 janv.
                    20h45

2
Marseille vs Angers
25 janv.
                    17h30

3
Montpellier vs Dijon
25 janv.
                    20h00

4
Monaco vs Strasbourg
25 janv.
                    20h00

5
Reims vs Metz
25 janv.
                    20h00

6
Brest vs Amiens
25 janv.
                    20h00

7
Saint-Étienne vs Nîmes
25 janv.
                    20h00

8
Lyon vs Toulouse
26 janv.
                    15h00

9
Nantes vs Bordeaux
26 janv.
                    17h00

10
Lille vs Paris-SG
26 janv.
                    21h00

find_all returns a list of all matching elements (whereas find returns only the first match), so you have to use an index to access the element you want, or alternatively iterate over the list.
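As a rough sketch of the same approach with a little more safety, the example below unpacks the two team names, skips items that do not contain exactly two, and collapses the whitespace in the date text. The markup is hypothetical: only the class names are taken from the question, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the class names come from the question
html = """
<div class="grid">
  <div class="grid__item">
    <span class="TeamScore__nameshort"> Nice </span>
    <div class="TeamScore__data">24 janv.
                    20h45</div>
    <span class="TeamScore__nameshort"> Rennes </span>
  </div>
  <div class="grid__item">
    <span class="TeamScore__nameshort">Marseille</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

matches = []
for item in soup.find_all("div", class_="grid__item"):
    spans = item.find_all("span", class_="TeamScore__nameshort")
    names = [span.get_text(strip=True) for span in spans]
    if len(names) != 2:
        # Not a complete pairing (e.g. an ad or an empty grid cell): skip it
        continue
    data = item.find("div", class_="TeamScore__data")
    # split() + join() collapses the newline and indentation into one space
    when = " ".join(data.get_text().split()) if data else ""
    home, away = names
    matches.append((home, when, away))

for home, when, away in matches:
    print(home, when, away)
```

Checking the length before indexing avoids an IndexError on grid items that hold something other than a match.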

As for the dates being different, I haven't looked into it, but one reason could be that when you visit the site in your browser, the dates are changed by JavaScript to be in your local timezone. When you fetch the page with requests, no JavaScript runs, so you would be getting the dates in the site's default timezone.
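One way to probe that guess is to check whether the date and time you see in the browser appear anywhere in the raw HTML. The helper below, text_in_static_html, is a hypothetical name for this illustration, and the sample string stands in for the downloaded page.

```python
from bs4 import BeautifulSoup


def text_in_static_html(html, expected):
    """Return True if `expected` appears in the page text once whitespace is collapsed."""
    soup = BeautifulSoup(html, "html.parser")
    visible = " ".join(soup.get_text().split())
    return expected in visible


# Stand-in for the raw HTML returned by requests (assumed shape, not the real page)
sample = '<div class="TeamScore__data">24 janv.\n          20h45</div>'

found_same = text_in_static_html(sample, "24 janv. 20h45")
found_shifted = text_in_static_html(sample, "24 janv. 21h45")
print(found_same, found_shifted)
```

With the real page you would pass page.text instead of sample. If the value shown in the browser is missing from the raw HTML, that would be consistent with it being rewritten or loaded by JavaScript, though it does not prove the timezone explanation.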
