
Scraping a web site

I have this:

 from bs4 import BeautifulSoup
 import requests

 page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
 soup = BeautifulSoup(page.content, 'html.parser')
 equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})

 aux = []
 for equipo in equipos:
     aux.append(equipo)

If I run print(aux[0]), I get this: Villarreal

Entrenador:
Javier Calleja
Jugadores:
  • 1 Sergio Asenjo
  • 13 Andrés Fernández
  • 25 Mariano Barbosa
  • ...

    My problem is that I want to extract this tag:

      <h2 class="cintillo">Villarreal</h2> 

    And the tag:

  • 1 Sergio Asenjo
    I want to put both into a database. How can I extract them? Thanks.

    You can extract the first <h2 class="cintillo"> element from equipo like this:

    h2 = str(equipo.find('h2', {'class':'cintillo'}))
    

    If you only want the text content (without any tags), use:

    h2 = equipo.find('h2', {'class':'cintillo'}).text
    

    And you can extract all the <span class="dorsal-jugador"> elements from equipo like this:

    jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
    

    Then append h2 and the player names from jugadores to a nested list, one [team name, [player names]] entry per team.

    Full code:

    from bs4 import BeautifulSoup
    import requests
    
    page = requests.get("https://www.marca.com/futbol/primera/equipos.html")
    soup = BeautifulSoup(page.content, 'html.parser')
    equipos = soup.findAll('li', attrs={'id':'nombreEquipo'})
    
    aux = []
    for equipo in equipos:
        h2 = equipo.find('h2', {'class':'cintillo'}).text
        jugadores = equipo.find_all('span', {'class':'dorsal-jugador'})
        aux.append([h2,[j.text for j in jugadores]])
    
    # format list for printing
    print('\n\n'.join(['--'+i[0]+'--\n' + '\n'.join(i[1])  for i in aux]))
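
    The one-line join above is fairly dense. As a rough sketch, the same formatting can be written as an explicit loop. The function name format_teams and the sample list are made up for illustration (the names are taken from the question and the output sample), and the result should match the one-liner for teams that have at least one player:

```python
def format_teams(aux):
    """Format [[team_name, [player, ...]], ...] as text blocks.

    format_teams is a hypothetical helper name, not part of any library.
    """
    blocks = []
    for name, players in aux:
        header = f"--{name}--"
        # One block per team: header line, then one player per line.
        blocks.append("\n".join([header, *players]))
    # Blank line between teams.
    return "\n\n".join(blocks)


# Hand-written sample data standing in for the scraped list.
sample = [
    ["Alavés", ["Fernando Pacheco", "Antonio Sivera"]],
    ["Villarreal", ["Sergio Asenjo"]],
]
text = format_teams(sample)
print(text)
```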
    

    Output sample:

    --Alavés--
    Fernando Pacheco
    Antonio Sivera
    Álex Domínguez
    Carlos Vigaray
    ...
    

    Demo: https://repl.it/@glhr/55550385

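    You could build a dictionary keyed by team name, where each value maps the coach (entrenador) to that team's list of players. Note that the 'lxml' parser used here requires the lxml package to be installed.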

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://www.marca.com/futbol/primera/equipos.html')
    soup = bs(r.content, 'lxml')
    
    teams = {}
    
    for team in soup.select('[id=nombreEquipo]'):
        team_name = team.select_one('.cintillo').text 
        entrenador = team.select_one('dd').text
        players = [item.text for item in team.select('.dorsal-jugador')]
        teams[team_name] = {entrenador : players}
    print(teams)
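
    Since the question also asks about putting the data into a database, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the teams dictionary has the shape built above ({team_name: {entrenador: [players]}}). The table names, column names, and the save_teams helper are my own choices, not something from the original code, and the sample data is typed in by hand from the question rather than scraped:

```python
import sqlite3


def save_teams(conn, teams):
    """Store {team_name: {coach: [players]}} in two tables.

    save_teams and the schema below are hypothetical; adapt them as needed.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS equipos ("
        "id INTEGER PRIMARY KEY, nombre TEXT UNIQUE, entrenador TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jugadores ("
        "id INTEGER PRIMARY KEY, "
        "equipo_id INTEGER REFERENCES equipos(id), nombre TEXT)"
    )
    for team_name, staff in teams.items():
        for entrenador, players in staff.items():
            cur = conn.execute(
                "INSERT INTO equipos (nombre, entrenador) VALUES (?, ?)",
                (team_name, entrenador),
            )
            # lastrowid is the id of the team row just inserted.
            conn.executemany(
                "INSERT INTO jugadores (equipo_id, nombre) VALUES (?, ?)",
                [(cur.lastrowid, player) for player in players],
            )
    conn.commit()


# Hand-written sample in the same shape as the scraped dictionary.
sample = {
    "Villarreal": {
        "Javier Calleja": ["Sergio Asenjo", "Andrés Fernández", "Mariano Barbosa"]
    }
}

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
save_teams(conn, sample)

rows = conn.execute(
    "SELECT e.nombre, j.nombre FROM jugadores j "
    "JOIN equipos e ON e.id = j.equipo_id ORDER BY j.id"
).fetchall()
print(rows)
```

    Passing values through ? placeholders rather than string formatting lets sqlite3 handle quoting, which matters for names containing apostrophes or accents.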
    

