
Trouble with scraping links with BeautifulSoup

Here's my script:

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.boursorama.com/bourse/opcvm/'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

n=2
m = 2

linklist = []

while n <= m:
      # Source path
      url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
      results = requests.get(url0, headers = headers)
      soup = BeautifulSoup(results.text, "html.parser")

      links = soup.find_all('div', class_ = "o-pack__item u-ellipsis", attrs={"href"})

      for i in links:
        print(i.get('href'))

      n = n+1

I obtained this:

None
None
None
None
None

I don't understand why. When I run this (I changed just one line of code, to print(i)):

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.boursorama.com/bourse/opcvm/'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

n=2
m = 3

linklist = []

while n <= m:
      # Source path
      url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
      results = requests.get(url0, headers = headers)
      soup = BeautifulSoup(results.text, "html.parser")

      links = soup.find_all('div', class_ = "o-pack__item u-ellipsis", attrs={"href"})

      for i in links:
        print(i)

      n = n+1

I obtained this:

<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" title="Allianz Global Government Bond W H EUR">Allianz Global Government Bond W H EUR</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001HEV7/" title="Allianz Global Credit SRI WT Hedged SEK">Allianz Global Credit SRI WT Hedged SEK</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001AM5S/" title="Barings Global High Yield Bond C AUD Acc">Barings Global High Yield Bond C AUD Acc</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P00012TCF/" title="GS NA Engy &amp; Engy Infras Eq Base Inc USD">GS NA Engy &amp; Engy Infras Eq Base Inc USD</a></div>
<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001ESKF/" title="DWS Invest Enh Cmdty Strat USD TFC">DWS Invest Enh Cmdty Strat USD TFC</a></div>

We can see the href attribute right there in the output. I searched on the internet and everything suggests i.get('href') or i['href'], but it always ends up with None .

What is happening?

You are selecting a <div>, and these do not have an href attribute; the <a> that carries the href is nested inside the <div>:

<div class="o-pack__item u-ellipsis"><a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" title="Allianz Global Government Bond W H EUR">Allianz Global Government Bond W H EUR</a></div>
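To see the difference, here is a minimal, self-contained sketch (using the first <div> from the output above): .get('href') on the <div> returns None, while the nested <a> carries the attribute:

```python
from bs4 import BeautifulSoup

# One of the <div> elements printed above
html = ('<div class="o-pack__item u-ellipsis">'
        '<a class="c-link c-link--animated" href="/bourse/opcvm/cours/0P0001CBIB/" '
        'title="Allianz Global Government Bond W H EUR">'
        'Allianz Global Government Bond W H EUR</a></div>')

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="o-pack__item u-ellipsis")

print(div.get("href"))    # None - the <div> itself has no href attribute
print(div.a.get("href"))  # /bourse/opcvm/cours/0P0001CBIB/ - the nested <a> has it
```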

How to fix

If you want to print the value of the href , you first have to select the <a> element:

print(i.a.get('href'))

An alternative would be to select your targets more specifically, e.g. with CSS selectors:

links = soup.select('div.o-pack__item.u-ellipsis a[href]')

for i in links:
    print(i.get('href'))
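Note that the scraped href values are site-relative paths. If you need absolute URLs (for example, to request each fund page afterwards), a simple sketch is to join them with the site root using urllib.parse.urljoin from the standard library:

```python
from urllib.parse import urljoin

base = "https://www.boursorama.com/bourse/opcvm/"
href = "/bourse/opcvm/cours/0P0001CBIB/"  # a relative href as scraped above

print(urljoin(base, href))  # https://www.boursorama.com/bourse/opcvm/cours/0P0001CBIB/
```

Because the href starts with "/", urljoin keeps only the scheme and host from the base URL and replaces the path.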

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.boursorama.com/bourse/opcvm/'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

n = 2
m = 2

linklist = []

while n <= m:
    # Source path
    url = f"https://www.boursorama.com/bourse/opcvm/page-{n}?sortAsc=1"
    results = requests.get(url, headers=headers)  # request the paginated url, not url0
    soup = BeautifulSoup(results.text, "html.parser")

    links = soup.find_all('div', class_="o-pack__item u-ellipsis")

    for i in links:
        print(i.a.get('href'))

    n = n + 1
