
Iterating over urls fails to find correct href in Python using BeautifulSoup

I am scraping a website. My code first loops through the 52 listing pages and collects the URL of each article.

It then iterates through those URLs and tries to get the link to the English translation of each article. On the Mongolian page, there is a section "Орчуулга" ("Translation") at the top right with "English" underneath; that is the link to the English translation.

However, my code fails to grab the link to the English translation and returns the wrong URL. Below is the output for the first article.

1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/mn/sitemap-mn/'}

The expected output for the first article is:

1
{'https://mn.usembassy.gov/mn/2020-naadam-mn/': 'https://mn.usembassy.gov/2020-naadam/'}

Below is my code

import requests
from bs4 import BeautifulSoup


url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'

urls = []
for page in range(1, 53):
    print(str(page) + "/52")
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    for h in soup.find_all('h2'):
        a = h.find('a')
        urls.append(a.attrs['href'])

print(urls)

i = 0
bilingual_dict = {}
for url in urls:
    i += 1
    print(i)
    soup = BeautifulSoup(requests.get(url.format(page=url)).content, 'html.parser')
    for div in soup.find_all('div', class_='translations_sidebar'):
        for ul in soup.find_all('ul'):
            for li in ul.find_all('li'):
                a = li.find('a')
    bilingual_dict[url] = a['href']
    print(bilingual_dict)
print(bilingual_dict)

This script prints the link to the English translation:

import requests
from bs4 import BeautifulSoup


url = 'https://mn.usembassy.gov/mn/2020-naadam-mn/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

link = soup.select_one('a[hreflang="en"]')
print(link['href'])

Prints:

https://mn.usembassy.gov/2020-naadam/
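
As a rough illustration of why the nested loops in the question end up with the sitemap link, here is a minimal offline sketch. The HTML fragment is a made-up, simplified stand-in for the real page (the actual markup is an assumption); it only reproduces the pattern of a translations sidebar followed by another list later in the document. Because the inner loop calls soup.find_all('ul') rather than div.find_all('ul'), it walks every list on the page, and a is left pointing at the last link found.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page: a translations sidebar plus a footer list.
# The real page's markup is assumed, not copied.
html = """
<div class="translations_sidebar">
  <ul><li><a hreflang="en" href="https://mn.usembassy.gov/2020-naadam/">English</a></li></ul>
</div>
<ul class="footer">
  <li><a href="https://mn.usembassy.gov/mn/sitemap-mn/">Sitemap</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Approach from the question: the inner find_all is called on soup, not div,
# so every <ul> in the document is visited and the last <a> wins.
for div in soup.find_all('div', class_='translations_sidebar'):
    for ul in soup.find_all('ul'):
        for li in ul.find_all('li'):
            a = li.find('a')
last_link = a['href']

# Selector approach: match the anchor by its hreflang attribute directly.
english_link = soup.select_one('a[hreflang="en"]')['href']

print(last_link)
print(english_link)
```

On this sample, the loop yields the sitemap URL while the selector yields the English article, which matches the behaviour described in the question.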

Complete code (where a page has no link to an English translation, the value is set to None):

import requests
from bs4 import BeautifulSoup
from pprint import pprint

url = 'https://mn.usembassy.gov/mn/news-events-mn/page/{page}/'

urls = []
for page in range(1, 53):
    print('Page {}...'.format(page))
    soup = BeautifulSoup(requests.get(url.format(page=page)).content, 'html.parser')
    for h in soup.find_all('h2'):
        a = h.find('a')
        urls.append(a.attrs['href'])

pprint(urls)

bilingual_dict = {}
for url in urls:
    print(url)
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    link = soup.select_one('a[hreflang="en"]')
    bilingual_dict[url] = link['href'] if link else None

pprint(bilingual_dict)
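
If it helps, the lookup can also be pulled into a small function so that it can be checked without hitting the network. This is only a sketch: find_english_link is a hypothetical helper name, and both HTML snippets are invented test inputs rather than real pages from the site.

```python
from bs4 import BeautifulSoup


def find_english_link(html):
    """Return the href of the first <a hreflang="en"> in html, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    # Requiring [href] skips anchors that carry hreflang but no target.
    link = soup.select_one('a[hreflang="en"][href]')
    return link['href'] if link else None


# Invented sample inputs: one page with a translation link, one without.
with_translation = '<ul><li><a hreflang="en" href="https://mn.usembassy.gov/2020-naadam/">English</a></li></ul>'
without_translation = '<ul><li><a href="https://mn.usembassy.gov/mn/sitemap-mn/">Sitemap</a></li></ul>'

found = find_english_link(with_translation)
missing = find_english_link(without_translation)
print(found)
print(missing)
```

In the loop above, the same function could be called as find_english_link(requests.get(url).content) in place of the two lines that build soup and link.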
