How do I parse addresses (links) from a page using BeautifulSoup?

Question

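I want to implement a simple search engine. In the first stage I collect data from the pages, and then I run the search. However, when I try to get the link to each news item from the page, I get an error. The error reads: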

ConnectionError: HTTPConnectionPool(host='www.zrg74.ruhttp', port=80): Max retries exceeded with url: //zrg74.ru/sport/item/26982-dorogoj-v-bolshoj-hokkej-v-zlatouste-namereny-sozdat-otdelnuju-sekciju-dlja-podgotovki-vratarej.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000174B78FCBC8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Here is a code snippet. It has a function get_page_text() that fetches the page source as is:

...
response = requests.get(url, headers=headers, allow_redirects=True)
if response.status_code == 200:
    page_text = response.text
    return page_text
...

The URL-handling code is as follows:

soup = BeautifulSoup(page_text)
posts_list = soup.find_all('div', {'class': 'jeg_post_excerpt'}) 
for p in posts_list:
    lnk = p.find('a').attrs['href']
    title = re.sub('[^А-ЯЁа-яё0-9\s]', ' ', p.text)
    title = re.sub('\s\s+', ' ', title)
    page_url = 'http://www.zrg74.ru' + lnk
    clean_path = '/'.join([d for d in page_url.split('/')[2:] if len(d) > 0])

    page_text = get_page_text(page_url, USER_AGENT)
    if page_text is None:
        continue
    dir_path = 'data/raw_pages/' + '/'.join(clean_path.split('/')[:-1])
    makedirs(dir_path, exist_ok=True) 
    with open(dir_path + '/' + clean_path.split('/')[-1] + '.html', 'w', encoding='utf-8') as f:
        f.write(page_text)

The result I need at this stage looks like this:

{'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
 'http://zrg74.ru/obshhestvo/item/26954-vzjalis-vmeste-dve-semi-iz-zlatousta-prinjali-uchastie-v-oblastnom-festivale-dlja-zameshhajushhih-semej.html'}

Answer 1

I have run into this before. What usually works for me is disconnecting my WiFi router, waiting a few seconds, and then reconnecting.

Try disabling your internet connection for a while and then enabling it again.
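
Separately from connectivity, the error message itself hints at another possible cause. The host is reported as 'www.zrg74.ruhttp' and the path as '//zrg74.ru/sport/...', which is what you would get if the href on the page is already an absolute URL and 'http://www.zrg74.ru' is prepended to it. Below is a minimal sketch under that assumption: it resolves each href with urllib.parse.urljoin instead of string concatenation and collects the results in a set. The helper name absolute_links and the second sample href are made up for illustration; in the question's loop the hrefs would be the values read from p.find('a').attrs['href'].

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.zrg74.ru'  # site root used in the question


def absolute_links(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url and return the unique results."""
    # urljoin leaves an already-absolute href unchanged and only
    # prepends the base to relative ones such as '/sport/item/1.html'.
    return {urljoin(base_url, href) for href in hrefs}


# Sample href values; the second one is hypothetical and only
# illustrates how a relative link would be resolved.
sample_hrefs = [
    'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
    '/sport/item/example.html',
]

for link in sorted(absolute_links(sample_hrefs)):
    print(link)
```

The returned set has the shape asked for in the question, and each element could then be passed to get_page_text(). If the links on the page turn out to be relative after all, this sketch still yields valid URLs, so it should be safe to try either way.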

Solution 1
0 2021-11-13 13:28:29
