
How to parse addresses from a page using BeautifulSoup?

I want to implement a simple search engine, and at the first stage I collect the data from the pages that will then be searched. However, when I try to extract the links to each news item from the page, I get an error. The error looks like this:

ConnectionError: HTTPConnectionPool(host='www.zrg74.ruhttp', port=80): Max retries exceeded with url: //zrg74.ru/sport/item/26982-dorogoj-v-bolshoj-hokkej-v-zlatouste-namereny-sozdat-otdelnuju-sekciju-dlja-podgotovki-vratarej.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000174B78FCBC8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Here is a code snippet. It contains a function get_page_text(), which fetches the page source as-is:

...
response = requests.get(url, headers=headers, allow_redirects=True)
if response.status_code == 200:
    page_text = response.text
    return page_text
...
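
For context, a self-contained version of this helper might look like the sketch below. The headers construction and the implicit None return on non-200 responses are assumptions, inferred from how get_page_text(page_url, USER_AGENT) is called later.

import requests

def get_page_text(url, user_agent):
    # Assumed headers layout; only the user agent string is passed in.
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers, allow_redirects=True)
    if response.status_code == 200:
        return response.text
    # Falls through to an implicit None, which the calling code checks for.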

The URL processing code is as follows:

from os import makedirs
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')
posts_list = soup.find_all('div', {'class': 'jeg_post_excerpt'})
for p in posts_list:
    lnk = p.find('a').attrs['href']
    # Keep only Cyrillic letters, digits and whitespace in the title.
    title = re.sub(r'[^А-ЯЁа-яё0-9\s]', ' ', p.text)
    title = re.sub(r'\s\s+', ' ', title)
    page_url = 'http://www.zrg74.ru' + lnk
    clean_path = '/'.join([d for d in page_url.split('/')[2:] if len(d) > 0])

    page_text = get_page_text(page_url, USER_AGENT)
    if page_text is None:
        continue
    dir_path = 'data/raw_pages/' + '/'.join(clean_path.split('/')[:-1])
    makedirs(dir_path, exist_ok=True)
    with open(dir_path + '/' + clean_path.split('/')[-1] + '.html', 'w', encoding='utf-8') as f:
        f.write(page_text)
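
As an aside, the host in the traceback above is 'www.zrg74.ruhttp', which suggests the href values extracted from this page are already absolute URLs, so prepending 'http://www.zrg74.ru' doubles the scheme. A minimal, defensive way to join such links is urllib.parse.urljoin, which leaves absolute hrefs untouched and resolves relative ones against the base (a sketch, not the original code):

from urllib.parse import urljoin

BASE_URL = 'http://www.zrg74.ru'  # same prefix as used above

def absolutize(href):
    # urljoin returns href unchanged when it already carries a scheme,
    # and resolves it against BASE_URL otherwise.
    return urljoin(BASE_URL, href)

For example, absolutize('/sport/item/x.html') yields 'http://www.zrg74.ru/sport/item/x.html', while an already absolute link passes through unchanged.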

The result I need at this stage looks like this:

{'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
 'http://zrg74.ru/obshhestvo/item/26954-vzjalis-vmeste-dve-semi-iz-zlatousta-prinjali-uchastie-v-oblastnom-festivale-dlja-zameshhajushhih-semej.html'}
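
For what it's worth, a set like this can be collected directly from the same jeg_post_excerpt containers; a minimal sketch, assuming page_text already holds the listing page's HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')
links = set()
for post in soup.find_all('div', {'class': 'jeg_post_excerpt'}):
    a = post.find('a')
    if a is not None and a.get('href'):
        links.add(a['href'])
print(links)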

I have run into this before. Usually what worked for me was disconnecting my WiFi router, waiting a few seconds, and then reconnecting.

Try disabling your internet connection for a while and then enabling it again.
