
How to parse addresses from a page using BeautifulSoup?

I want to implement a simple search engine, and at the first stage I collect the data from the pages that will then be searched. However, when I try to extract the links to each news item from the page, I get an error. The error looks like this:

ConnectionError: HTTPConnectionPool(host='www.zrg74.ruhttp', port=80): Max retries exceeded with url: //zrg74.ru/sport/item/26982-dorogoj-v-bolshoj-hokkej-v-zlatouste-namereny-sozdat-otdelnuju-sekciju-dlja-podgotovki-vratarej.html (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000174B78FCBC8>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

Here is a code snippet. It contains a function get_page_text(), which fetches the page source as-is:

...
response = requests.get(url, headers=headers, allow_redirects=True)
if response.status_code == 200:
    page_text = response.text
    return page_text
...
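
For context, a self-contained version of this helper might look like the sketch below. The headers construction and the implicit None return on non-200 responses are assumptions, inferred from how get_page_text(page_url, USER_AGENT) is called later.

import requests

def get_page_text(url, user_agent):
    # Assumed headers layout; only the user agent string is passed in.
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers, allow_redirects=True)
    if response.status_code == 200:
        return response.text
    # Falls through to an implicit None, which the calling code checks for.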

The URL processing code is as follows:

from os import makedirs
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')
posts_list = soup.find_all('div', {'class': 'jeg_post_excerpt'})
for p in posts_list:
    lnk = p.find('a').attrs['href']
    # Keep only Cyrillic letters, digits and whitespace in the title.
    title = re.sub(r'[^А-ЯЁа-яё0-9\s]', ' ', p.text)
    title = re.sub(r'\s\s+', ' ', title)
    page_url = 'http://www.zrg74.ru' + lnk
    clean_path = '/'.join([d for d in page_url.split('/')[2:] if len(d) > 0])

    page_text = get_page_text(page_url, USER_AGENT)
    if page_text is None:
        continue
    dir_path = 'data/raw_pages/' + '/'.join(clean_path.split('/')[:-1])
    makedirs(dir_path, exist_ok=True)
    with open(dir_path + '/' + clean_path.split('/')[-1] + '.html', 'w', encoding='utf-8') as f:
        f.write(page_text)
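
As an aside, the host in the traceback above is 'www.zrg74.ruhttp', which suggests the href values extracted from this page are already absolute URLs, so prepending 'http://www.zrg74.ru' doubles the scheme. A minimal, defensive way to join such links is urllib.parse.urljoin, which leaves absolute hrefs untouched and resolves relative ones against the base (a sketch, not the original code):

from urllib.parse import urljoin

BASE_URL = 'http://www.zrg74.ru'  # same prefix as used above

def absolutize(href):
    # urljoin returns href unchanged when it already carries a scheme,
    # and resolves it against BASE_URL otherwise.
    return urljoin(BASE_URL, href)

For example, absolutize('/sport/item/x.html') yields 'http://www.zrg74.ru/sport/item/x.html', while an already absolute link passes through unchanged.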

The result I need at this stage looks like this:

{'http://zrg74.ru/obshhestvo/item/26959-rabota-ne-dlja-galochki-zlatoustovec-povedal-o-njuansah-raboty-perepischika.html',
 'http://zrg74.ru/obshhestvo/item/26954-vzjalis-vmeste-dve-semi-iz-zlatousta-prinjali-uchastie-v-oblastnom-festivale-dlja-zameshhajushhih-semej.html'}
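
For what it's worth, a set like this can be collected directly from the same jeg_post_excerpt containers; a minimal sketch, assuming page_text already holds the listing page's HTML:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_text, 'html.parser')
links = set()
for post in soup.find_all('div', {'class': 'jeg_post_excerpt'}):
    a = post.find('a')
    if a is not None and a.get('href'):
        links.add(a['href'])
print(links)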

I have run into this before. Usually what worked for me was disconnecting my WiFi router, waiting a few seconds, and then reconnecting.

Try disabling your internet connection for a while and then enabling it again.
