Scraping a page with BeautifulSoup yields strange results (multiple </p> at the end). Why?

Question

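I am trying to scrape a page with BeautifulSoup. I want to keep the markup so that I can later store the content in an .xml file, split into paragraphs, headings, and so on. Unfortunately, the result surprised me. It looks like this: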

Why are there so many </p> tags at the end? I am used to a structure that looks like this:

<p>some paragraph... </p>
<p>next paragraph... </p>

Not like this:

some paragraph... <p>
next paragraph... <p></p>
</p>

When I inspect the HTML structure in Chrome, everything looks fine:

Why does this happen? Here is my code:

import os
import requests
from bs4 import BeautifulSoup

# Credentials are read from environment variables.
payload = {
    'username': os.environ['POLITYKA_USERNAME'],
    'password': os.environ['POLITYKA_PASSWORD'],
    'login_success': 'http://archiwum.polityka.pl',
    'login_error': 'https://archiwum.polityka.pl/art/grypa-namniestraszna,378836.html'
}

login_url = 'https://www.polityka.pl/sso/login'
base_url = 'http://archiwum.polityka.pl'
example_url = 'https://archiwum.polityka.pl/art/sciganie-wnbsp;organach,378798.html'

with requests.Session() as session:
    session.headers.update({'User-Agent': 'Mozilla/5.0'})
    # Log in first so the archive article is accessible.
    post = session.post(login_url, data=payload)
    response = session.get(example_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Walk down to the box that holds the article text.
    box = soup.find('div', {'id': 'container'}) \
              .find('div', {'class': 'middle'}) \
              .find('div', {'class': 'right'}) \
              .find('div', {'class': 'box'})
    content = box.find('p', {'class': 'box_text'}).find_next_sibling()
    print(content)

Answer 1

Quoted from the bs4 documentation:

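Another alternative is the pure-Python html5lib parser, which parses HTML the way a web browser does. Depending on your setup, you might install html5lib with one of these commands: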

$ apt-get install python-html5lib

$ easy_install html5lib

$ pip install html5lib
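
To see why the parser choice matters, here is a minimal sketch that parses the same fragment with both parsers. The fragment is made up for illustration and is not the page from the question; it assumes the real page leaves its <p> tags unclosed, which would explain the nested output shown in the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical fragment with unclosed <p> tags (not the real page).
html = "<div><p>some paragraph...<p>next paragraph...</div>"

# html.parser does not close the first <p> when the second one opens,
# so the second paragraph ends up nested inside the first.
lenient = BeautifulSoup(html, "html.parser")
outer = lenient.div.find_all("p", recursive=False)
print(len(outer))                    # direct <p> children of the <div>
print(len(outer[0].find_all("p")))   # <p> tags nested inside the first one
print(lenient)

# html5lib follows the browser parsing algorithm and closes each <p>
# implicitly, so the paragraphs become siblings.
try:
    browser_like = BeautifulSoup(html, "html5lib")
    print(len(browser_like.div.find_all("p", recursive=False)))
except FeatureNotFound:
    print("html5lib is not installed")
```

With html.parser the closing tags pile up at the end of the block, which matches the symptom in the question. Chrome shows a clean structure because it applies the same implicit-close rules that html5lib implements.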

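That said, you still need to use find_next_siblings(), the plural form.

You will also want to pass an argument to find_next_siblings(), such as a tag name, so that it returns only the siblings you are interested in.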

Example:

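import requests
from bs4 import BeautifulSoup

url = 'https://archiwum.polityka.pl/art/sciganiewnbsp;organach,378798.html'
# BeautifulSoup parses markup, not URLs, so fetch the page first.
# (For this site, reuse the logged-in session from the question instead.)
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).content
soup = BeautifulSoup(html, 'html5lib')
find_location = soup.find('div', {'id': 'container'}) \
                    .find('div', {'class': 'middle'}) \
                    .find('div', {'class': 'right'}) \
                    .find('div', {'class': 'box'}) \
                    .find('p', {'class': 'box_text'}) \
                    .find_next_siblings('p')

# find_next_siblings() returns a list of tags, so iterate over it.
for content in find_location:
    print(content)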

In short: change 'html.parser' to 'html5lib', change find_next_sibling() to find_next_siblings('p'), and then iterate over the resulting list.

Better still, add a conditional to skip empty tags:

for content in find_location:
    # Skip paragraphs that contain no text (or only whitespace).
    if content.get_text(strip=True):
        print(content)
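
The difference between the singular and plural sibling methods, together with the empty-tag filter, can be tried offline on a small fragment. This is only a sketch: the markup below is a hypothetical, well-formed stand-in for the article box, so the built-in html.parser is enough here.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed markup standing in for the article box.
html = """
<div class="box">
  <p class="box_text">lead</p>
  <p>some paragraph...</p>
  <p></p>
  <p>next paragraph...</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
lead = soup.find("p", {"class": "box_text"})

# Singular: only the first following <p> sibling.
first = lead.find_next_sibling("p")

# Plural: every following <p> sibling, returned as a list of tags.
all_following = lead.find_next_siblings("p")

# Keep only paragraphs that contain non-whitespace text.
non_empty = [p for p in all_following if p.get_text(strip=True)]

print(first.get_text())
print(len(all_following), len(non_empty))
for p in non_empty:
    print(p)
```

Passing 'p' also skips the whitespace text nodes that sit between the tags, so only paragraph elements are returned.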

Try it and let me know whether it works.
