
Weird characters when webscraping using Beautiful Soup

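I am trying to return the html from an eshop website as a string, but I am getting some weird characters back. When I look at the web console, I don't see these characters in the html. I also don't see them when the html is displayed in a pandas DataFrame in a Jupyter notebook. The link is https://www.powerhousefilms.co.uk/collections/limited-editions/products/immaculate-conception-le. I have used the same method on another product on this site, but I only see these characters on this page. Other pages on the site do not have this problem.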

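import requests
from bs4 import BeautifulSoup

url = 'https://www.powerhousefilms.co.uk/collections/limited-editions/products/immaculate-conception-le'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
elem = soup.find_all('div', {'class': 'product-single__description rte'})
s = str(elem[0])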

s then looks like:

    <div class="product-single__description rte">
<div class="product_description">
<div>
<div>
<div><span style="color: #000000;"><em>THIS ITEM IS AVAILABLE TO PRE-ORDER. PLEASE NOTE THAT YOUR PAYMENT WILL BE TAKEN IMMEDIATELY, AND THAT THE ITEM WILL BE DISPATCHED JUST BEFORE THE LISTED RELEASE DATE. </em></span></div>
<div><span style="color: #000000;"><em>Â </em></span></div>
<div><span style="color: #000000;"><em>SHOULD YOU ORDER ANY OF THEÂ ALREADY RELEASED ITEMS FROM OURÂ CATALOGUE AT THE SAME TIME AS THIS PRE-ORDER ITEM, PLEASE NOTE THATÂ YOUR PURCHASES WILL ALL BE SHIPPED TOGETHER WHENÂ THIS PRE-ORDERÂ ITEM BECOMES AVAILABLE.</em></span></div>
</div>
<div><span style="color: #38761d;">Â </span></div>
<div>
<strong>(Jamil Dehlavi, 1992)</strong><br/><em>Release date: 25 March 2019</em><br/>Limited Blu-ray Edition (World Blu-ray premiere)<br/><br/>A Western couple (played by Melissa Leo and James Wilby) working in Pakistan visit an unconventional holy shrine to harness its spiritual powers to help them conceive a child. They are lavished with the attentions of the shrine’s leader (an exceptional performance from Zia Mohyeddin – <em>Lawrence of Arabia</em>, <em>Khartoum</em>) and her followers, but their methods and motives are not all that they seem, and the couple’s lives are plunged into darkness.<br/><br/>This ravishing, unsettling film from director Jamil Dehlavi (<em>The Blood of Hussain</em>, <em>Born of Fire</em>) is a deeply personal work which raises questions of cultural and sexual identity, religious fanaticism and the abuses of power. The brand-new 2K restoration from the original negative was supervised and approved by Dehlavi and cinematographer Nic Knowland.<br/><br/><strong>INDICATOR LIMITED EDITION BLU-RAY SPECIAL FEATURES:</strong>
</div>
<div>
<ul>
<li>New 2K restoration by Powerhouse Films from the original negative, supervised and approved by director Jamil Dehlavi and cinematographer Nic Knowland</li>
<li>
<div>Original stereo audio</div>
</li>
<li>
<div>Alternative original mono mix</div>

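I tried specifying the encoding, but I still get the weird characters. Of the 50+ products on this site, only a few have this problem.

Is there something wrong with the way I am scraping, or is there an easy way to clean this up?

Thanks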
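
For readers wondering where a stray Â can come from: one common cause is a UTF-8 non-breaking space (the bytes C2 A0, often produced by `&nbsp;`) being decoded with a single-byte codec such as Latin-1. Whether that is what happened on this particular page is an assumption. The sketch below reproduces the effect on a sample string and shows a hypothetical `repair` helper (not part of the original post) that reverses it when it applies.

```python
# Sample text containing a non-breaking space (U+00A0); the string is illustrative only.
original = "THE\u00a0ALREADY RELEASED"

# Encoding as UTF-8 turns the non-breaking space into the two bytes C2 A0.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with the wrong codec maps C2 to 'Â' and A0 to a non-breaking space.
garbled = utf8_bytes.decode("latin-1")


def repair(text):
    """Reverse a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repr(garbled))
print(repair(garbled) == original)
```

The helper only handles this one kind of mix-up; text that was decoded correctly is returned as-is because the round trip fails and the exception is caught.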

You can use this code to download the visible content of a web page. Just put the URL in page_url.

from bs4 import BeautifulSoup
from bs4.element import Comment
from urllib.request import urlopen


page_url = "URL Here"


def tag_visible(element):
    # Skip text nodes that browsers do not render.
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    # Skip HTML comments.
    if isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)


def Extract_Text(html_bytes, url):
    # Write the URL followed by the page's visible text to DOC.txt as UTF-8.
    text_data = text_from_html(html_bytes)
    with open("DOC.txt", "w", encoding="utf-8") as f:
        f.write(str(url) + "\n" + text_data)


response = urlopen(page_url)
# Only process HTML responses; the raw bytes go to BeautifulSoup, which detects the encoding.
if 'text/html' in response.getheader('Content-Type', ''):
    html_bytes = response.read()
    Extract_Text(html_bytes, page_url)

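So it turns out that Excel was the cause of this. When I saved to CSV and opened the file in Excel, I got the weird results.

To prevent this I used df.to_csv('df.csv', index=False, encoding = 'utf-8-sig'). Specifying the encoding got rid of the weird characters.

The related question about Python writing weird Unicode to CSV has information on encodings and on how Excel interprets CSV files.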
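
As a rough illustration of why 'utf-8-sig' helps, the sketch below uses only the standard library (no pandas) and a made-up sample string. It shows that 'utf-8-sig' is plain UTF-8 with a three-byte byte order mark (BOM) in front, and what the BOM-less bytes look like to a reader that assumes cp1252. That Excel falls back to a legacy code page such as cp1252 when the BOM is missing is an assumption that depends on the Excel version and system locale.

```python
import codecs

# Sample cell text containing a non-breaking space (U+00A0); illustrative only.
text = "THE\u00a0ALREADY RELEASED"

plain = text.encode("utf-8")         # what encoding='utf-8' would write
with_bom = text.encode("utf-8-sig")  # what encoding='utf-8-sig' would write

# 'utf-8-sig' output is the UTF-8 BOM (EF BB BF) followed by the same UTF-8 bytes.
print(with_bom.startswith(codecs.BOM_UTF8))
print(with_bom[len(codecs.BOM_UTF8):] == plain)

# A reader that assumes cp1252 turns the bytes C2 A0 into 'Â' plus a non-breaking space.
misread = plain.decode("cp1252")
print(repr(misread))

# Decoding with 'utf-8-sig' strips the BOM and recovers the original text.
print(with_bom.decode("utf-8-sig") == text)
```

The data itself is identical in both files; the BOM is only a hint that tells the reading program to treat the bytes as UTF-8.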

