Python 3 Beautiful Soup Web Scraping

I'm currently working with BeautifulSoup, and I seem to be having some issues related to encoding.

Here is my code:

import requests
from bs4 import BeautifulSoup
req = requests.get('https://pythonprogramming.net/parsememcparseface/')
soup = BeautifulSoup(req.content.decode('utf-8','ignore'))
print(soup.find_all('p'))

Here is my error:

 UnicodeEncodeError: 'ascii' codec can't encode character '\u1d90' in position 602: ordinal not in range(128)

Any help would be appreciated.

Please pass a parser name, such as "html5lib" or "html.parser", as the second argument to BeautifulSoup:

#!/usr/bin/python
# -*- coding: utf-8 -*-

...

# Python 3.6.0
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html5lib")

# Python 2.7.12
soup = BeautifulSoup(req.content.decode('utf-8','ignore'), "html.parser")

I tried to reproduce the issue you are facing but was not able to.

Here is what I tried:

>>> import requests
>>> from bs4 import BeautifulSoup

>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/')

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore'))


Warning (from warnings module):
  File "C:\Python34\lib\site-packages\bs4\__init__.py", line 166
    markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

>>> soup = BeautifulSoup(req.content.decode('utf-8','ignore'), 'html.parser')
>>> print(soup.find_all('p'))
[<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>, <p>The following table gives some general information for the following <code>programming languages</code>:</p>, <p>I think it's clear that, on a scale of 1-10, python is:</p>, <p>Javascript (dynamic data) test:</p>, <p class="jstest" id="yesnojs">y u bad tho?</p>, <p>Whᶐt hαppéns now¿</p>, <p><a href="/sitemap.xml" target="blank"><strong>sitemap</strong></a></p>, <p>
<a class="btn btn-flat white modal-close" href="#">Cancel</a>  
                        <a class="waves-effect waves-blue blue btn btn-flat modal-action modal-close" href="#">Login</a>
</p>, <p>
<a class="btn btn-flat white modal-close" href="#">Cancel</a>  
                                <button class="btn" type="submit" value="Register">Sign Up</button>
</p>, <p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>, <p class="grey-text right" style="padding-right:10px">Programming is a superpower.</p>]
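Whether the error shows up most likely depends on the encoding of the stream that print writes to, which would explain why it does not reproduce on every machine. The following is a minimal sketch, assuming the asker's console uses ASCII (as the 'ascii' codec in the error message suggests). The sample string, the in-memory `ascii_console` stand-in, and the `printable` helper are illustrative names, not part of the original code.

```python
import io

# Sample text containing U+1D90, the character named in the error message.
text = "<p>Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf</p>"

# Stand-in for a console whose encoding is ASCII (an assumption about the
# asker's environment; a real console would be sys.stdout).
ascii_console = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

# Printing the text to an ASCII stream raises the same kind of error.
try:
    print(text, file=ascii_console)
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

def printable(value, encoding="ascii"):
    """Return text the given encoding can represent, escaping everything else."""
    return str(value).encode(encoding, "backslashreplace").decode(encoding)

# The escaped version can be printed on any console without raising.
safe_text = printable(text)
print(safe_text)
```

With the real page, the same helper could wrap the result of `soup.find_all('p')` before printing. Escaping keeps the information about which characters were present, whereas decoding with 'ignore' silently drops them.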

I can duplicate your error message and eliminate the troublesome characters.

First, this code simply requests the page and attempts to save it. The attempt fails with the message you have seen. I then create a copy of the page by converting it to bytes, ignoring the characters that cannot be encoded, and converting it back to characters. Now the page can be saved successfully.

I then make soup from the copy and find the paragraph tags.

>>> from bs4 import BeautifulSoup
>>> import requests
>>> req = requests.get('https://pythonprogramming.net/parsememcparseface/').text
>>> open('c:/scratch/temp.htm', 'w').write(req)
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u1d90' in position 6702: character maps to <undefined>
>>> modReq = str(req.encode('utf-8', 'ignore'))
>>> open('c:/scratch/temp.htm', 'w').write(modReq)
12556
>>> soup = BeautifulSoup(modReq, 'lxml')
>>> paras = soup.findAll('p')
>>> len(paras)
12
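Note that in Python 3, `str()` applied to a bytes object returns its `b'...'` representation, so the file saved in the session above holds escaped text rather than the original markup. As a hedged alternative, here is a small sketch of two ways to get the effect described: dropping the non-ASCII characters outright, or keeping them by writing the file with an explicit UTF-8 encoding. The sample string, the temporary path, and the helper names `strip_non_ascii` and `save_as_utf8` are assumptions made for illustration.

```python
import os
import tempfile

# Sample text standing in for the downloaded page; it contains U+1D90.
page_text = "<p>Wh\u1d90t h\u03b1pp\u00e9ns now\u00bf</p>"

def strip_non_ascii(text):
    """Drop every character that ASCII cannot represent."""
    return text.encode("ascii", "ignore").decode("ascii")

def save_as_utf8(text, path):
    """Write text with an explicit UTF-8 encoding instead of the platform default."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)

# Option 1: lose the troublesome characters but keep plain text.
ascii_only = strip_non_ascii(page_text)
print(ascii_only)

# Option 2: keep every character by choosing an encoding that can hold them.
path = os.path.join(tempfile.mkdtemp(), "temp.htm")
save_as_utf8(page_text, path)
with open(path, encoding="utf-8") as handle:
    round_trip = handle.read()
```

The second option avoids the 'charmap' error without altering the page, because the failure comes from the default cp1252 file encoding on Windows rather than from the page content itself.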
