
encoding error : input conversion failed due to input error, bytes 0x9D 0x29 0x2E 0x20 when using Flask and BeautifulSoup

I am writing a web scraper to collect GSoC organization information, and I am trying to display the output in the browser using Flask. However, I am getting this error:

(venv) astanwar99@astanwar99-Predator-G3-572:~/DEVSPACE/WebDev/Web_Scrap_GSOC/GSoC-OrganisationScraper$ python scrape.py
* Serving Flask app "scrape" (lazy loading)
* Environment: production
WARNING: Do not use the development server in a production 
environment.
Use a production WSGI server instead.
* Debug mode: on
* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
* Restarting with stat
* Debugger is active!
* Debugger PIN: 684-363-716
127.0.0.1 - - [22/Mar/2019 12:26:08] "GET / HTTP/1.1" 200 -
scrape.py:43: UserWarning: No parser was explicitly specified, so I'm 
using the best available HTML parser for this system ("lxml"). This 
usually isn't a problem, but if you run this code on another system, 
or in a different virtual environment, it may use a different parser 
and behave differently.

The code that caused this warning is on line 43 of the file scrape.py. To get rid of this warning, pass the additional argument  
'features="lxml"' to the BeautifulSoup constructor.

soup = BeautifulSoup(html)
encoding error : input conversion failed due to input error, bytes 0x9D 0x29 0x2E 0x20
encoding error : input conversion failed due to input error, bytes 0x9D 0x29 0x2E 0x20

Basically, the error is:

encoding error : input conversion failed due to input error, bytes 0x9D 0x29 0x2E 0x20

encoding error : input conversion failed due to input error, bytes 0x9D 0x29 0x2E 0x20

I searched on Google and the results pointed at BeautifulSoup, but the script works fine when I print the output to the terminal, so I am not sure the problem is related to BeautifulSoup. Here is my code:

scrape.py

#!/usr/bin/env python
import requests
import sys
import warnings
import signal
from bs4 import BeautifulSoup
import flask
from flask import Flask, render_template, jsonify, request

app = Flask(__name__)
# app.config["DEBUG"] = True

@app.route('/')
def index():
    return render_template('home.html')


# #To avoid warning messages 
# warnings.filterwarnings("ignore")

#Main function.
@app.route('/genData')
def scrape():
    status = request.args.get('jsdata')

    url = "https://summerofcode.withgoogle.com/archive/2018/organizations/"
    default = "https://summerofcode.withgoogle.com"

    genData_list = []

    if status: 
        response = requests.get(url)
        html = response.content
        soup = BeautifulSoup(html, 'lxml')
        orgs = soup.findAll('li', attrs={'class': 'organization-card__container'})

        for org in orgs:
            link = org.find('a', attrs={'class': 'organization-card__link'})
            org_name = org['aria-label']
            org_link = default + link['href']
            response = requests.get(org_link)
            html = response.content
            soup = BeautifulSoup(html)
            tags = soup.findAll('li', attrs={
                'class': 'organization__tag organization__tag--technology'
                }
            )
            description_element = soup.find('div', attrs={'class': 'org__long-description'})
            description = description_element.p.text

            mdButton = soup.findAll('md-button', attrs={'class': 'md-primary org__meta-button'})

            contact = "No contact info available"
            for link in mdButton:
                if hasattr(link, 'href'):
                    if 'mailto:' in link['href']:
                        contact = link['href']
            tech = []
            for tag in tags:
                tech.append(tag.text)

            output_dict = {
                "organization" : org_name,
                "link" : org_link,
                "description" : description,
                "technologies" : tech,
                "contact" : contact
            }
            output = jsonify(output_dict)
            genData_list.append(output)

    return render_template('genData.html', genData=genData_list)


if __name__ == '__main__':
    app.run(debug=True)

home.html

<!DOCTYPE html>
<html>
    <head>
        <title>Organisation Data</title>
    </head>
    <body>

    <input type="button" id="start_output" value="START"></input>
    <div id="place_for_genData"></div>
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.1.1/jquery.min.js"></script>

    <script>
    $("#start_output").click(function(){
        var status = true;

        $.ajax({
        url: "/genData",
        type: "get",
        data: {jsdata: status},
        success: function(response) {
            $("#place_for_genData").html(response);
        },
        error: function(xhr) {
            //Do Something to handle error
        }
        });
    });
    </script>
    </body>
</html>

genData.html

<label id="value_lable">
{% for data in genData %}
    {{ data }}<br>
{% endfor %}
</label>

EDIT: Here is the original script that prints the output to the terminal.

Original scrape.py

#!/usr/bin/env python
import requests
import sys
import warnings
import signal
from bs4 import BeautifulSoup
import flask
import json
from flask import Flask, render_template, jsonify, request

# app = Flask(__name__)
# # app.config["DEBUG"] = True

# @app.route('/')
# def index():
#     return render_template('home.html')


# #To avoid warning messages 
# warnings.filterwarnings("ignore")

#Main function.
# @app.route('/genData')
def scrape():
    # status = request.args.get('jsdata')

    url = "https://summerofcode.withgoogle.com/archive/2018/organizations/"
    default = "https://summerofcode.withgoogle.com"

    genData_list = []


    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    orgs = soup.findAll('li', attrs={'class': 'organization-card__container'})

    for org in orgs:
        link = org.find('a', attrs={'class': 'organization-card__link'})
        org_name = org['aria-label']
        org_link = default + link['href']
        response = requests.get(org_link)
        html = response.content
        soup = BeautifulSoup(html)
        tags = soup.findAll('li', attrs={
                'class': 'organization__tag organization__tag--technology'
            }
        )
        description_element = soup.find('div', attrs={'class': 'org__long-description'})
        description = description_element.p.text

        mdButton = soup.findAll('md-button', attrs={'class': 'md-primary org__meta-button'})

        contact = "No contact info available"
        for link in mdButton:
            if hasattr(link, 'href'):
                if 'mailto:' in link['href']:
                    contact = link['href']
        tech = []
        for tag in tags:
            tech.append(tag.text)

        output_dict = {
            "organization" : org_name,
            "link" : org_link,
            "description" : description,
            "technologies" : tech,
            "contact" : contact
        }
        output = json.dumps(output_dict)
        print(output)
        # genData_list.append(output)



if __name__ == '__main__':
    scrape()

OUTPUT

(venv) astanwar99@astanwar99-Predator-G3-572:~/DEVSPACE/WebDev/Web_Scrap_GSOC/GSoC-OrganisationScraper$ python temp.py
temp.py:44: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually 
isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 44 of the file temp.py. To get rid of this warning, pass the additional argument 'features="lxml"' to 
the BeautifulSoup constructor.

soup = BeautifulSoup(html)
{"organization": "3DTK", "technologies": ["c/c++", " cmake", "opencv", "ros", "boost"], "contact": "mailto:johannes.schauer@uni-wuerzburg.de", 
"link": "https://summerofcode.withgoogle.com/archive/2018/organizations/5685665089978368/", "description": "The 3D Toolkit is a collection of 
programs that allow working with 3D point cloud data. The tools include a powerful and efficient 3D point cloud viewer called \"show\" which is 
 able to open point clouds containing millions of points even on older graphics cards while still providing high frame rates. It provides bindings 
 for ROS, the Robotic Operating System and for Python, the programming language. Most of the functionality of 3DTK is provided in the form of 
 \"tools\", hence the name which are executed on the command line. These tools are able to carry out operations like simultaneous localization and 
 mapping (SLAM), plane detection, transformations, surface normal computation, feature detection and extraction, collision detection and dynamic 
 object removal. We support Linux, Windows and MacOS. 3DTK contains the implementation of several complex algorithms like multiple SLAM and ICP 
 implementations as well as several data structures like k-d trees, octrees, sphere quadtrees and voxel grids. The software is home of the 
 implementation of algorithms from several high impact research papers. While the Point Cloud Library (PCL) might be dead, 3DTK is alive and 
 actively maintained by an international team of skilled researchers from all over the world, ranging from Europe to China. Know-how from 3DTK 
 influenced several businesses from car manufacturers to mineral excavation or archaeological projects."}

Feel free to suggest any other alternative or solution. I just want to display my output on localhost.

From the warning, it looks like you need to specify the parser every time:

soup = BeautifulSoup(html, 'lxml')

You have a line that currently reads (inside `for org in orgs:`):

soup = BeautifulSoup(html)
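
As a minimal sketch of the fix (the markup below is a made-up stand-in for the real GSoC page, and `lxml` is assumed to be installed), passing the parser explicitly to every `BeautifulSoup(...)` call silences the warning and keeps parsing consistent between the two constructor calls:

```python
from bs4 import BeautifulSoup

# Stand-in for response.content from requests: raw bytes, as in the
# original script.
html = b"<ul><li class='organization-card__container'>3DTK</li></ul>"

# Pass 'lxml' explicitly, exactly as on the first call, so both
# constructor calls use the same parser on every system.
soup = BeautifulSoup(html, 'lxml')
orgs = soup.findAll('li', attrs={'class': 'organization-card__container'})
print([org.text for org in orgs])  # ['3DTK']
```

If the encoding error still appears after that, another option is to decode the bytes yourself before parsing, e.g. `response.content.decode('utf-8', errors='replace')`, so stray bytes such as `0x9D` are replaced instead of being handed to libxml2 raw.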
