Output ascii file from Unicode Web Scrape in Python

I am new to Python programming. I am using the following code in my Python file:

import gethtml
import articletext
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
result = articletext.getArticle(url)
text_file = open("Output.txt", "w")

text_file.write(result)

text_file.close()

The file articletext.py contains the following code:

from bs4 import BeautifulSoup
import gethtml
def getArticleText(webtext):
    articletext = ""
    soup = BeautifulSoup(webtext)
    for tag in soup.findAll('p'):
        articletext += tag.contents[0]
    return articletext

def getArticle(url):
    htmltext = gethtml.getHtmlText(url)
    return getArticleText(htmltext)

But I am getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 473: ordinal not in range(128)

What code should I write to print the result into the output file?

The output `result` is text in the form of a paragraph.

To take care of the Unicode error, we need to encode the text as UTF-8 instead of ASCII. To make sure no exception is raised if a character cannot be encoded, we're going to ignore any characters that we don't have a mapping for. (You can also use "replace" or other options given by str.encode. See the Python docs on Unicode.)
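As a rough illustration of the error-handler options for str.encode, the sketch below encodes one string several ways. The `sample` string is made up for the demonstration; it simply contains the same curly quote (`\u201c`) that appears in the traceback. The printed values in the comments are what Python 3 shows.

```python
# Made-up sample text containing the curly quotes that triggered the error.
sample = u"India \u201ccalls\u201d for steps"

# 'ignore' silently drops characters the target codec cannot represent.
ascii_ignored = sample.encode("ascii", "ignore")

# 'replace' substitutes a '?' for each unencodable character.
ascii_replaced = sample.encode("ascii", "replace")

# UTF-8 can represent these characters, so nothing is lost.
utf8_bytes = sample.encode("utf-8")

print(ascii_ignored)   # b'India calls for steps'
print(ascii_replaced)  # b'India ?calls? for steps'
print(utf8_bytes)      # b'India \xe2\x80\x9ccalls\xe2\x80\x9d for steps'
```

The handler mostly matters when the target is ASCII. With UTF-8 as the target, ordinary text like this encodes without loss, so 'ignore' acts only as a safety net.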

Best practice when opening the file is to use a Python context manager, which will close the file even if there's an error. I'm using slashes instead of backslashes in the path to make sure this works on either Windows or Unix/Linux.

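# Encode to UTF-8 bytes; 'wb' lets the same code run on Python 2 and 3.
data = text.encode('UTF-8', 'ignore')
with open('/temp/Out.txt', 'wb') as out_file:
    out_file.write(data)

This is roughly equivalent to

data = text.encode('UTF-8', 'ignore')
# Open before the try block so close() is only called on a file that opened.
out_file = open('/temp/Out.txt', 'wb')
try:
    out_file.write(data)
finally:
    out_file.close()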

But the context manager is much less verbose and far less likely to leave a file locked open when an error occurs midway.
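An alternative sketch, not part of the answer above: `io.open` accepts an `encoding` argument on both Python 2.6+ and Python 3, so the file object can do the UTF-8 encoding itself. The sample text and the temporary output path are assumptions for the demonstration; in the question the text would be `result`.

```python
import io
import os
import tempfile

# Made-up sample text; in the question this would be `result`.
text = u"India \u201ccalls\u201d for steps"

# Write into a temporary directory so the sketch does not touch real files.
out_path = os.path.join(tempfile.mkdtemp(), "Out.txt")

# The file object encodes to UTF-8 on write, so no manual .encode() call is needed.
with io.open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(text)

# Read the file back to confirm the curly quotes survived.
with io.open(out_path, "r", encoding="utf-8") as in_file:
    round_tripped = in_file.read()

print(round_tripped == text)  # True
```

Note that `io.open` in text mode expects unicode text, so on Python 2 the value passed to `write` must be a `unicode` object rather than a byte `str`.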

text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8')) 
text_filefixed.close()

This should work, give it a try.

Why? Because the text is converted to UTF-8 bytes before it is written, and UTF-8 can represent every Unicode character, so these kinds of encoding errors don't come up :D
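A small sketch of that reasoning, using a made-up string rather than the scraped article: encoding to ASCII raises the error from the question, while encoding to UTF-8 succeeds and round-trips.

```python
# Made-up text with the character named in the traceback (U+201C).
sample = u"\u201cquoted\u201d"

# Encoding to ASCII fails because U+201C is outside the 0-127 range.
try:
    sample.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds, and decoding gives the original text back.
encoded = sample.encode("utf-8")
restored = encoded.decode("utf-8")

print(ascii_failed)        # True
print(restored == sample)  # True
```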

Edit: Make sure the file exists in the same folder; otherwise put this code after the imports and it will create the file itself. (Opening a file in "w" or "wb" mode also creates it if it is missing.)

text_filefixed = open("Output.txt", "a")
text_filefixed.close()

It creates the file, writes nothing, and closes it, so the file is created automatically without human interaction.

Edit 2: Note that I have only got this working in 3.3.2, but I know you can use this module to achieve the same thing in 2.7. One minor difference is that (I think) request is not needed in 2.7, but you should check that.

from urllib import request
url = "http://www.thehindu.com/news/national/india-calls-for-resultoriented-steps-at-asem/article5339414.ece"
# str() on bytes would give the "b'...'" repr, so decode instead (assumes the page is UTF-8).
result = request.urlopen(url).read().decode('UTF-8')
text_filefixed = open("Output.txt", "wb")
text_filefixed.write(bytes(result, 'UTF-8'))
text_filefixed.close()

Just as I thought, you will only run into this error in 2.7: urllib.request in Python 2.7
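One way to paper over the 2.7 versus 3.x import difference is a guarded import, sketched below. The helper names `fetch_bytes` and `save_page` are hypothetical and not from any answer here; the sketch writes the response bytes unchanged, so it does not need to know the page's encoding.

```python
# Pick whichever urlopen the running interpreter provides.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def fetch_bytes(url):
    """Return the raw response body as bytes (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        response.close()


def save_page(url, path):
    """Write the fetched bytes unchanged, so no text encoding step is needed."""
    with open(path, "wb") as out_file:
        out_file.write(fetch_bytes(url))
```

Calling `save_page(url, "Output.txt")` with the article URL from the question would save the raw HTML. Extracting the paragraph text would still need the BeautifulSoup step shown in the question.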
