How to read text copied from the web into a txt file using Python

I'm learning how to read text files. I did it this way:

f=open("sample.txt")

print(f.read())

It worked fine when I typed the txt file myself. But when I copied text from a news article on the web, it produced the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2014' in position 738: character maps to <undefined>

I tried changing the Encoding setting in Notepad++ to UTF-8, as I read somewhere that the error is due to that.

I also tried using:

f=open("sample.txt",encoding='utf-8')

from here

But it still didn't work.

You're on Windows and trying to print to the console. It is the print() call that throws the exception, not the file read.

The Windows console natively supports only 8-bit code pages, so any character outside your region's code page will break (despite what people say about chcp 65001).
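The failure can be reproduced without a console, which suggests that the encoding step is what fails rather than the file read. This is a minimal sketch, assuming the console uses code page 850 (an assumption; the question does not say which code page is active):

```python
# Assumption: the console code page is cp850; the question does not state it.
text = "before \u2014 after"  # contains an em-dash, U+2014

try:
    # Roughly what print() has to do before sending text to such a console.
    text.encode("cp850")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    # ascii() keeps the report printable on any console.
    print("cannot encode:", ascii(exc.object[exc.start:exc.end]))
```

The same string encodes without error to UTF-8, which is why writing to a UTF-8 file avoids the problem.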

You need to install and use https://github.com/Drekin/win-unicode-console . This module talks to the console API at a low level, giving support for multi-byte characters for both input and output.

Alternatively, don't print to the console; write your output to a file opened with an explicit encoding. For example:

with open("myoutput.log", "w", encoding="utf-8") as my_log:
    my_log.write(body)

Ensure you open the file with the correct encoding.
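Putting the read and the write together might look like the sketch below. The temporary directory and the sample text are stand-ins for illustration only; in practice sample.txt would be the file saved as UTF-8 from Notepad++.

```python
import os
import tempfile

# Hypothetical paths inside a temporary directory, for illustration only.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "sample.txt")
dst = os.path.join(workdir, "myoutput.log")

# Stand-in for text pasted from a web page and saved as UTF-8.
with open(src, "w", encoding="utf-8") as f:
    f.write("word \u2014 word\n")

# Read with an explicit encoding instead of the platform default.
with open(src, encoding="utf-8") as f:
    body = f.read()

# Write to a file rather than printing to the console.
with open(dst, "w", encoding="utf-8") as my_log:
    my_log.write(body)
```

Opening myoutput.log in an editor that understands UTF-8 should then show the em-dash intact.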

I assume that you are using Python 3, judging from the open and print syntax you use.

The offending character '\u2014' is an em-dash. As I assume you are using Windows, setting the console to UTF-8 (chcp 65001) might help, provided you are not using too old a version.

If it is a batch script, and the print is only there to produce traces, you could encode explicitly with errors='replace'. For example, assuming that your console uses code page 850:

print(f.read().encode('cp850', 'replace'))

This will replace every unmapped character with ?. Not very nice, but at least it does not raise an exception.
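Note that in Python 3 the print call above shows a bytes literal (b'...') rather than plain text. One possible refinement is to decode the bytes back with the same code page before printing. This is only a sketch: console_safe is a hypothetical helper name, and cp850 is still an assumption about the console.

```python
def console_safe(text, codepage="cp850"):
    """Return text with characters the code page lacks replaced by '?'.

    console_safe is a hypothetical helper; cp850 is an assumed code page.
    """
    # Encode with replacement, then decode so print() receives a str.
    return text.encode(codepage, errors="replace").decode(codepage)


print(console_safe("word \u2014 word"))  # prints: word ? word
```

Characters that the code page does contain pass through unchanged, so only the unmappable ones are lost.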
