
How to remove non-ASCII characters from a text file using Python and turn the file into a string

import re
data2 = ''
file = open('twitter.txt', 'r')
for i in file:
    thing = re.sub(r'[^\x00-\x7f]',r'', str(file[i]))
    print(str(thing))

Hi, I'm very new to Python. After scraping a bunch of data from Twitter using Python, I put the data into a text file. The text file ends up with a lot of emojis and other non-ASCII characters that can't be turned into a string. The above code is my attempt to remove the non-ASCII characters and turn the file into a string, but it ends up giving me this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1607: character maps to <undefined>

How can I remove the non-ASCII characters and then turn the remaining text into a string?

Python 3.6 and earlier:

def return_only_ascii(text):
    # Keep only characters whose code point is in the ASCII range (0-127)
    return ''.join([x for x in text if ord(x) < 128])

Python 3.7 and later:

def return_only_ascii(text):
    # str.isascii() is available from Python 3.7 onward
    return ''.join([x for x in text if x.isascii()])

Result:

>>> return_only_ascii('José')
'Jos'
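
The UnicodeDecodeError in the question is raised while the file is being read, before any filtering happens: the 'charmap' codec in the message suggests that open() fell back to a platform default encoding that cannot decode some of the bytes. Separately, file[i] would also fail, because file objects cannot be indexed. Below is a minimal sketch of one way to combine reading and filtering, assuming the file is UTF-8 encoded (the question does not say). The name read_ascii_only is hypothetical and used only for illustration.

```python
import re

# Matches any character outside the ASCII range (same pattern as in the question)
NON_ASCII = re.compile(r'[^\x00-\x7f]')


def read_ascii_only(path, encoding='utf-8'):
    """Return the contents of a text file as one string, without non-ASCII characters.

    Assumption: the file is encoded as UTF-8 unless another encoding is passed in.
    """
    # errors='replace' turns undecodable bytes into U+FFFD instead of raising
    # UnicodeDecodeError; U+FFFD is itself non-ASCII, so it is removed below.
    with open(path, 'r', encoding=encoding, errors='replace') as f:
        text = f.read()
    return NON_ASCII.sub('', text)


# Possible usage with the file from the question:
# data2 = read_ascii_only('twitter.txt')
```

Passing the encoding explicitly avoids depending on the platform default, which is probably what triggered the error in the question. If the file turns out not to be UTF-8, the encoding argument would need to be changed accordingly.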
