
Removing non-ASCII characters from file text

Python experts:

I have a sentence like "this time air\u00e6\u00e3o was filled\u00e3o" and I want to remove the non-ASCII Unicode characters. I can do that with the following function:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))          

sentence = "this time air\u00e6\u00e3o was filled\u00e3o"   
sentence = removeNonAscii(sentence)
print(sentence)

This prints "this time airo was filledo", so it removes the "\u00.." characters as expected. But when I write the sentence to a file, read it back, and process it in a loop:

def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))

hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)

it prints "this time air\u00e6\u00e3o was filled\u00a3o", so nothing is removed at all. What is happening here? If the function works on the string, why doesn't it work on the file?

I suspect the file does not contain the actual non-ASCII characters. Instead, it contains the escape sequence for each character written out as plain text, i.e. the six ASCII characters \u00e6 rather than the single character æ. When your code reads the file, every character it sees (the backslash, the "u", and the hex digits) is plain ASCII, so the filter leaves them all in place.
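A quick way to see the difference, as a minimal sketch: the names from_literal and from_file are made up for illustration, and the raw string only stands in for what I assume your file contains.

```python
# In Python source code, "\u00e6" is an escape sequence, so this
# literal holds ONE non-ASCII character after "air".
from_literal = "air\u00e6"

# A raw string stands in for text read from a file that contains the
# six ASCII characters: backslash, u, 0, 0, e, 6 (assumed file content).
from_file = r"air\u00e6"

print(len(from_literal))  # 4
print(len(from_file))     # 9

# The ord(x) < 128 filter only has something to remove in the first case.
print(all(ord(c) < 128 for c in from_literal))  # False
print(all(ord(c) < 128 for c in from_file))     # True
```

Printing repr(sentence) inside your loop should show which of the two cases you actually have.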

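If that is the case, use this:

import re
def removeNonAscii(s):
    # Remove each literal escape: a backslash, "u", and four hex digits
    return re.sub(r'\\u[0-9a-fA-F]{4}', '', s)

It removes every literal '\uXXXX' sequence from the string. Matching hex digits only, rather than \w, avoids stripping unrelated text such as the '\Users' part of a Windows path.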

Example:

>>> import re
>>> with open(r'C:\Users\...\file.txt','r') as f:
...     for line in f:
...         print(re.sub(r'\\u[0-9a-fA-F]{4}', '', line))
...
this time airo was filledo

where file.txt has:

this time air\u00e6\u00e3o was filled\u00a3o
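If the file might contain both literal escapes and real non-ASCII characters, the two approaches can be combined. This is only a sketch under that assumption; strip_non_ascii and ESCAPE_RE are hypothetical names, not part of any library.

```python
import re

# Matches a literal backslash, "u", and exactly four hexadecimal digits.
ESCAPE_RE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_non_ascii(text):
    """Drop literal \\uXXXX escapes, then any real non-ASCII characters."""
    without_escapes = ESCAPE_RE.sub('', text)
    # encode(..., 'ignore') silently discards characters outside ASCII.
    return without_escapes.encode('ascii', 'ignore').decode('ascii')

# Escapes stored as plain text (raw string standing in for file content).
print(strip_non_ascii(r"this time air\u00e6\u00e3o was filled\u00a3o"))
# Real non-ASCII characters, as in the string literal from the question.
print(strip_non_ascii("this time air\u00e6\u00e3o was filled\u00e3o"))
```

Both calls print "this time airo was filledo". Note that the regex removes every \uXXXX escape, including ones that stand for ASCII characters such as \u0041.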
