Python experts:
I have a sentence like: "this time air\u00e6\u00e3o was filled\u00e3o"
I want to remove the non-ASCII Unicode characters. I can use the following function and code:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
sentence = "this time air\u00e6\u00e3o was filled\u00e3o"
sentence = removeNonAscii(sentence)
print(sentence)
It then prints: "this time airo was filledo"
This works great for removing the "\u00.." characters, but when I write the sentence to a file, then read it back and process it in a loop:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))
hand = open('test.txt')
for sentence in hand:
    sentence = removeNonAscii(sentence)
    print(sentence)
it shows "this time air\u00e6\u00e3o was filled\u00a3o"
It doesn't work at all. What is happening here? If the function works, it should not behave this way.
I have a feeling that instead of containing the actual non-ASCII characters, the text in your file contains the literal escape sequence for each character. That is, instead of whatever character you think is there, the file actually holds the text \u00--, which is six ordinary ASCII characters (a backslash, the letter u, and four hex digits).
So when you run your code, it reads every character, sees that each one is plain ASCII, and the filter leaves them all in place.
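The difference between the two cases can be reproduced without a file. This is only a minimal sketch: the raw string stands in for a line read from a file that literally contains backslash-u text, and the names `from_source`, `from_file`, and `remove_non_ascii` are made up for the illustration.

```python
# In Python source code, "\u00e6" is an escape sequence: the parser turns it
# into ONE character (code point 230), which the filter can then drop.
from_source = "air\u00e6"

# A raw string keeps the backslash literally. It mimics a line read from a
# file that contains the text \u00e6: six separate ASCII characters.
from_file = r"air\u00e6"

def remove_non_ascii(s):
    # Same idea as the filter in the question: keep code points below 128.
    return "".join(ch for ch in s if ord(ch) < 128)

print(len(from_source), len(from_file))   # 4 9
print(remove_non_ascii(from_source))      # air
print(remove_non_ascii(from_file))        # air\u00e6 (nothing removed)
```

Every character in the second string is below 128, so the filter has nothing to remove.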
If this is the case, use this:
import re
def removeNonAscii(s):
    return re.sub(r'\\u\w{4}','',s)
and it will remove every instance of '\u----'.
Example:
>>> with open(r'C:\Users\...\file.txt','r') as f:
...     for line in f:
...         print(re.sub(r'\\u\w{4}','',line))
...
this time airo was filledo
where file.txt contains:
this time air\u00e6\u00e3o was filled\u00a3o
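One caveat with the pattern above: `\w` matches any word character, so `\\u\w{4}` would also delete text such as `\users` in a Windows path. A slightly stricter variant, sketched here under the assumption that the file only ever contains four-digit `\uXXXX` escapes, matches hexadecimal digits only. The names `ESCAPE`, `strip_unicode_escapes`, and `decode_then_filter` are hypothetical and not part of the original answer. The second function is an alternative that decodes each escape first and then applies the ASCII filter from the question, so an escape for an ASCII character (for example `\u0041`) is kept as `A` instead of being dropped.

```python
import re

# A backslash, the letter u, then exactly four hexadecimal digits.
ESCAPE = re.compile(r'\\u[0-9a-fA-F]{4}')

def strip_unicode_escapes(s):
    # Delete every literal \uXXXX sequence outright.
    return ESCAPE.sub('', s)

def decode_then_filter(s):
    # Replace each literal \uXXXX with the character it names,
    # then keep only the characters whose code point is below 128.
    decoded = ESCAPE.sub(lambda m: chr(int(m.group(0)[2:], 16)), s)
    return "".join(ch for ch in decoded if ord(ch) < 128)

# Raw string: stands in for a line read from the file.
line = r"this time air\u00e6\u00e3o was filled\u00a3o"
print(strip_unicode_escapes(line))   # this time airo was filledo
print(decode_then_filter(line))      # this time airo was filledo
```

For the sample line both functions give the same result. They differ only when an escape encodes an ASCII character.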