
Python: how to parse non-ASCII characters in string

In my Python script, I'm trying to read in a text file that contains columns with people's first and last names, some of which have non-ASCII characters like ñ. But when I do so, I get the error UnicodeDecodeError: 'utf8' codec can't decode byte 0x96 in position 66.

From what I've been reading online, I know you can handle this problem by ignoring or dropping the non-ASCII characters, but I don't want to do that. Is there a straightforward way of reading all the non-ASCII characters in a file into a normal string?

Currently, I'm opening my file with infile = open(filename, 'rU').
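The byte 0x96 is not valid UTF-8; in Windows-1252 (cp1252) it is an en dash (–), which suggests the file was saved in cp1252 rather than UTF-8. A minimal sketch, assuming cp1252 is the real encoding; the file name 'names.txt' and its contents are made up for illustration:

```python
# Create a sample file the way a Windows editor might have saved it:
# 'ñ' becomes byte 0xF1 and '–' becomes byte 0x96 in cp1252.
with open('names.txt', 'wb') as f:
    f.write('Peña\tSmith–Jones\n'.encode('cp1252'))

# Opening with the actual encoding decodes every byte correctly.
with open('names.txt', encoding='cp1252') as infile:
    text = infile.read()
print(text)

# If the encoding is truly unknown, errors='replace' at least avoids
# the crash, substituting U+FFFD for undecodable bytes.
with open('names.txt', encoding='utf-8', errors='replace') as infile:
    fallback = infile.read()
```

Note that the 'rU' mode string only controls newline handling; in Python 3 the encoding is chosen via the encoding= keyword of open().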

Not a duplicate question: I'm asking about how to read in a file with Unicode characters, not how to write a Unicode string out to a file.

  1. Make a copy of the file.
  2. Make sure your file really is in a Unicode encoding, and find out which one it uses. Some simple editors, such as Geany, can help you identify the encoding that was used when the file was created. If the file is big, split it and inspect a part of it in the editor.
  3. Open the file with the correct encoding (it may be an old cp-family encoding such as cp1252) and convert it to UTF-8, or use a tool such as an editor to do the conversion.
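Step 3 above can be sketched in a few lines: decode with the source encoding, re-encode as UTF-8. The file names and the cp1252 guess are assumptions for illustration, not part of the original answer:

```python
src, dst = 'names_cp1252.txt', 'names_utf8.txt'

# Create a sample cp1252 file so the sketch is self-contained.
with open(src, 'wb') as f:
    f.write('Muñoz\n'.encode('cp1252'))

# Decode using the source encoding, write back out as UTF-8.
with open(src, encoding='cp1252') as fin, \
     open(dst, 'w', encoding='utf-8') as fout:
    fout.write(fin.read())

# The converted file now holds the UTF-8 byte sequence for 'ñ' (C3 B1).
with open(dst, 'rb') as f:
    data = f.read()
print(data)  # b'Mu\xc3\xb1oz\n'
```

After the conversion, the file opens cleanly with encoding='utf-8' and the original UnicodeDecodeError goes away.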

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.
