
Python: Converting Mixed Decoded UTF-8 Characters to Text

Using a RESTful service, I have a Python 3.x script that downloads text data from a vendor and writes it to a text file. The data contains text that includes UTF-8 characters shown as raw byte escapes. Here's an example of the text I receive:

b'Sample data plus some Japanese characters \xe3\x81\xaa\xe3\x81\x9c\xe6\x97\xa5\xe9\x8a\x80\xe3\x81\xa0\xe3\x81\x91\xe9\x81\x95\xe3\x81\x86\xe3\x81\xae\xe3\x81\x8b\xef\xbc\x9f\xe2\x80\x94\x80\x94\x80\x94\x80\x94 and then more data'

Note that this is stored in a variable, say str_data. I'd like to convert those characters to readable text before storing the data in a database. When I check type(str_data) I get <class 'str'>, even though the value looks like a <class 'bytes'> literal (e.g., b'stuff'). I have tried everything I can think of (encode(), decode(), etc.), but to no avail. The output I want is this:

Sample data plus some Japanese characters なぜ日銀だけ違うのか?— and then more data

Any help would be great. Thank you.

Update

If it will help, here's how I pulled down the data.

  resp = requests.get(get_url)
  f = open(self.export_file, "w")
  f.write(str(resp.content))
  f.close() 

If I don't use str() on my write, like so...

  resp = requests.get(get_url)
  f = open(self.export_file, "w")
  f.write(resp.content)
  f.close() 

I get the following...

TypeError: write() argument must be str, not bytes
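
The TypeError comes from opening the file in text mode ("w"), which accepts only str. Below is a minimal sketch of two ways around it, assuming the payload is meant to be UTF-8 (as the Japanese text in the sample suggests). The names save_raw and save_text are hypothetical, and the content argument stands in for resp.content:

```python
def save_raw(content: bytes, path: str) -> None:
    """Write the bytes unchanged; binary mode ("wb") accepts bytes."""
    with open(path, "wb") as f:
        f.write(content)


def save_text(content: bytes, path: str) -> None:
    """Decode first, then write the str in text mode with an explicit encoding.

    errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    """
    text = content.decode("utf-8", errors="replace")
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

Either call, for example save_raw(resp.content, self.export_file), avoids str(), which is what put the b'...' wrapper and the \x escapes into the file in the first place.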

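str(resp.content) wrote the repr of the bytes object to the file, so str_data is a str that merely looks like a bytes literal. ast.literal_eval() turns it back into real bytes, which can then be decoded. Some of the bytes in that string are not valid UTF-8, which is why you're having trouble. The Japanese characters are valid, though.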

>>> import ast
>>> ast.literal_eval(str_data).decode('utf-8', errors='replace')
'Sample data plus some Japanese characters なぜ日銀だけ違うのか？—������ and then more data'
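
The one-liner above can be wrapped in a small helper. This is only a sketch, assuming str_data is exactly the text that str() produces for a bytes object (a b'...' literal). The names repr_to_text and first_invalid_offset are hypothetical, not part of any library:

```python
import ast
from typing import Optional


def repr_to_text(str_data: str, encoding: str = "utf-8") -> str:
    """Turn the text "b'...'" back into bytes, then decode it.

    Undecodable bytes become U+FFFD rather than raising an error.
    """
    raw = ast.literal_eval(str_data)
    if not isinstance(raw, bytes):
        raise ValueError("expected the repr of a bytes object")
    return raw.decode(encoding, errors="replace")


def first_invalid_offset(raw: bytes, encoding: str = "utf-8") -> Optional[int]:
    """Return the offset of the first undecodable byte, or None if all decode."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc.start
    return None
```

Calling first_invalid_offset on the recovered bytes shows where the stray \x80\x94 run begins, which can help when deciding whether to replace, drop, or report those bytes before loading the text into the database.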

