
Decoding UTF8 literals in a CSV file

Question:

Does anyone know how I could transform this b"it\\xe2\\x80\\x99s time to eat" into this it's time to eat?


More details & my code:

Hello everyone,

I'm currently working with a CSV file which is full of rows with UTF-8 literals in them, for example:

b"it\\xe2\\x80\\x99s time to eat"

The end goal is to get something like this:

it's time to eat

To achieve this I have tried using the following code:

import pandas as pd


file_open = pd.read_csv("/Users/Downloads/tweets.csv")

file_open["text"]=file_open["text"].str.replace("b\'", "")

file_open["text"]=file_open["text"].str.encode('ascii').astype(str)

file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]

print(file_open["text"])

After running the code, the row that I took as an example is printed out as:

it\\xe2\\x80\\x99s time to eat

I have tried solving this issue using the following code to open the CSV file:

file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")

which printed out the example row in the following manner:

it\\xe2\\x80\\x99s time to eat

and I have also tried decoding the rows using this:

file_open["text"]=file_open["text"].str.decode('utf-8')

which gave me the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Thank you very much in advance for your help.

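Answer:

b"it\\xe2\\x80\\x99s time to eat" suggests that your file contains escaped byte sequences: the UTF-8 bytes were written out as literal \xNN text instead of as the characters themselves.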

In general, you can convert this to a proper Python 3 string with something like:

x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)     # it’s time to eat

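(The .encode('latin1') step works because Latin-1 maps the code points U+0000 to U+00FF one-to-one onto the byte values 0 to 255, so it recovers the original bytes unchanged.)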
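To make the chain less opaque, here is a minimal sketch that splits it into its three steps and keeps each intermediate value. The variable names raw, step1, step2 and step3 are only illustrative.

```python
# Bytes that hold literal backslash escapes rather than the real UTF-8 bytes.
raw = b"it\\xe2\\x80\\x99s time to eat"

# Step 1: interpret the \xNN escapes. Each escape becomes ONE character
# whose code point equals the byte value (U+00E2, U+0080, U+0099).
step1 = raw.decode("unicode-escape")

# Step 2: Latin-1 maps code points 0-255 straight back to byte values,
# so this recovers the real UTF-8 byte sequence.
step2 = step1.encode("latin1")

# Step 3: decode the recovered bytes as UTF-8.
step3 = step2.decode("utf8")

print(repr(step1))
print(step2)
print(step3)   # it’s time to eat
```

After step 1 the text is mis-decoded (one character per byte); steps 2 and 3 undo that.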

So, if you still have escaped strings after calling pd.read_csv(..., encoding="utf8"), you can read the file with the unicode-escape codec instead and then fix each value:

pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
#    itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val)   # it’s time to eat
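For a whole column, the same repair could be sketched as below. The DataFrame here is a hypothetical stand-in for what read_csv returns with the unicode-escape codec, and the helper name fix_mojibake is made up for illustration. It assumes every value contains only characters in the Latin-1 range; anything outside it would make .encode('latin1') raise UnicodeEncodeError.

```python
import pandas as pd


def fix_mojibake(value: str) -> str:
    """Re-encode a mis-decoded string to bytes, then decode it as UTF-8."""
    return value.encode("latin1").decode("utf8")


# Hypothetical sample: values as they look after reading with
# encoding="unicode-escape" (one character per original byte).
df = pd.DataFrame(
    {"text": ["it\u00e2\u0080\u0099s time to eat", "plain ascii"]}
)

# Apply the repair to every value in the column.
df["text"] = df["text"].map(fix_mojibake)

print(df["text"].tolist())
```

Plain ASCII values pass through unchanged, so it should be safe to apply the helper to mixed columns under the assumption stated above.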

But I think it's probably better to do this to the whole file instead of to each value individually, for example with StringIO (if the file isn't too big):

from io import StringIO

import pandas as pd

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)    # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")
