
Encoding issue when reading a CSV file with Python

I have hit a roadblock when trying to read a CSV file with Python.

UPDATE: If you just want to skip the characters that cause decoding errors, you can open the file like this:

with open(os.path.join(directory, file), 'r', encoding="utf-8", errors="ignore") as data_file:
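A minimal sketch of what errors="ignore" does, assuming a small made-up file (the file name and its bytes are hypothetical, not taken from the real data). Bytes that are not valid UTF-8 are silently dropped while reading. Note that this setting only affects decoding the file; it does not change how print encodes text for the console.

```python
import os
import tempfile

# Hypothetical sample: valid UTF-8 text with one stray byte (0xFF) in the middle.
raw = b"caf\xc3\xa9,\xffok\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "wb") as f:
        f.write(raw)

    # errors="ignore" drops the byte that cannot be decoded as UTF-8.
    with open(path, "r", encoding="utf-8", errors="ignore", newline="") as data_file:
        text = data_file.read()

# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(text))
```

Because the bad byte disappears without any warning, this is best kept for cases where losing a few characters is acceptable.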

So far I have tried:

import csv
import os

# root_dir is the top-level folder that contains the CSV files
for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r') as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                print(row)

The error I am getting is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I have also tried:

with open(os.path.join(directory, file), 'r', encoding="UTF-8") as data_file:

Error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2026' in position 223: character maps to <undefined>

Now, if I just print data_file, it reports the encoding as cp1252, but if I try:

with open(os.path.join(directory, file), 'r', encoding="cp1252") as data_file:

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

I also tried the recommended package.

The error I get is:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 224-225: character maps to <undefined>

The line I am trying to parse is:

2015-11-28 22:23:58,670805374291832832,479174464,"MarkCrawford15","RT @WhatTheFFacts: The tallest man in the world was Robert Pershing Wadlow of Alton, Illinois. He was slighty over 8 feet 11 inches tall.","None

Any thoughts or help would be appreciated.

I would use csvkit, which automatically detects the appropriate encoding and decodes accordingly. For example:

import csvkit
reader = csvkit.reader(data_file)

As discussed in the chat, the solution is:

for directory, subdirectories, files in os.walk(root_dir):
    for file in files:
        with open(os.path.join(directory, file), 'r', encoding="utf-8") as data_file:
            reader = csv.reader(data_file)
            for row in reader:
                # Drop every non-ASCII character so the console can print the row
                data = [i.encode('ascii', 'ignore').decode('ascii') for i in row]
                print(data)
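Why this works: the exception is a UnicodeEncodeError, which is raised when text is encoded for output (here, by print), not when the file is read. The 'charmap' codec in the message suggests the console uses a legacy code page that cannot represent characters such as '\u2026'. The sketch below compares the ASCII-stripping approach with a variant that keeps whatever the console can display. The function names and the sample string are hypothetical, and cp437 is only an assumed console code page; the real one may differ.

```python
def strip_non_ascii(text):
    """Drop every character outside ASCII (the approach used above)."""
    return text.encode("ascii", "ignore").decode("ascii")


def replace_unencodable(text, encoding):
    """Replace only the characters that `encoding` cannot represent with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample containing an ellipsis (U+2026) and an accented letter.
sample = "He was tall\u2026 caf\u00e9"

stripped = strip_non_ascii(sample)

# cp437 is an assumption here; substitute the code page your console reports.
replaced = replace_unencodable(sample, "cp437")

print(stripped)
# ascii() escapes non-ASCII characters, so this prints safely on any console.
print(ascii(replaced))
```

The second variant loses less data, because only the characters the target code page lacks are replaced. On Python 3.7 or later, calling sys.stdout.reconfigure(errors="replace") once at startup may achieve a similar effect without touching each row.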
