
(Python) Write string to CSV with UTF-8 Encoding

I have data in a database that I want to export to a CSV file. The data is in Japanese, and for further use it needs to be encoded in UTF-8.

This is my script to fetch the data and write it to the CSV file:

import mysql.connector
from mysql.connector import errorcode

import sys
import csv

query = 'SELECT * FROM `images-data`'

try:
    cnx = mysql.connector.connect(user='root', password='1234',
                                 host='127.0.0.1',
                                 database='sotsuken-test-db')
    cur=cnx.cursor()
    cur.execute(query)
    result=cur.fetchall()

    c = csv.writer(open("db-data.csv","w"))
    for row in result:
        c.writerow(row)



except mysql.connector.Error as err:
    if err.errno == errorcode.ER_ACCESS_DENIED_ERROR:
        print("Something is wrong with your user name or password")
    elif err.errno == errorcode.ER_BAD_DB_ERROR:
        print("Database does not exist")
    else:
        print(err)
else:
    cnx.close()

I can create the CSV file, and the data is exported in UTF-8, but the rows in the file look like this:

1,b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'

From my research, it looks like the data is being written as bytes objects (maybe I'm wrong here). I need the data written as UTF-8 text, without the b''.

I know I can use decode("UTF-8") to turn it into a string, but I don't see how to apply that to csv.writerow.

Can you give me any suggestions?

csv.writer.writerow just takes a list of whatever you've got,[1] calls str on each element, and puts them together into a line of CSV.

So, if what you've got is bytes objects, then it's going to call str on them and get strings like b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88\xe3\x81\xa7\xe3\x81\x94\xe3\x81\x8a\xe3\x81\x96\xe3\x81\x84\xe3\x81\xbe\xe3\x81\x99'.
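You can see this without the database at all; writing a bytes value through csv.writer reproduces exactly the output in the question (a minimal sketch using an in-memory buffer):

```python
import csv
import io

# Write one row containing a bytes object, as the question's script does
buf = io.StringIO()
csv.writer(buf).writerow([1, b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'])

# csv called str() on the bytes, so the b'...' repr ends up in the file
print(buf.getvalue())  # 1,b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88'
```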

The solution is to give it strings.


The simplest way is to decode those bytes:

I know that I can use decode("UTF-8") to make it into the string, but I cannot apply that to csv.writerow

Sure you can. I don't know what your columns are, so I'll make up my own example:

for image_id, image_name in result:
    c.writerow([image_id, image_name.decode('UTF-8')])

That's really all there is to it.
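As a runnable sketch, with a made-up row standing in for cur.fetchall() (and opening the file with newline='' and an explicit encoding, as the csv docs recommend):

```python
import csv

# Hypothetical (image_id, image_name) rows standing in for cur.fetchall()
result = [(1, b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88')]

# newline='' avoids blank lines on Windows; encoding='utf-8' makes the intent explicit
with open("db-data.csv", "w", newline="", encoding="utf-8") as f:
    c = csv.writer(f)
    for image_id, image_name in result:
        c.writerow([image_id, image_name.decode('UTF-8')])
```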


What if your code has absolutely no idea what the columns in the database are? Well, in that case, you have to get a little bit clever. As long as we know that all of the bytes columns are really Unicode text in disguise as UTF-8, you can just switch on the type:

for row in result:
    textrow = [col.decode('UTF-8') if isinstance(col, bytes) else col for col in row]
    c.writerow(textrow)

That's a bit ugly, but reading * from a database where you have no idea what the columns are is inherently an ugly problem.
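For example, with a made-up mixed-type result set, that loop decodes only the bytes columns and leaves everything else untouched:

```python
import csv
import io

# Hypothetical row: an int, a bytes column, and a column that's already str
result = [(1, b'\xe3\x83\x86\xe3\x82\xb9\xe3\x83\x88', 'ok')]

buf = io.StringIO()
c = csv.writer(buf)
for row in result:
    # Decode bytes columns as UTF-8; pass every other type through as-is
    textrow = [col.decode('UTF-8') if isinstance(col, bytes) else col for col in row]
    c.writerow(textrow)

print(buf.getvalue())  # 1,テスト,ok
```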


However, there's probably a much better solution.

MySQL Connector/Python turns all CHAR, TEXT, and similar columns into Unicode str values by default.[2] But BINARY, BLOB, and similar columns are always returned as bytes.

If this column is meant to represent Unicode text, set it to a text type instead of a binary one in the database. Then you won't have a problem in the first place—in this script, or in any other tools.
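For example, assuming the text column is called image_name (a made-up name; the question doesn't say) and is currently a BLOB, a one-time ALTER TABLE along these lines would switch it to a text type:

```sql
-- Hypothetical: the column name is an assumption, not from the question
ALTER TABLE `images-data`
    MODIFY `image_name` TEXT CHARACTER SET utf8mb4;
```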


[1] Actually, it appears to be illegal to call it with a list of anything but strings and numbers, according to the docs. But practically, it takes anything at all; it just doesn't do anything very useful with any types besides strings and numbers…

[2] That is, unless you explicitly pass use_unicode=False in the connect call or similarly set it elsewhere.
