
Python reading csv problem : extra whitespace

When I try to parse a CSV that was exported by an MS SQL 2005 Express Edition query, the string Python gives me is totally unexpected. For example, if a line in the CSV file is "aaa,bbb,ccc,dddd", then when Python parses it as a string it comes out as something like " aaa , bbb , ccc , dddd ", with extra whitespace everywhere. What is happening?
I tried to remove the spaces in my code, but it doesn't work.

import os
import random


f1 = open('a.txt', 'r')
f2 = open('dec_sql.txt', 'w')
text = 'abc'

while(text  != ''):

    text = f1.readline()
    if(text==''):
        break

    splited = text.split(',')
    for i in range (0, 32):
        splited[i] = splited[i].replace(' ', '')

    sql = 'insert into dbo.INBOUND_RATED_DEC2010 values ('
    sql += '\'' + splited[0] + '\', '
    sql += '\'' + splited[1] + '\', '
    sql += '\'' + splited[2] + '\', '
    sql += '\'' + splited[3] + '\', '
    sql += '\'' + splited[4] + '\', '
    sql += '\'' + splited[5] + '\', '
    sql += '\'' + splited[6] + '\', '
    sql += '\'' + splited[7] + '\', '
    sql += '\'' + splited[8] + '\', '
    sql += '\'' + splited[9] + '\', '
    sql += '\'' + splited[10] + '\', '
    sql += '\'' + splited[11] + '\', '
    sql += '\'' + splited[12] + '\', '
    sql += '\'' + splited[13] + '\', '
    sql += '\'' + splited[14] + '\', '
    sql += '\'' + splited[15] + '\', '
    sql += '\'' + splited[16] + '\', '
    sql += '\'' + splited[17] + '\', '
    sql += '\'' + splited[18] + '\', '
    sql += '\'' + splited[19] + '\', '
    sql += '\'' + splited[20] + '\', '
    sql += '\'' + splited[21] + '\', '
    sql += '\'' + splited[22] + '\', '
    sql += '\'' + splited[23] + '\', '
    sql += '\'' + splited[24] + '\', '
    sql += '\'' + splited[25] + '\', '
    sql += '\'' + splited[26] + '\', '
    sql += '\'' + splited[27] + '\', '
    sql += '\'' + splited[28] + '\', '
    sql += '\'' + splited[29] + '\', '
    sql += '\'' + splited[30] + '\', '
    sql += '\'' + splited[31] + '\', '
    sql += '\'' + splited[32] + '\' '

    sql += ')'

    print sql
    f2.write(sql+'\n')


f2.close()
f1.close()

Sounds to me like the output of the MS SQL 2005 query is a Unicode file. The Python csv module cannot handle Unicode files directly, but there is sample code in the csv module's documentation describing how to work around the problem.
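
For reference, a minimal sketch of that kind of workaround (my own, not taken from the docs; assuming Python 2, and assuming the export really is UTF-16, like the asker's a.txt) could look like this:

import codecs
import csv

def utf16_csv_rows(path):
    # assumption: the export is UTF-16; decode it and re-encode each line
    # as UTF-8, because the Python 2 csv module only works on byte strings
    f = codecs.open(path, 'r', 'utf-16')
    for line in f:
        yield line.encode('utf-8')
    f.close()

for row in csv.reader(utf16_csv_rows('a.txt')):
    print row   # fields come out clean, without the stray "spaces"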

Alternatively, some text editors allow you to save a file with a different encoding. For example, I opened the results of an MS SQL 2005 query in Notepad++; it told me the file was UCS-2 encoded, and I was able to convert it to UTF-8 from the Encoding menu.

Try opening the file in Notepad and using Replace All to replace ' ' with ''.

Your file is most likely encoded with a 2-byte character encoding, probably UTF-16 (though it could be some other encoding).

To get the csv module to read it properly, you should open the file with a codec so that it is decoded as it is read; that way you have unicode objects (not str objects) inside your Python program.

So, instead of opening the file with

my_file = open("data.dat", "rt")

use codecs:

import codecs

my_file = codecs.open("data.dat", "rt", "utf-16")

And then feed this to the CSV module, with:

import csv

reader = csv.reader(my_file)

first_line = True
for line in reader:
    if first_line:   # skip the header line
        first_line = False
        continue
    # assemble the sql query and issue it

Another thing: constructing your "query" out of 33 lines of repetitive code is not a nice thing to do when programming. Even in languages that lack rich string-processing facilities there are better ways to do it, but in Python you can simply write:

sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % ", ".join("'%s'" % value for value in splited )

That one line replaces the 33 lines assembling your query. (The % operator inserts a string inside the parentheses of the first string. The string ", " is used with the join method to paste together all the elements of the sequence passed to join; that sequence contains, for each value in your splited list, that value enclosed in single quotes.)
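
Putting those two pieces together, a sketch of the whole loop (still assuming the input file is UTF-16 and keeping the asker's original file names) could be:

import codecs
import csv

# assumption: the asker's a.txt export is UTF-16
f1 = codecs.open('a.txt', 'r', 'utf-16')
f2 = open('dec_sql.txt', 'w')

for splited in csv.reader(line.encode('utf-8') for line in f1):
    values = ", ".join("'%s'" % value.strip() for value in splited)
    sql = 'insert into dbo.INBOUND_RATED_DEC2010 values (%s);' % values
    print sql
    f2.write(sql + '\n')

f2.close()
f1.close()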

It may help to use Python's built-in csv reader. This looks like a Unicode issue, a problem that frustrated me a lot.

import re
import csv
import tkFileDialog

# map UTF-8 byte sequences for "smart" punctuation to plain ASCII equivalents
ENCODING_REGEX_REPLACEMENT_LIST = [(re.compile('\xe2\x80\x99'), "'"),
                                   (re.compile('\xe2\x80\x94'), "--"),
                                   (re.compile('\xe2\x80\x9c'), '"'),
                                   (re.compile('\xe2\x80\x9d'), '"'),
                                   (re.compile('\xe2\x80\xa6'), '...')]

def correct_encoding(csv_row):
    for key in csv_row.keys():
        # if there is a value for the current key
        if csv_row[key]:
            try:
                csv_row[key] = unicode(csv_row[key], errors='strict')
            except ValueError:
                # we have a bad encoding, try iterating through all the known
                # bad encodings in the ENCODING_REGEX_REPLACEMENT and replace
                # everything and then try again
                for (regex, replacement) in ENCODING_REGEX_REPLACEMENT_LIST:
                    csv_row[key] = regex.sub(replacement,csv_row[key])
                print(csv_row)
                csv_row[key] = unicode(csv_row[key])
        # if there is NOT a value for the current key
        else:
            csv_row[key] = unicode('')
    return csv_row

filename = tkFileDialog.askopenfilename()
csv_reader = csv.DictReader(open(filename, "rb"), dialect='excel') # assuming similar dialect
for csv_row in csv_reader:
    csv_row = correct_encoding(csv_row)
    # your application logic here
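
For the asker's task, the "# your application logic here" slot could then build the same insert statement from the cleaned row, for example (hypothetical, assuming the CSV column order matches the table's columns):

    # hypothetical: relies on csv_reader.fieldnames matching the table's column order
    values = ", ".join("'%s'" % csv_row[name] for name in csv_reader.fieldnames)
    print 'insert into dbo.INBOUND_RATED_DEC2010 values (%s)' % values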
