
How do I remove commas within columns from data retrieved from a CSV file?

I have several CSV files that I need to process. Fields within the columns may contain commas, and strings may be enclosed in double quotes. I managed to come up with something that works, but the CSV files I work with are sometimes 200-400 MB, and my current code takes 4 minutes to process an 11 MB file.

What can I do to make it run faster, or perhaps to process all the data at once instead of running through the code field by field?

import csv

def rem_lrspaces(data):
    data = data.lstrip()
    data = data.rstrip()
    data = data.strip()
    return data

def strip_bs(data):
    data = data.replace(",", " ")
    return data 

def rem_comma(tmp1,tmp2):

    with open(tmp2, "w") as f:
        f.write("")
        f.close()  

    file=open(tmp1, "r")
    reader = csv.reader(file,quotechar='"', delimiter=',',quoting=csv.QUOTE_ALL, skipinitialspace=True)
    for line in reader:

        for field in line:
            if "," in field : 
                field=rem_lrspaces(strip_bs(field))

            with open(tmp2, "a") as myfile:

                myfile.write(field+",")

        with open(tmp2, "a") as myfile:
            myfile.write("\n")                    

pdfsource=r"C:\automation\cutoff\test2"
csvsource=pdfsource

ofn = "T3296N17"

file_in = r"C:\automation\cutoff\test2"+chr(92)+ofn+".CSV"
file_out = r"C:\automation\cutoff\test2"+chr(92)+ofn+".TSV"   

rem_comma(file_in,file_out)

A few low-hanging fruit:

  1. strip_bs is too simple to justify the overhead of calling the function.
  2. rem_lrspaces is redundantly stripping whitespace; one call to data.strip() is all you need, in which case it too is too simple to justify a separate function.
  3. You are also spending a lot of time repeatedly opening the output file.
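As a quick check of point 2, the sketch below compares the question's rem_lrspaces helper (rewritten here as a single chained expression, which is equivalent) against a plain strip() call. The sample strings are made up for illustration.

```python
def rem_lrspaces(data):
    # Same three calls as the helper in the question, chained together.
    return data.lstrip().rstrip().strip()

# Hypothetical sample values, not taken from the real CSV files.
samples = ["  a b  ", "\tx\n", "", "no-space", "   "]

for s in samples:
    # A single strip() gives the same result as the three-call helper.
    assert rem_lrspaces(s) == s.strip()
```

Since both forms agree, the inline field.strip() is enough, and it avoids one function call per field.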

Also, it's better to pass already-open file handles to rem_comma, as that makes testing easier: in-memory file-like objects can be passed as arguments instead of real files.

This code simply builds a new list of fields from each line, then uses csv.writer to write the new fields back to the output file.

import csv

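def rem_comma(f_in, f_out):
    # csv.reader already keeps commas inside double-quoted fields together.
    reader = csv.reader(f_in, quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True)
    writer = csv.writer(f_out)

    for line in reader:
        # Replace embedded commas with spaces, then trim surrounding whitespace.
        new_line = [field.replace(",", " ").strip() for field in line]
        writer.writerow(new_line)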

ofn = "T3296N17"

file_in = r"C:\automation\cutoff\test2"+chr(92)+ofn+".CSV"
file_out = r"C:\automation\cutoff\test2"+chr(92)+ofn+".TSV"   

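# Open each file once; the output must be opened for writing.
# newline="" lets the csv module control line endings itself.
with open(file_in, newline="") as f1, open(file_out, "w", newline="") as f2:
    rem_comma(f1, f2)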
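Because rem_comma takes file handles rather than paths, it can be exercised without touching the disk. The following is a minimal sketch of such a test using io.StringIO. The sample rows are invented, and lineterminator="\n" is an assumption added here only to make the expected output easy to read (csv.writer defaults to "\r\n").

```python
import csv
import io

def rem_comma(f_in, f_out):
    reader = csv.reader(f_in, skipinitialspace=True)
    # lineterminator="\n" is an assumption for this sketch, not part of the answer.
    writer = csv.writer(f_out, lineterminator="\n")
    for line in reader:
        writer.writerow([field.replace(",", " ").strip() for field in line])

# Hypothetical input: quoted fields containing commas, plus stray spaces.
source = io.StringIO('id,name,note\n1,"Smith, John","a, b, c"\n2, plain ,ok\n')
target = io.StringIO()

rem_comma(source, target)
result = target.getvalue()
print(result)
```

Each embedded comma becomes a space, so "Smith, John" comes out as Smith followed by two spaces and John. If single spaces are preferred, the replacement string could be changed to an empty string instead.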


 