
How can I read a semicolon-separated csv file in which some strings contain semicolons?

The problem I have can be illustrated by showing a couple of sample rows in my csv (semicolon-separated) file, which look like this:

4;1;"COFFEE; COMPANY";4
3;2;SALVATION ARMY;4

Notice that in one row, a string is in quotation marks and has a semicolon inside it (no column in my input file has quotation marks around it except for the ones containing semicolons).

These rows with the quotation marks and semicolons are causing a problem: my code treats the semicolon inside the quotation marks as a delimiter, so the row appears to have an extra field/column.

The desired output would look like this, with no quotation marks around "coffee company" and no semicolon between 'coffee' and 'company':

4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4

Actually, this column with "coffee company" is totally useless to me, so the final file could look like this too:

4;1;xxxxxxxxxxx;4
3;2;xxxxxxxxxxx;4

How can I get rid of just the semi-colons inside of this one particular column, but without getting rid of all of the other semi-colons?

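Python's csv module makes it relatively easy to handle a situation like this, because it treats a quoted field as a single value and does not read the semicolon inside the quotation marks as a delimiter: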

# Contents of input_file.csv
# 4;1;"COFFEE; COMPANY";4
# 3;2;SALVATION ARMY;4

import csv
input_file = 'input_file.csv'  # Contents as shown in your question.

with open(input_file, 'r', newline='') as inp:
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        # If you don't care about what's in the column, use the following instead:
        # row[2] = 'xyz'  # Value not needed.
        print(';'.join(row))

Printed output:

4;1;COFFEE COMPANY;4
3;2;SALVATION ARMY;4
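
To see why the csv module helps here, the following minimal sketch compares a naive split on ';' with csv.reader, using the two sample rows from the question. The data is held in an in-memory string (the names sample, naive_rows and parsed_rows are only illustrative) so that the snippet runs without any file on disk:

```python
import csv
import io

# Sample rows from the question, kept in memory instead of a file.
sample = '4;1;"COFFEE; COMPANY";4\n3;2;SALVATION ARMY;4\n'

# Naive splitting treats the semicolon inside the quotes as a delimiter.
naive_rows = [line.split(';') for line in sample.splitlines()]

# csv.reader honors the double quotes and strips them from the field.
parsed_rows = list(csv.reader(io.StringIO(sample), delimiter=';'))

print(len(naive_rows[0]))   # 5 fields: the quoted string was cut in two
print(len(parsed_rows[0]))  # 4 fields, as intended
print(parsed_rows[0][2])    # COFFEE; COMPANY
```

The embedded semicolon is still present in the parsed field, which is why the answer's code removes it with replace() afterwards.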

Follow-on question: How do I write this data to a new csv file?

import csv
input_file = 'input_file.csv'  # Contents as shown in your question.
output_file = 'output_file.csv'

with open(input_file, 'r', newline='') as inp, \
     open(output_file, 'w', newline='') as outp:
    writer = csv.writer(outp, delimiter=';')
    for row in csv.reader(inp, delimiter=';'):
        row[2] = row[2].replace(';', '')  # Remove embedded ';' chars.
        writer.writerow(row)
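
One detail worth knowing when writing the file: with its default quoting, csv.writer puts the quotation marks back around any field that still contains the delimiter. The sketch below illustrates this with an in-memory buffer. The helper write_rows is hypothetical, and lineterminator='\n' is set only to keep the printed output simple (the writer's default is '\r\n'):

```python
import csv
import io

def write_rows(rows):
    """Serialize rows with ';' as the delimiter and return the text."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
    writer.writerows(rows)
    return buffer.getvalue()

# Field still contains ';' so the writer quotes it again.
untouched = write_rows([['4', '1', 'COFFEE; COMPANY', '4']])
# Field was cleaned first, so no quoting is needed.
cleaned = write_rows([['4', '1', 'COFFEE COMPANY', '4']])

print(untouched, end='')  # 4;1;"COFFEE; COMPANY";4
print(cleaned, end='')    # 4;1;COFFEE COMPANY;4
```

So the output file stays valid either way, but the embedded semicolon has to be removed (or the column overwritten) before writing if the goal is a file with no quoted fields.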

Here's an alternative approach using the Pandas library, which spares you from writing explicit for loops:

import pandas as pd

# Read the csv into dataframe df (quoted fields are parsed as single values).
df = pd.read_csv('data.csv', sep=';', header=None)
# Remove the semicolon in column 2 (the third column).
df[2] = df[2].apply(lambda x: x.replace(';', ''))

This gives the following dataframe df:

   0  1               2  3
0  4  1  COFFEE COMPANY  4
1  3  2  SALVATION ARMY  4

Pandas provides many built-in functions for manipulating data and computing statistics. Having the data in a tabular format can also make working with it more intuitive.
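
If the cleaned data also needs to go back into a semicolon-separated file, a sketch along the following lines should work. It assumes pandas is installed, reads the sample rows from an in-memory string rather than 'data.csv', and uses the vectorized str.replace in place of apply with a lambda. Calling to_csv without a path returns the text; passing a file name (for example a hypothetical 'output_file.csv') would write it to disk instead:

```python
import io

import pandas as pd

# Sample rows from the question, kept in memory instead of a file.
sample = '4;1;"COFFEE; COMPANY";4\n3;2;SALVATION ARMY;4\n'

df = pd.read_csv(io.StringIO(sample), sep=';', header=None)

# Remove literal semicolons from column 2 (regex=False: plain text match).
df[2] = df[2].str.replace(';', '', regex=False)

# Without a path argument, to_csv returns the csv text as a string.
result = df.to_csv(sep=';', header=False, index=False)
print(result, end='')
```

header=False and index=False keep the automatically generated column labels and row numbers out of the output, so the result has the same shape as the input file.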
