
Python: Read a CSV file in which one column contains multiple commas

I have a UTF-8 encoded, comma-delimited CSV file in which one of the columns contains multiple commas, but I need to import that column as a single column for further manipulation. The data frame looks like this:

C1 C2 C3 C4 C5 C6      C7.... C27
1, 2, 3, 4, 5, A,B,C,   2 .......
3, 5, 3, 4, 6, A,B,C,D, 8 .......
1, 2, 2, 5, 8, A,B,     7 .......
3, 5, 3, 4, 6, ABCDE,   8 .......
1, 2, 3, 4, 5, A,B,C,D  2 .......

Column 6 contains Chinese characters as well as a varying number of commas. Columns 5 and 7 are entirely numeric. The data frame has 27 columns in total. I want the text in the 6th column treated as the value of one cell instead of as values for several variables.

I know that you can wrap the field in quotation marks first, but I'm wondering how exactly you would do it. I have more than 1000 files like this that I have to open.

Any suggestions would be appreciated!

A follow-up question: what if the number of columns differs between files? Is it possible to use a regular expression to define the column pattern, get the number of columns first, and then decide how to split the columns?

My current plan is to get the columns of each file first, save them to a CSV file, and then use the method in the possible duplicate question. But any suggestions for a more efficient way would be appreciated!
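
On the follow-up: if every file starts with a header row whose names contain no commas, the expected column count can probably be read from that first line instead of being hard-coded, with no regular expression needed. A minimal sketch under that assumption (the sample text and the helper name `count_header_columns` are made up for illustration):

```python
import csv
import io


def count_header_columns(f):
    """Return the number of fields in the first line of an open CSV file."""
    header = next(csv.reader(f))
    return len(header)


# Hypothetical sample standing in for one of the real files.
sample = "C1,C2,C3,C4,C5,C6,C7\n1,2,3,4,5,A,B,C,2\n"
num_cols = count_header_columns(io.StringIO(sample))
print(num_cols)
```

For a real file, the same helper would be called on the object returned by `open(filename, newline='', encoding='utf-8')`. If a file has no header row, this does not apply and the count has to come from somewhere else.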

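Since you know the expected number of columns, you can keep the fixed number of fields at the front and at the back of each row and join whatever is left in the middle back into a single value. Plain slicing is used here rather than a set() difference, because a set loses the field order and drops repeated values. You can just change num_cols for other files.

import csv

filename = 'mycsv.csv'
num_cols = 27  # "The data frame has 27 columns in total"
text_col = 5   # zero-based index of the 6th column, the one with extra commas

# Number of columns that follow the 6th column (21 for a 27-column file).
n_after = num_cols - text_col - 1

with open(filename, newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            assert len(row) >= num_cols, f'The csv file does not contain at least {num_cols} columns.'
            before_sixth = row[:text_col]
            after_sixth = row[len(row) - n_after:]  # everything after the 6th column
            # Rejoin the pieces that the reader split on the embedded commas.
            sixth = ','.join(row[text_col:len(row) - n_after])
            new_row = before_sixth + [sixth] + after_sixth
            print(new_row)
        except AssertionError as e:
            print(e)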
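
With more than 1000 files, it may be convenient to wrap the merging step in a function and write the repaired rows back out with `csv.writer`, which quotes any field containing a comma by default, so the result can then be read normally. The following is only a sketch: the helper name `merge_extra_fields`, the 7-column sample data, and the in-memory buffers are assumptions for illustration, and it relies on exactly one column containing unquoted commas.

```python
import csv
import io


def merge_extra_fields(row, num_cols, text_col):
    """Rejoin the surplus fields into the column at zero-based index text_col."""
    extra = len(row) - num_cols  # how many extra splits the commas caused
    if extra < 0:
        raise ValueError(f'expected at least {num_cols} fields, got {len(row)}')
    merged = ','.join(row[text_col:text_col + extra + 1])
    return row[:text_col] + [merged] + row[text_col + extra + 1:]


# Hypothetical 7-column sample; the real files have 27 columns.
sample = "1,2,3,4,5,A,B,C,2\n3,5,3,4,6,ABCDE,8\n"

out = io.StringIO()
writer = csv.writer(out)  # default QUOTE_MINIMAL quotes fields that contain commas
for row in csv.reader(io.StringIO(sample)):
    writer.writerow(merge_extra_fields(row, num_cols=7, text_col=5))

print(out.getvalue())
```

For real files, the two `io.StringIO` objects would be replaced by files opened with `newline=''` and `encoding='utf-8'`, and the loop would run once per filename.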


 