
How to read_csv with an incorrectly formatted file

I have a text file as sample below:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,val E1, val E2, val E3
val A2,val B2, val C2,val D2, val E4

Please note that some values in col E contain multiple comma-separated parts, e.g. val E1, val E2, val E3

When I use df = pd.read_csv(r'path/text_file.txt', sep="\t"), everything is read into a single column instead of multiple columns, as below:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,val E1, val E2, val E3
val A2,val B2, val C2,val D2, val E4

The expected dataframe as below:

col A col B col C col D col E
val A1 val B1 val C1 val D1 val E1, val E2, val E3
val A2 val B2 val C2 val D2 val E4

I tried replacing the delimiter with "," instead of "\t", but it wouldn't work, since in col E I have multiple values which are separated by ",".

  • This solution is 80x faster than the read_csv-based solution below, on a file with 31201 rows.
  • The file is not a correctly formatted CSV file. Multiple comma-separated values that belong in one column should be wrapped in double quotes, like "val E1, val E2, val E3" .
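As an illustration of that quoting rule (a standard-library sketch, not part of the original answer), Python's csv module applies exactly this quoting when writing a row whose field contains the delimiter:

```python
import csv
import io

# csv.writer with the default quoting=csv.QUOTE_MINIMAL quotes any field
# that itself contains the delimiter
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['val A1', 'val B1', 'val C1', 'val D1', 'val E1, val E2, val E3'])

print(buf.getvalue().strip())
# val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
```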

Repair the data format

  1. Open the file and fix it with a list comprehension
  2. Iterate through each line of the file with for l in f
  3. Split each line into a list with row := l.strip().split(','), which uses an assignment expression ( := ) and requires Python >= 3.8
    • An option without := is at the bottom
  4. Fix the rows
    • [','.join(row[4:])] joins everything from index 4 onward into a single string inside a list, which is then concatenated back onto the list of the first 4 values, row[:4] .
  5. Load into the dataframe
import pandas as pd

with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None] 

df = pd.DataFrame(rows[1:], columns=rows[0])

# display(df)
    col A   col B    col C   col D                   col E
0  val A1  val B1   val C1  val D1  val E1, val E2, val E3
1  val A2  val B2   val C2  val D2                  val E4

df.to_csv('test.txt', index=False)

# properly formatted csv
col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
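Once the embedded commas are quoted, a plain pd.read_csv call parses the file into the expected five columns. A sketch against an in-memory copy of the repaired text (so no file on disk is assumed):

```python
import io
import pandas as pd

# In-memory copy of the repaired file shown above
repaired = '''col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
'''

df = pd.read_csv(io.StringIO(repaired))
print(df.shape)            # (2, 5) - five columns, as expected
print(df.loc[0, 'col E'])  # val E1, val E2, val E3
```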

%%timeit comparison

  • Performed on test.txt with 31201 rows
%%timeit
with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None]
df = pd.DataFrame(rows[1:], columns=rows[0])

[result]: 50.8 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df=pd.read_csv('test.txt', header=None, skiprows=1, engine='python')
cols=pd.read_csv('test.txt',skipfooter=len(df)).columns
df[4]=df.loc[:,4:].agg(lambda x:','.join(x.dropna()),1)
df=df.loc[:,:4]
df.columns=cols

[result]: 4.04 s ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Option without assignment expression

  • %%timeit of 54.3 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
with open('test.txt') as f:
    rows = list()
    for l in f:
        row = l.strip().split(',')
        row = row[:4] + [','.join(row[4:])]
        rows.append(row)
        
df = pd.DataFrame(rows[1:], columns=rows[0])
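An equivalent one-liner (a sketch, not from the original answer) uses the maxsplit argument of str.split, which stops splitting after the first 4 commas so everything belonging to col E stays together in the fifth field (note it yields fewer than 5 fields for short rows, where the original row[:4] + [','.join(row[4:])] would pad with an empty string):

```python
import pandas as pd

lines = [
    'col A,col B,col C,col D,col E',
    'val A1,val B1,val C1,val D1,val E1, val E2, val E3',
    'val A2,val B2, val C2,val D2, val E4',
]

# maxsplit=4 splits on at most the first 4 commas, giving at most 5 fields;
# the 5th field keeps its embedded commas
rows = [l.split(',', 4) for l in lines]

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.loc[0, 'col E'])  # val E1, val E2, val E3
```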

You can try formatting the file:

First, read your csv file without a header, then get the column names from the csv file. After that, join the cell values of columns 4 onward with ',', keep the dataframe up to column 4 using loc, and finally set the column names and save the file back to csv:

df=pd.read_csv(r'path/text_file.txt',header=None,skiprows=1)
cols=pd.read_csv(r'path/text_file.txt',skipfooter=len(df)).columns
df[4]=df.loc[:,4:].agg(lambda x:','.join(x.dropna()),1)
df=df.loc[:,:4]
df.columns=cols
df.to_csv(r'path/text_file.txt',index=False)

output of csv file after formatting:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
