
How to read_csv with an incorrectly formatted file

I have a text file as sample below:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,val E1, val E2, val E3
val A2,val B2, val C2,val D2, val E4

Please note that some values in col E contain multiple comma-separated parts, e.g. val E1, val E2, val E3

When I use df = pd.read_csv(r'path/text_file.txt', sep="\t"), everything is read into a single column instead of multiple columns, as below:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,val E1, val E2, val E3
val A2,val B2, val C2,val D2, val E4

The expected dataframe as below:

col A col B col C col D col E
val A1 val B1 val C1 val D1 val E1, val E2, val E3
val A2 val B2 val C2 val D2 val E4

I tried replacing the delimiter with "," instead of "\t", but it wouldn't work, since in col E I have multiple values which are separated by ",".

  • This solution is 80x faster than the read_csv-based solution below, on a file with 31201 rows.
  • The file is not a correctly formatted CSV file. Multiple comma-separated values that belong in one column should be wrapped in double quotes, like "val E1, val E2, val E3" .
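As an illustration of that quoting rule (a standard-library sketch, not part of the original answer), Python's csv module applies exactly this quoting when writing a row whose field contains the delimiter:

```python
import csv
import io

# csv.writer with the default quoting=csv.QUOTE_MINIMAL quotes any field
# that itself contains the delimiter
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['val A1', 'val B1', 'val C1', 'val D1', 'val E1, val E2, val E3'])

print(buf.getvalue().strip())
# val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
```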

Repair the data format

  1. Open the file and fix it with a list comprehension
  2. Iterate through each line of the file with for l in f
  3. Split each line into a list with row := l.strip().split(','), which uses an assignment expression ( := ) and requires Python >= 3.8
    • An option without := is at the bottom
  4. Fix the rows
    • [','.join(row[4:])] joins everything from index 4 onward into a single string inside a list, which is then concatenated back onto the list of the first 4 values, row[:4] .
  5. Load into the dataframe
import pandas as pd

with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None] 

df = pd.DataFrame(rows[1:], columns=rows[0])

# display(df)
    col A   col B    col C   col D                   col E
0  val A1  val B1   val C1  val D1  val E1, val E2, val E3
1  val A2  val B2   val C2  val D2                  val E4

df.to_csv('test.txt', index=False)

# properly formatted csv
col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
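Once the embedded commas are quoted, a plain pd.read_csv call parses the file into the expected five columns. A sketch against an in-memory copy of the repaired text (so no file on disk is assumed):

```python
import io
import pandas as pd

# In-memory copy of the repaired file shown above
repaired = '''col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
'''

df = pd.read_csv(io.StringIO(repaired))
print(df.shape)            # (2, 5) - five columns, as expected
print(df.loc[0, 'col E'])  # val E1, val E2, val E3
```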

%%timeit comparison

  • Performed on test.txt with 31201 rows
%%timeit
with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None]
df = pd.DataFrame(rows[1:], columns=rows[0])

[result]: 50.8 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df=pd.read_csv('test.txt', header=None, skiprows=1, engine='python')
cols=pd.read_csv('test.txt',skipfooter=len(df)).columns
df[4]=df.loc[:,4:].agg(lambda x:','.join(x.dropna()),1)
df=df.loc[:,:4]
df.columns=cols

[result]: 4.04 s ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Option without assignment expression

  • %%timeit of 54.3 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
with open('test.txt') as f:
    rows = list()
    for l in f:
        row = l.strip().split(',')
        row = row[:4] + [','.join(row[4:])]
        rows.append(row)
        
df = pd.DataFrame(rows[1:], columns=rows[0])
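An equivalent one-liner (a sketch, not from the original answer) uses the maxsplit argument of str.split, which stops splitting after the first 4 commas so everything belonging to col E stays together in the fifth field (note it yields fewer than 5 fields for short rows, where the original row[:4] + [','.join(row[4:])] would pad with an empty string):

```python
import pandas as pd

lines = [
    'col A,col B,col C,col D,col E',
    'val A1,val B1,val C1,val D1,val E1, val E2, val E3',
    'val A2,val B2, val C2,val D2, val E4',
]

# maxsplit=4 splits on at most the first 4 commas, giving at most 5 fields;
# the 5th field keeps its embedded commas
rows = [l.split(',', 4) for l in lines]

df = pd.DataFrame(rows[1:], columns=rows[0])
print(df.loc[0, 'col E'])  # val E1, val E2, val E3
```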

You can try formatting the file:

First, read your csv file without a header, then get the column names from the csv file. After that, join the cell values of columns 4 onward with ',', keep the dataframe up to column 4 using loc, and finally set the column names and save the file back to csv:

df=pd.read_csv(r'path/text_file.txt',header=None,skiprows=1)
cols=pd.read_csv(r'path/text_file.txt',skipfooter=len(df)).columns
df[4]=df.loc[:,4:].agg(lambda x:','.join(x.dropna()),1)
df=df.loc[:,:4]
df.columns=cols
df.to_csv(r'path/text_file.txt',index=False)

output of csv file after formatting:

col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
