I have a text file like the sample below:
col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,val E1, val E2, val E3
val A2,val B2, val C2,val D2, val E4
Please note that some values in col E contain multiple values separated by ",", e.g. val E1, val E2, val E3.
When I use df = pd.read_csv(r'path/text_file.txt', sep="\t"), it reads the file as one column instead of multiple columns, as below:
col A,col B,col C,col D,col E |
---|
val A1,val B1,val C1,val D1,val E1, val E2, val E3 |
val A2,val B2, val C2,val D2, val E4 |
The expected dataframe is as below:
col A | col B | col C | col D | col E |
---|---|---|---|---|
val A1 | val B1 | val C1 | val D1 | val E1, val E2, val E3 |
val A2 | val B2 | val C2 | val D2 | val E4 |
I tried replacing the delimiter with "," instead of "\t", but that doesn't work because col E holds multiple values that are themselves separated by ",", e.g. "val E1, val E2, val E3".
Open the file and fix the rows with a list comprehension.
row := l.strip().split(',') uses an assignment expression (:=) and requires python >= 3.8.
[','.join(row[4:])] joins anything at index >= 4 into a single string in a list, which is then combined with the list of the first 4 values, row[:4].
A %%timeit comparison is at the bottom.
import pandas as pd
with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None]
df = pd.DataFrame(rows[1:], columns=rows[0])
# display(df)
col A col B col C col D col E
0 val A1 val B1 val C1 val D1 val E1, val E2, val E3
1 val A2 val B2 val C2 val D2 val E4
df.to_csv('test.txt', index=False)
# properly formatted csv
col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
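As a quick check that the repaired file parses with the default comma delimiter, the sketch below reads the formatted text back with pd.read_csv. It uses an in-memory copy of the output shown above instead of test.txt; the names fixed_csv and check are illustrative only.

```python
import io
import pandas as pd

# In-memory copy of the repaired csv shown above (stands in for test.txt).
fixed_csv = (
    'col A,col B,col C,col D,col E\n'
    'val A1,val B1,val C1,val D1,"val E1, val E2, val E3"\n'
    'val A2,val B2, val C2,val D2, val E4\n'
)

# The quoted field keeps its embedded commas, so the default sep=',' is enough.
check = pd.read_csv(io.StringIO(fixed_csv))
print(check.shape)
print(check.loc[0, 'col E'])
```

Only the field that contains commas is quoted; the leading spaces in the second row are kept as part of the values.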
%%timeit comparison on test.txt with 31201 rows
%%timeit
with open('test.txt') as f:
    rows = [row[:4] + [','.join(row[4:])] for l in f if (row := l.strip().split(',')) is not None]
df = pd.DataFrame(rows[1:], columns=rows[0])
[result]: 50.8 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df=pd.read_csv('test.txt', header=None, skiprows=1, engine='python')
cols=pd.read_csv('test.txt',skipfooter=len(df)).columns
df[4]=df.loc[:,4:].agg(lambda x:','.join(x.dropna()),1)
df=df.loc[:,:4]
df.columns=cols
[result]: 4.04 s ± 30 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
As an explicit for-loop, with a %%timeit of 54.3 ms ± 1.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each):
with open('test.txt') as f:
    rows = list()
    for l in f:
        row = l.strip().split(',')
        row = row[:4] + [','.join(row[4:])]
        rows.append(row)
df = pd.DataFrame(rows[1:], columns=rows[0])
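A possible variation on the split-and-rejoin step, not part of the answer above: str.split accepts a maxsplit argument, so splitting at most 4 times leaves everything after the fourth comma in the last field. This is only a sketch; it assumes the file always has 5 real columns (n_cols is a hypothetical name) and uses an in-memory sample in place of test.txt.

```python
import io
import pandas as pd

# In-memory stand-in for the malformed file from the question.
sample = (
    'col A,col B,col C,col D,col E\n'
    'val A1,val B1,val C1,val D1,val E1, val E2, val E3\n'
    'val A2,val B2, val C2,val D2, val E4\n'
)

n_cols = 5  # assumption: number of real columns in the file

with io.StringIO(sample) as f:
    # maxsplit = n_cols - 1 stops after the 4th comma,
    # so the remainder of the line stays together as the last field.
    rows = [line.rstrip('\n').split(',', n_cols - 1) for line in f]

df_split = pd.DataFrame(rows[1:], columns=rows[0])
print(df_split)
```

This avoids the slice-and-join step, at the cost of hard-coding the column count.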
You can try reformatting the file:
First read your csv file without a header, then get the column names from the file. After that, join the cell values from column 4 onward with ','.
Then keep the df up to column 4 by using loc, and finally set the column names and save the file to csv.
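# read the data rows without a header; the extra commas spill into columns 5, 6, ...
df = pd.read_csv(r'path/text_file.txt', header=None, skiprows=1)
# skip every data row so that only the header line is parsed (skipfooter needs the python engine)
cols = pd.read_csv(r'path/text_file.txt', skipfooter=len(df), engine='python').columns
# join column 4 and everything after it back into one string per row
df[4] = df.loc[:, 4:].agg(lambda x: ','.join(x.dropna()), axis=1)
df = df.loc[:, :4]
df.columns = cols
df.to_csv(r'path/text_file.txt', index=False)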
Output of the csv file after formatting:
col A,col B,col C,col D,col E
val A1,val B1,val C1,val D1,"val E1, val E2, val E3"
val A2,val B2, val C2,val D2, val E4
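One caveat with reading ragged rows through read_csv: if a later row has more fields than the first data row, the parser may raise a tokenizing error. A hedged sketch of the same read-then-join steps that sidesteps this by passing enough placeholder column names up front is shown below. The in-memory raw text (with the widest row placed last on purpose) and the names width, wide and out are assumptions for illustration, and apply is used here for the row-wise join.

```python
import io
import pandas as pd

# Sample with the widest row last, to exercise the ragged-row case.
raw = (
    'col A,col B,col C,col D,col E\n'
    'val A2,val B2, val C2,val D2, val E4\n'
    'val A1,val B1,val C1,val D1,val E1, val E2, val E3\n'
)

header, *body = raw.splitlines()
cols = header.split(',')
# Widest row in the file, counted in comma-separated fields.
width = max(line.count(',') for line in body) + 1

# Placeholder names 0..width-1 make every row fit; short rows are padded with NaN.
wide = pd.read_csv(io.StringIO(raw), header=None, skiprows=1,
                   names=list(range(width)), dtype=str)

last = len(cols) - 1
# Re-join the last real column with everything that spilled past it.
wide[last] = wide.loc[:, last:].apply(lambda x: ','.join(x.dropna()), axis=1)
out = wide.loc[:, :last].copy()
out.columns = cols
print(out)
```

The extra pass over the lines to find width costs some time, so for large files the plain-Python split shown in the other answer is likely the faster route.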