
pandas: reading a CSV where one row spans multiple lines

My csv starts out like this:

,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921  0.15163635 -0.28774077  0.07081255  0.26606813

Each row ends like this, on a new line:

  0.03439684 -0.29289168  0.13590978  0.2332756  -0.24305075  0.2034984 ]"

These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
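To illustrate where the line breaks come from and how such a string could be turned back into an array, here is a minimal sketch. The vector `vec` and the helper `decode_embedding` are hypothetical stand-ins, not part of the original data or of NumPy.

```python
import numpy as np

# Hypothetical stand-in for one embedding vector (the real ones are larger).
vec = np.linspace(-0.3, 0.3, 24)

# np.array2string wraps its output at max_line_width (75 characters by
# default), which is where the embedded newlines come from.
encoded = np.array2string(vec)


def decode_embedding(text):
    """Parse a bracketed, whitespace-separated string into a 1-D float array.

    Hypothetical helper. It assumes the string was not summarized with
    "...", which array2string does by default above 1000 elements.
    """
    return np.array(text.strip().strip("[]").split(), dtype=float)


decoded = decode_embedding(encoded)
```

Because `np.array2string` rounds to 8 digits by default, the decoded values only match the originals up to that precision.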

Calling pd.read_csv on the file raises "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". With engine="python" it raises "ParserError: unexpected end of data". With the separator sep='\t+' it puts each physical line into its own row of the dataframe. The same happens when I open the file with open(file_path) and iterate over the lines with csv.reader.

Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?

Your csv data contains quoted strings. You could try passing the quoting parameter with the value csv.QUOTE_NONE, as suggested in "Pandas ParserError EOF character when reading multiple csv files to HDF5":

import csv
import pandas as pd

csvfile = 'Path/to/csv/file'
# QUOTE_NONE makes the parser treat double quotes as ordinary characters
df = pd.read_csv(csvfile, quoting=csv.QUOTE_NONE)
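To see what the quoting setting changes, here is a small sketch on an in-memory sample. The `sample` text is made up to mimic the layout in the question, with shortened columns. With the default quoting, a newline inside a properly closed quoted field stays part of that field, so each record remains one row. With QUOTE_NONE, every physical line becomes its own row.

```python
import csv
import io

import pandas as pd

# Hypothetical two-record sample: each quoted embedding spans two lines.
sample = (
    'index,track_name,lyrics_bert_embeddings\n'
    '0,Xtalk,"[ 0.1 -0.2\n'
    '  0.3  0.4]"\n'
    '1,Other,"[ 0.5  0.6\n'
    ' -0.7  0.8]"\n'
)

# Default quoting: the embedded newline is kept inside the field,
# so the result has one row per record.
df = pd.read_csv(io.StringIO(sample))

# QUOTE_NONE: quotes are ordinary characters, so the continuation lines
# are read as separate rows with missing values in the other columns.
df_raw = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE)

print(len(df), len(df_raw))
```

If the default call works on a sample like this but fails on the full file, the "EOF inside string" error may indicate a quote that is opened but never closed near the reported row. That is an assumption worth checking in the raw file, not something the error message confirms.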
