My csv starts out like this:
,index,spotify_id,artist_name,track_name,album_name,duration_ms,lyrics,lyrics_bert_embeddings
0,0,5Jk0vfT81ltt2rYyrWDzZ5,Hundred Waters,Xtalk - Kodak to Graph Remix,The Moon Rang Like a Bell,285327,not fetched,"[ 0.00722605 -0.23726921 0.15163635 -0.28774077 0.07081255 0.26606813
Each row ends like this, on a new line:
0.03439684 -0.29289168 0.13590978 0.2332756 -0.24305075 0.2034984 ]"
These values are from a big numpy array encoded with np.array2string() and span multiple lines in the csv.
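To illustrate where the line breaks come from: np.array2string() wraps its output at max_line_width, which defaults to the print option linewidth (75 characters), so a long vector ends up on several lines. A small sketch, using a hypothetical stand-in array rather than the real embeddings:

```python
import numpy as np

# Hypothetical stand-in for one embedding vector (not the real data).
embedding = np.linspace(-0.3, 0.3, 24)

# Default settings: the string is wrapped once a line would exceed the
# configured line width, so it contains newline characters.
encoded = np.array2string(embedding)
print(encoded)
print("contains newline:", "\n" in encoded)

# Passing a very large max_line_width keeps the whole vector on one line.
single_line = np.array2string(embedding, max_line_width=10**6)
print("contains newline:", "\n" in single_line)
```

Writing the wrapped string into a csv cell is what produces a quoted field spanning several physical lines.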
When using pd.read_csv, it throws a "ParserError: Error tokenizing data. C error: EOF inside string starting at row 90607". When using the parameter engine="python", it throws a "ParserError: unexpected end of data". When using the separator sep='\t+', it just puts each line in a new row in the dataframe. When using csv.reader with open(file_path) and then iterating through each line, the same happens as with sep='\t+'.
Is there a way to automatically append each row to the original row it belongs to or do I have to preprocess this by hand?
I can see that your csv data has quoted strings in it. You could try the quoting parameter with the value csv.QUOTE_NONE, as follows (see "Pandas ParserError EOF character when reading multiple csv files to HDF5"):
import csv
import pandas as pd

csvfile = 'Path/to/csv/file'
# QUOTE_NONE tells the parser not to treat quote characters specially
df = pd.read_csv(csvfile, quoting=csv.QUOTE_NONE)
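Once the rows load, the embedding column still holds strings rather than arrays. Below is a minimal sketch of converting them back, assuming a hypothetical two-column in-memory sample instead of the original file; parse_embedding is an illustrative helper, not part of pandas or numpy. It also shows that, with the default quoting mode, a quoted field whose closing quote is present is read as a single cell even when it spans two lines.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample: the quoted field wraps onto a second line,
# in the same way np.array2string() output does in the csv.
sample = (
    'track_name,lyrics_bert_embeddings\n'
    'Xtalk,"[ 0.00722605 -0.23726921  0.15163635\n'
    '  0.03439684 -0.29289168]"\n'
)

# Default quoting: the embedded newline stays inside one cell.
df = pd.read_csv(io.StringIO(sample))


def parse_embedding(text):
    """Turn an array2string-style string back into a float array."""
    # Drop the surrounding brackets, then split on any whitespace,
    # which covers both spaces and the embedded newlines.
    return np.array(text.strip().strip("[]").split(), dtype=float)


df["lyrics_bert_embeddings"] = df["lyrics_bert_embeddings"].apply(parse_embedding)
print(df.shape)
print(df.loc[0, "lyrics_bert_embeddings"])
```

This only works for strings that were not abbreviated with "..." by array2string, which happens for arrays above the print threshold (1000 elements by default).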