Pandas read_csv() with HTML special characters

Question

I am cleaning up a CSV file in Python/Pandas, comma delimited.

Some of the cells have &amp; as part of the text. When I run read_csv(), it treats the semicolon in &amp; as the end of the current cell, which offsets the rest of the row.

I've tried encoding='utf8' and various other options...

EDIT: My code:

import pandas as pd

file = pd.read_csv('my-data-1.csv', encoding='utf8', index_col=False, low_memory=False)

# remove the copyright line at the end
file.drop(file.tail(1).index, inplace=True)

# drop the duplicates based on column Project Id
file_drop_dupes = file.drop_duplicates(['Project Id'])

# drop all columns except these few
keep_col = ['Project Id', 'Project Name', 'Type']
new_file = file_drop_dupes[keep_col]

# write the result to a new csv file
new_file.to_csv('all-good-1.csv', index=False)

An example of a field with HTML:

Service Maintenance &amp; Supply

Answer 1

In Python 3.4+, it's a simple html.unescape(). Before that, use html.parser's HTMLParser.unescape().
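As a minimal sketch of what html.unescape() does, here it is applied to the example field from the question. The variable names are only illustrative. Note that HTMLParser.unescape() was deprecated in Python 3.4 and removed in 3.9, so it only applies to older interpreters.

```python
import html

# The example field from the question, still carrying the HTML entity.
raw = "Service Maintenance &amp; Supply"

# html.unescape() converts HTML character references back to plain characters.
clean = html.unescape(raw)

print(clean)  # Service Maintenance & Supply
```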

Answer 2

If you are using Python 3+ (3.4 or later), html.unescape() is the solution.
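One way this might be applied to the DataFrame after read_csv() is sketched below. The in-memory CSV is a hypothetical stand-in for my-data-1.csv, and the helper name unescape_cell is made up for illustration; the column names are taken from the question.

```python
import html
import io

import pandas as pd

# Hypothetical sample data standing in for 'my-data-1.csv'.
csv_text = (
    "Project Id,Project Name,Type\n"
    "1,Service Maintenance &amp; Supply,A\n"
    "2,Plain Name,B\n"
)
df = pd.read_csv(io.StringIO(csv_text))


def unescape_cell(value):
    """Unescape HTML entities in strings; leave NaN and numbers untouched."""
    return html.unescape(value) if isinstance(value, str) else value


# Apply the helper to the text columns only.
for col in ["Project Name", "Type"]:
    df[col] = df[col].map(unescape_cell)

print(df.loc[0, "Project Name"])  # Service Maintenance & Supply
```

The isinstance check is there because empty cells are read as NaN (a float), which html.unescape() cannot handle.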
