I have a file that I am trying to read into a Pandas DataFrame that has a column with a complex string in it. The string contains an HTML output and is similar to the following:
"<!DOCTYPE html PUBLIC \\\\"-//W3C//DTD HTML 4.0 Transitional//EN\\\\">\\n', '<html>\\n', '<head>\\n', '<meta http-equiv=\\\\"Content-Type\\\\" content=\\\\"text/html; charset=UTF-8\\\\">\\n', '<meta charset=\\\\"utf-8\\\\">\\n', '<title>An Amazon.com Gift Card you sent has been redeemed</title>\\n', '</head>\\n', '<body>\\n',
I have tried the following so far:
df = pd.read_csv("<filename>",nrows = 50)
Which returns the following .head():
I have also tried using the escapechar= argument, but I must not have gotten the syntax right.
To be clear, this HTML string will be one part of the overall CSV file, and the above string will be only one cell of a given row. See below for a sample row of the CSV file. There are 24 columns being served in this CSV:
"241279","EMAIL_ADDRESS","EMAIL_ADDRESS","1607be7d4f2d66af","<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"URL\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<meta charset=\"utf-8\">
<title>An Amazon.com Gift Card you sent has been redeemed</title>
</head>
<body>
<img width=\"1\" height=\"1\" src=\"URL\">
Greetings from Amazon.com,<br><br>
We wanted to let you know you that an Amazon.com Gift Card you sent has been redeemed.<br><br>
The gift card was emailed by Amazon to EMAIL_ADDRESS on DATE.<br><br>
Details:<br><br>
Order # NUMBER<br>
Sent to: EMAIL_ADDRESS<br>
Date sent: DATE<br>
Message: Here is a \"thank you\" for ... <br><br>
Please note: This email was sent from a notification-only address that cannot accept incoming email.
Please do not reply to this message.<br><br>
<img width=\"1\" height=\"1\" src=\"URL\">
</body>
</html>
","DATE 01:47:58","gmail","email",,,"An Amazon.com Gift Card you sent has been redeemed","DATE","DATE","f","23",,"EMAIL_ADDRESS","EMAIL_ADDRESS",,"f","EMAIL_ADDRESS","EMAIL_ADDRESS","9","f"
Since the default quotechar for pd.read_csv is ", you should use quotechar="'".
The data uses a backslash (\) as its escape character, which isn't the default (by default, pd.read_csv has no escape character). With the following:
df = pd.read_csv("<filename>", header=None, escapechar='\\')
I obtained:
>>> df
0 1 2 3 \
0 \n"241279" EMAIL_ADDRESS EMAIL_ADDRESS 1607be7d4f2d66af
4 5 6 \
0 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Tr... DATE 01:47:58 gmail
7 8 9 ... 14 15 16 17 18 19 \
0 email NaN NaN ... 23 NaN EMAIL_ADDRESS EMAIL_ADDRESS NaN f
20 21 22 23
0 EMAIL_ADDRESS EMAIL_ADDRESS 9 f
[1 rows x 24 columns]
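To check the escapechar behaviour without the original file, here is a minimal sketch that parses a small in-memory sample. The two-row, four-column data below is made up for illustration (it is not the asker's 24-column file), and dtype=str is an added assumption so that every cell comes back as text:

```python
import io

import pandas as pd

# Hypothetical sample: every field is double-quoted, quotes inside the
# HTML cell are escaped as \" and the HTML cell spans several lines.
sample = (
    '"1","a@example.com","<html lang=\\"en\\">\n<body>Say \\"hi\\"</body>\n</html>","f"\n'
    '"2","b@example.com","<p>plain</p>","t"\n'
)

# escapechar="\\" is a single backslash; it tells the parser that \" is a
# literal quote inside a quoted field rather than the end of the field.
df = pd.read_csv(io.StringIO(sample), header=None, escapechar="\\", dtype=str)

print(df.shape)       # number of rows and columns parsed
print(df.iloc[0, 2])  # the multi-line HTML cell with its quotes restored
```

If the escaping is handled correctly, the HTML stays in a single cell, embedded newlines included, instead of being split across rows or columns.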