
Reading with Python a CSV which contains complex strings

I have a file that I am trying to read into a Pandas DataFrame that has a column with a complex string in it. The string contains an HTML output and is similar to the following:

"<!DOCTYPE html PUBLIC \\\\"-//W3C//DTD HTML 4.0 Transitional//EN\\\\">\\n', '<html>\\n', '<head>\\n', '<meta http-equiv=\\\\"Content-Type\\\\" content=\\\\"text/html; charset=UTF-8\\\\">\\n', '<meta charset=\\\\"utf-8\\\\">\\n', '<title>An Amazon.com Gift Card you sent has been redeemed</title>\\n', '</head>\\n', '<body>\\n',

I have tried the following so far:

df = pd.read_csv("<filename>", nrows=50)

This returns the following .head():

[image: screenshot of the .head() output]

I have also tried passing "escapechar=", but I must not have gotten the syntax right.

To be clear, this HTML string will be one part of the overall CSV file, and the above string will be only one cell of a given row. See below for a sample row of the CSV file. There are 24 columns being served in this CSV:

"241279","EMAIL_ADDRESS","EMAIL_ADDRESS","1607be7d4f2d66af","<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"URL\">
<html>
<head>
<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">
<meta charset=\"utf-8\">
<title>An Amazon.com Gift Card you sent has been redeemed</title>
</head>
<body>
<img width=\"1\" height=\"1\" src=\"URL\">
Greetings from Amazon.com,<br><br>

We wanted to let you know you that an Amazon.com Gift Card you sent has been redeemed.<br><br>
The gift card was emailed by Amazon to EMAIL_ADDRESS on DATE.<br><br>
Details:<br><br>

   Order # NUMBER<br>
   Sent to: EMAIL_ADDRESS<br>
   Date sent: DATE<br>
   Message: Here is a \"thank you\" for ... <br><br>

Please note: This email was sent from a notification-only address that cannot accept incoming email.
Please do not reply to this message.<br><br>
<img width=\"1\" height=\"1\" src=\"URL\">
</body>
</html>
","DATE 01:47:58","gmail","email",,,"An Amazon.com Gift Card you sent has been redeemed","DATE","DATE","f","23",,"EMAIL_ADDRESS","EMAIL_ADDRESS",,"f","EMAIL_ADDRESS","EMAIL_ADDRESS","9","f"

Since the default quotechar for pd.read_csv is ", you should use quotechar="'".

The data uses a backslash (\) as its escape character, which isn't the default (pandas applies no escape character unless you pass escapechar). With the following:

import pandas as pd

# The backslash must itself be escaped inside a Python string literal.
df = pd.read_csv("<filename>", header=None, escapechar='\\')

I obtained:

>>> df
           0              1              2                 3   \
0  \n"241279"  EMAIL_ADDRESS  EMAIL_ADDRESS  1607be7d4f2d66af   

                                                  4              5      6   \
0  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Tr...  DATE 01:47:58  gmail   

      7   8   9  ...  14  15             16             17  18  19  \
0  email NaN NaN ...  23 NaN  EMAIL_ADDRESS  EMAIL_ADDRESS NaN   f   

              20             21  22 23  
0  EMAIL_ADDRESS  EMAIL_ADDRESS   9  f  

[1 rows x 24 columns]
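
For anyone who wants to try this without the original file, here is a minimal sketch of the same escapechar approach on an in-memory sample. The three-column row below is a made-up, shortened stand-in for the 24-column data in the question (the names sample and html are only for illustration); it keeps the two features that matter: a quoted field that spans several lines and inner quotes written as \".

```python
import io

import pandas as pd

# Hypothetical, shortened stand-in for one row of the real file:
# an id, a multi-line HTML body with backslash-escaped quotes, and a label.
sample = (
    '"1","<html>\n'
    '<meta charset=\\"utf-8\\">\n'
    'Message: Here is a \\"thank you\\"<br>\n'
    '</html>\n'
    '","gmail"\n'
)

# escapechar='\\' tells the parser that \" is a literal quote
# rather than the end of the quoted field.
df = pd.read_csv(io.StringIO(sample), header=None, escapechar="\\")

html = df.iloc[0, 1]
print(df.shape)
print(html)
```

The whole HTML body should land in a single cell, with the backslashes removed and the line breaks preserved. Whether the real file needs further options (for example an explicit encoding) is not something this sketch can tell you.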
