
Read CSV with field having multiple quotes and commas

I'm aware this is a much-discussed topic, and even though there are similar questions, I haven't found one that covers my particular case.

I have a csv file that is as follows:

alarm_id,alarm_incident_id,alarm_sitename,alarm_additionalinfo,alarm_summary
"XXXXXXX","XXXXXXXXX","XXXXX|4G_Availability_Issues","TTN-XXXX","XXXXXXX;[{"severity":"CRITICAL","formula":"${XXXXX} < 85"}];[{"name":"XXXXX","value":"0","updateTimestamp":"Oct 27, 2021, 2:00:00 PM"}];[{"coName":{"XXXX/XXX":"MRBTS-XXXX","LNCEL":"XXXXXX","LNBTS":"XXXXXXX"}}]||"

It has more lines, but this is the one that causes trouble. Notice that the fifth field contains several quotes and commas, and the comma is also the separator. The inner quotes are also single rather than doubled (""), which is how a quote character that should be kept in the field is normally escaped. As a result, pandas.read_csv() splits this last field into several fields and raises an error about extra fields. I've tried several quoting-related parameters of pandas.read_csv(), but none of them works.

The csv is badly formatted. I just want to know whether there is still a way to read it, even a roundabout one, or whether it really is hopeless.

Edit: This can happen in more than one column, and I never know in advance which column(s) will be affected.

Thank you for your help.

I think I've got what you're looking for, or at least I hope so. You can read the file as usual, creating a list of the lines in the csv file. Then iterate through the lines and split each one into at most 5 parts, since the csv has 5 columns (a maxsplit of 4 yields up to 5 parts).

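# Read the whole file into a list of lines (the first one is the header).
with open("test.csv", "r") as f:
    lines = f.readlines()

for item in lines:
    # Split on the first 4 commas only; later commas stay in the fifth field.
    # This assumes the first four fields contain no commas themselves.
    new_ls = item.strip().split(",", 4)
    for new_item in new_ls:
        print(new_item)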

Now you can iterate through the column items of each line and do whatever you need to do with them.
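As a rough sketch of the same idea, the hard-coded 4 can be derived from the header instead. The function name parse_lines and the shortened SAMPLE data below are made up for illustration, and the sketch assumes that only the last column contains stray commas and quotes, which the question's edit says is not guaranteed.

```python
import io

# Hypothetical stand-in for test.csv: a header plus one malformed line.
SAMPLE = (
    'id,site,summary\n'
    '"1","A|x","s;[{"k":"v","n":"1, 2"}]"\n'
)

def parse_lines(f):
    """Split each line into as many fields as the header has columns."""
    header = f.readline().strip().split(",")
    n_fields = len(header)
    rows = []
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # maxsplit = n_fields - 1, so later commas stay in the last field
        fields = line.split(",", n_fields - 1)
        # drop one pair of enclosing quotes per field, keep inner quotes
        fields = [
            x[1:-1] if len(x) >= 2 and x[0] == '"' and x[-1] == '"' else x
            for x in fields
        ]
        rows.append(dict(zip(header, fields)))
    return rows

rows = parse_lines(io.StringIO(SAMPLE))
print(rows[0]["summary"])
```

With a real file, the same function can be called on the object returned by open("test.csv"). The resulting list of dicts can then be handed to pandas.DataFrame if a frame is needed.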

If the fields on every line are consistently enclosed in quotes, you can try splitting the line on "," (quote, comma, quote) and removing the initial and terminating quote. The line shown in the question is correctly separated with:

# drop the newline and the outer quotes, then split on the first 4 separators
row = line.rstrip("\r\n")[1:-1].split('","', 4)

But because of the incorrect formatting of your initial file, you will have to check manually that this works for every line...
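Part of that manual check could be automated. The sketch below (find_suspect_lines is a hypothetical helper, not part of any library) splits each line on every "," without a maxsplit and reports the lines whose field count differs from the expected one, so that only those lines need to be inspected by hand. It assumes every field is enclosed in quotes.

```python
def find_suspect_lines(lines, n_fields):
    """Return (line_number, field_count) for lines that need a manual look."""
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # ignore blank lines
        well_quoted = len(text) >= 2 and text.startswith('"') and text.endswith('"')
        # count fields by splitting on every quote-comma-quote sequence
        count = len(text[1:-1].split('","')) if well_quoted else 0
        if not well_quoted or count != n_fields:
            suspects.append((lineno, count))
    return suspects

# Made-up data lines (header already removed), 3 columns expected.
data = [
    '"a","b","c"\n',
    '"a","b","x;{"k":"v","n":"1"}"\n',
]
suspects = find_suspect_lines(data, 3)
print(suspects)
```

A line can still be split wrongly while showing the expected count, for example if a separator-like sequence appears inside one field and another field is missing, so this only narrows down the search.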

Can't post a comment so just making a post:

One option is to escape the internal quotes / commas, or use a regex.
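One possible reading of this suggestion, sketched under the assumption that only the last column is malformed: a regex captures the leading fields lazily and the last one greedily, and csv.writer then re-emits the row with the inner quotes doubled, which gives a line that a standard CSV parser accepts. The names build_pattern and repair_line are hypothetical.

```python
import csv
import io
import re

def build_pattern(n_fields):
    """Regex for n quoted fields: lazy groups first, a greedy last group."""
    groups = ["(.*?)"] * (n_fields - 1) + ["(.*)"]
    return re.compile('^"' + '","'.join(groups) + '"$')

def repair_line(line, pattern):
    """Return the line rewritten as valid CSV (inner quotes doubled)."""
    match = pattern.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("line does not look like fully quoted fields")
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(match.groups())
    return buf.getvalue().rstrip("\n")

# Made-up line in the same shape as the one in the question.
bad = '"a","b","c","d","x;[{"k":"v","n":"1, 2"}]||"'
fixed = repair_line(bad, build_pattern(5))
print(fixed)
print(next(csv.reader([fixed])))
```

Writing the repaired lines to a new file would let pandas.read_csv read that file with its default settings. If the malformed field can be in any column, the pattern cannot know where the real separators are, and the result has to be checked.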

Also, pandas.read_csv has a quoting parameter (it takes the csv.QUOTE_* constants, such as csv.QUOTE_NONE) that lets you adjust how it reacts to quotes, which might be useful.
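To give a feel for what the quoting setting changes, here is a small illustration with the standard library's csv module, whose QUOTE_* constants are the ones the quoting parameter accepts. It uses a made-up, well-formed line and does not show pandas itself.

```python
import csv

line = '"a","b,c"'

# Default quoting: quotes delimit fields, so the inner comma is kept.
default_row = next(csv.reader([line]))
print(default_row)    # ['a', 'b,c']

# QUOTE_NONE: quotes are ordinary characters, so every comma separates.
no_quote_row = next(csv.reader([line], quoting=csv.QUOTE_NONE))
print(no_quote_row)   # ['"a"', '"b', 'c"']
```

With QUOTE_NONE, a field that contains commas is always split, which is consistent with the question's report that changing the quoting alone did not help.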
