
Python: Remove “mid-row” line breaks in CSV

I have a CSV generated by a platform we use at work with 86 different fields. The number of fields or “columns” should remain static. The fields are a mix of data types, but some of them hold free-form text that contains line breaks.

The issue is that when I import the CSV into any program (Excel, Notepad, Jupyter Notebook with Python), each free-form text field is split onto a new row (or several new rows if the field contains multiple line breaks).

I've tried a number of things suggested by various threads here, but none have really been applicable to what I'm doing.

Here's an example of the format of what I have in the platform and want in the CSV. The actual data is more complex; this is just to illustrate the issue. The \n markers below show where the actual line breaks are; they are not visible in the editor unless you search for them:

Header0, H1, H2, H86
Name0, ABC, 123, “Hello\n my name is ABC.\n I live at 123.”\n
Name1, DEF, 456, “Hello\n my name is DEF.\n I live at 456.”\n
Name2, GHI, 789, “Good bye”\n

When I import this into any text editor, Excel, Jupyter with Python using pandas, etc. I get:

Header0, H1, H2, H86
Name0, ABC, 123, “Hello\n
my name is ABC.,,,\n
I live at 123.”,,,\n
Name1, DEF, 456, “Hello\n
my name is DEF.,,,\n
I live at 456.”,,,\n
Name2, GHI, 789, “Good bye”\n

Suggestions have been to remove all line breaks, but that doesn't work because I'd then be removing the legitimate line breaks at the end of each row (otherwise, how would separate rows be designated in the CSV? That's not rhetorical; correct me if I'm wrong).

A workaround I've been thinking of is to write a script that iterates through the CSV, counting commas and adding each comma-separated item to a DataFrame row until that row holds all 86 fields (85 commas), then starts filling the next row. But I'd need help actually writing that.

Final note: when the platform generates the CSV, commas are removed from all fields, so the only commas in the CSV are those used as delimiters.
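For illustration, the comma-counting workaround described above could be sketched roughly as follows. This is only a sketch under assumptions: the helper name `merge_broken_rows` and its `n_fields` parameter are hypothetical, it relies on the guarantee that fields never contain commas, and it cannot tell a continuation line from a new record whose first physical line has no comma at all. A physical line is glued onto the current record as long as the running comma count stays at or below `n_fields - 1`.

```python
def merge_broken_rows(lines, n_fields):
    """Rejoin physical lines into logical records by counting commas.

    Hypothetical helper: assumes no field ever contains a comma, so every
    complete record holds exactly n_fields - 1 delimiters.
    """
    needed = n_fields - 1   # delimiters in one complete record
    records = []
    parts = []              # physical lines of the record being assembled
    commas = 0              # commas seen so far in that record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue        # drop blank lines left by consecutive breaks
        count = line.count(",")
        # If adding this line would exceed the quota, it starts a new record.
        if parts and commas + count > needed:
            records.append(" ".join(parts))
            parts, commas = [], 0
        parts.append(line)
        commas += count
    if parts:               # flush the final record
        records.append(" ".join(parts))
    return records


# Small sample with 4 fields instead of 86.
sample = [
    'Header0,H1,H2,H86\n',
    'Name0,ABC,123,"Hello\n',
    ' my name is ABC.\n',
    ' I live at 123."\n',
    'Name2,GHI,789,"Good bye"\n',
]
merged = merge_broken_rows(sample, n_fields=4)
```

For the real file, `n_fields` would be 86 and `lines` could be the open file object. A CSV parser that understands quoted fields is generally the more robust choice, since it does not depend on the comma count.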

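This should do it. The csv module treats line breaks inside quoted fields as part of the field, so you can read each logical row and strip the breaks from its cells:

import csv

# newline='' leaves line-break handling to the csv module
with open('path/to/input', newline='') as infile, open('path/to/output', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for row in csv.reader(infile):
        # remove embedded line breaks from every field
        writer.writerow([c.replace('\r', '').replace('\n', '') for c in row])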
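To see why reading with `csv.reader` keeps the rows intact, here is a small in-memory check. It is a sketch, not the platform's real export: the sample string and the `clean_rows` helper are made up for illustration, and it assumes the file uses straight double quotes (`"`) with no space between the comma and the opening quote. If the real file does have a space there, passing `skipinitialspace=True` to `csv.reader` may be needed.

```python
import csv
import io

# Hypothetical sample: the last field of the second record spans three lines.
raw = (
    'Header0,H1,H2,H86\r\n'
    'Name0,ABC,123,"Hello\r\n my name is ABC.\r\n I live at 123."\r\n'
    'Name2,GHI,789,"Good bye"\r\n'
)


def clean_rows(text):
    """Parse CSV text and strip line breaks embedded in any field."""
    # newline='' keeps the original line endings so csv.reader sees them as-is
    reader = csv.reader(io.StringIO(text, newline=''))
    return [
        [cell.replace('\r', '').replace('\n', '') for cell in row]
        for row in reader
    ]


rows = clean_rows(raw)
for row in rows:
    print(row)
```

The reader yields three rows of four fields each; the line breaks at the end of each record are consumed as row separators, while the ones inside the quotes stay in the field until the `replace` calls remove them.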


 