
How to remove newline characters in csv columns, without removing the endrow newline character?

So I have this dataset where there are sometimes random newline characters entered into some cells, and I need to delete them.

This is what I've tried:

with open ('filepath') as inf, open('filepath', 'w') as outf:
    for line in inf:
        outf.write(line.replace('\n', ''))

Unfortunately, this removes ALL newline characters, including the ones at the end of each row, which turns my CSV file into one big line.

Does anyone know how I can only delete the random newline characters and not the 'real' endline characters?

Edit: If it helps, each 'real' new line starts with a six-digit number (except for the header line). Maybe a regex with a lookahead that checks for that number could work?
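
A minimal sketch of that lookahead idea. It assumes the whole file fits in memory as one string, that the six-digit number is the first column and is therefore followed by a comma, and that no stray newline happens to be followed by six digits and a comma. The function name is made up for illustration.

```python
import re

def strip_stray_newlines(text: str) -> str:
    # Remove a newline unless what follows is six digits plus a comma
    # (assumed start of a real row) or the end of the text.
    return re.sub(r'\n(?!\d{6},|\Z)', '', text)

sample = 'id,name,note\n100001,Ann,hello\nworld\n100002,Bob,ok\n'
print(repr(strip_stray_newlines(sample)))
# 'id,name,note\n100001,Ann,helloworld\n100002,Bob,ok\n'
```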

Edit2: I've tried using pandas to edit it with:

df = pd.read_csv(filepath)

for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].str.replace('\n','')

Weirdly, this works if I copy the contents of the .csv into a new text file, but it doesn't work on my original csv file, and I'm not sure why.

Final Edit:

So big thanks to DDS for his help. Managed to get it to work using this:

num_cols = 48

buf = ""

with open (filepath) as inf, open (filepath, 'w') as outf:
    for line in inf:
        if len(line.split(',')) < num_cols:
            buf += line.replace('\n', '')
            if len(buf.split(',')) == num_cols:
                outf.write(buf+'\n')
            else: continue
            buf = ""
        else:
            outf.write(line)
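
The same buffering idea can be sketched as a small generator that is easy to try on a list of lines. The names `rejoin_rows` and `fix_file` are hypothetical. The sketch assumes no field contains a comma and that a stray newline never falls inside the last column, because a row that already has `num_cols` fields is treated as complete. It also reads from one path and writes to another, since opening the input file itself with `'w'` would truncate it before anything is read.

```python
def rejoin_rows(lines, num_cols):
    """Yield complete rows, joining fragments until num_cols fields are present."""
    buf = ""
    for line in lines:
        buf += line.rstrip("\n")
        if len(buf.split(",")) >= num_cols:
            yield buf + "\n"
            buf = ""
    if buf:
        # Trailing fragment that never reached num_cols fields; emit it as-is.
        yield buf + "\n"

def fix_file(src_path, dst_path, num_cols):
    # src_path and dst_path must be different files.
    with open(src_path) as inf, open(dst_path, "w") as outf:
        outf.writelines(rejoin_rows(inf, num_cols))

broken = ["id,name,note\n", "100001,An\n", "n,hello\n", "100002,Bob,ok\n"]
fixed = list(rejoin_rows(broken, num_cols=3))
print(fixed)
```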

There are multiple ways you can achieve this.

  1. Since you only care about keeping the last newline character, you can strip every newline and then append a single one to the end of the resulting string:
    with open('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            outf.write(line.replace('\n', '') + '\n')
  2. You can count the newline characters and pass n - 1 as the count argument of the replace method, so that every newline except the last one is replaced:
    with open('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            outf.write(line.replace('\n', '', line.count('\n') - 1))
  3. Use Python's re library with a lookahead, so that a newline is removed only if another newline follows later in the string:
    import re
    result = re.sub('\n*(?=.*\n)', '', 'ansd\nasdn\naskd\n')
    print(repr(result))
    # 'ansdasdnaskd\n'
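
A small illustration of the count argument from option 2, on a made-up string that holds one whole record with embedded newlines. The helper name is hypothetical. Note that a line obtained by iterating over a file contains at most one newline, so the count only has an effect on a string that spans several physical lines.

```python
def keep_last_newline(record: str) -> str:
    # Replace the first n - 1 newlines, leaving the final one in place.
    return record.replace("\n", "", record.count("\n") - 1)

record = "ansd\nasdn\naskd\n"
print(repr(keep_last_newline(record)))  # 'ansdasdnaskd\n'
```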

First check whether the line is empty, then write it:

    for line in inf:
        if len(line.strip()) == 0:
            outf.write(line.replace('\n', ''))
        else:
            outf.write(line)

Assuming you know the number of fields per line and no field contains the CSV separator (comma), you could do something like this:

    number_of_columns_in_the_table = 5  # assuming a line has 5 columns
    with open('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            # a line with fewer fields than expected is a fragment of a row,
            # so drop its newline to glue it to the next line
            if len(line.split(',')) < number_of_columns_in_the_table:
                outf.write(line.replace('\n', ''))
            else:
                outf.write(line)

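EDIT

    number_of_columns_in_the_table = 5  # assuming a line has 5 columns
    buf = ""  # collects the fragments of a row that was split across lines
    # the two paths must differ: opening the input file with 'w' truncates it
    with open('filepath') as inf, open('filepath', 'w') as outf:
        for line in inf:
            # a line with too few fields is a fragment of a broken row
            if len(line.split(',')) < number_of_columns_in_the_table:
                buf += line.replace('\n', '')
                # once the fragments add up to a full row, write it out
                if len(buf.split(',')) == number_of_columns_in_the_table:
                    outf.write(buf + '\n')
                    buf = ""
            else:
                outf.write(line)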
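
If fields that contain commas are quoted, counting fields with `split(',')` miscounts them. A hedged sketch of the same idea using the standard `csv` module to do the counting instead; this is a variation, not part of the answer above, and `repair_rows` is a made-up name. It still assumes an unquoted stray newline never falls inside the last column.

```python
import csv

def repair_rows(lines, num_cols):
    """Yield rows as lists of fields, merging fragments until num_cols fields exist."""
    buf = []
    for row in csv.reader(lines):
        if not row:
            continue  # a blank physical line carries no field data
        # A newline inside a quoted field is kept by csv.reader; drop it here.
        row = [field.replace("\n", "") for field in row]
        if buf:
            # The first field of this fragment continues the last buffered field.
            buf[-1] += row[0]
            buf.extend(row[1:])
        else:
            buf = row
        if len(buf) >= num_cols:
            yield buf
            buf = []
    if buf:
        yield buf  # incomplete trailing row, passed through unchanged

sample = [
    "id,name,note\n",
    '100001,"Smith, An\n',
    'n",hello\n',
    "100002,Bo\n",
    "b,ok\n",
]
for fields in repair_rows(sample, num_cols=3):
    print(fields)
```

The repaired rows can then be written back with `csv.writer(outf).writerows(...)`, which re-quotes any field that contains a comma.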


 