So I have this dataset where there are sometimes random newline characters entered into some cells, and I need to delete them.
this is what I've tried:
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        outf.write(line.replace('\n', ''))
Unfortunately, this removes ALL newline characters, including the ones at the end of each row, which turns my CSV file into one big line.
Does anyone know how I can delete only the random newline characters and not the 'real' end-of-line characters?
Edit: If it helps, each 'real' new line starts with a 6-digit string of numbers (apart from the header line). Maybe some regex pattern that looks ahead to check for that number string could work?
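A lookahead along those lines might work. A minimal sketch, assuming every real row (and nothing else) starts with six digits, and treating the final trailing newline as real (the sample data is made up):

```python
import re

# Made-up sample: one stray newline splits the second row in two.
text = "id,name\n123456,hello wor\nld\n654321,ok\n"

# Drop any newline that is NOT followed by six digits or the end of the text.
cleaned = re.sub(r'\n(?!\d{6}|\Z)', '', text)
print(cleaned)
```

This would mis-fire if a field ever starts with six digits right after a stray newline, so the column-count approaches below are more robust.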
Edit2: I've tried using pandas to edit it with:
import pandas as pd

df = pd.read_csv(filepath)
for i in df.columns:
    if df[i].dtype == object:
        df[i] = df[i].str.replace('\n', '')
Weirdly, this works if I copy the contents of the .csv into a new text file, but it doesn't work on my original csv file, and I'm not sure why.
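One possible reason (an assumption, not confirmed in this thread): `pd.read_csv` splits records on the physical newlines before `.str.replace` ever runs, so an unquoted stray newline has already broken the row in two and no single cell still contains a `'\n'` to remove. A sketch of that effect on a made-up sample:

```python
# Made-up sample: a stray, unquoted newline inside the second record.
raw = "id,val\n123456,hel\nlo\n123457,world\n"

# Any line-based CSV parser sees four physical lines for three intended rows,
# so no individual cell contains a newline left to strip.
lines = raw.splitlines()
print(len(lines))  # 4
print(lines[2])    # 'lo' -- the orphaned tail of the second row
```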
Final Edit:
So big thanks to DDS for his help. Managed to get it to work using this:
num_cols = 48
buf = ""
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        if len(line.split(',')) < num_cols:
            buf += line.replace('\n', '')
            if len(buf.split(',')) == num_cols:
                outf.write(buf + '\n')
            else:
                continue
            buf = ""
        else:
            outf.write(line)
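As a sanity check, the same buffering logic can be exercised on an in-memory sample (`num_cols = 3` and the data are made up for illustration):

```python
import io

num_cols = 3
inf = io.StringIO("a,b,c\n123456,x,y\n123457,br\noken,z\n")
outf = io.StringIO()

buf = ""
for line in inf:
    if len(line.split(',')) < num_cols:
        # short row: a stray newline split it, so keep accumulating
        buf += line.replace('\n', '')
        if len(buf.split(',')) == num_cols:
            outf.write(buf + '\n')
            buf = ""
    else:
        outf.write(line)

print(outf.getvalue())  # '123457,br' and 'oken,z' are rejoined into one row
```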
There are multiple ways you can achieve this.
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        outf.write(line.replace('\n', '') + '\n')
Or pass n - 1 as the number of newline characters to replace:
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        outf.write(line.replace('\n', '', line.count('\n') - 1))
import re

result = re.sub('\n*(?=.*\n)', '', 'ansd\nasdn\naskd\n')
print(result)
# result == 'ansdasdnaskd\n'
First check whether the line is empty; if so, drop its newline, otherwise write it unchanged:
for line in inf:
    if len(line.strip()) == 0:
        outf.write(line.replace('\n', ''))
    else:
        outf.write(line)
Assuming you know the number of fields per line and no field contains the csv separator (comma), you could do it like this:
number_of_columns_in_the_table = 5  # assuming a line has 5 columns
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        # check if the number of splits equals the number of fields
        if len(line.split(',')) < number_of_columns_in_the_table:
            outf.write(line.replace('\n', ''))
        else:
            outf.write(line)
EDIT
number_of_columns_in_the_table = 5  # assuming a line has 5 columns
buf = ""
with open('in_filepath') as inf, open('out_filepath', 'w') as outf:
    for line in inf:
        # accumulate partial lines until the number of fields is complete
        if len(line.split(',')) < number_of_columns_in_the_table:
            buf += line.replace('\n', '')
            if len(buf.split(',')) == number_of_columns_in_the_table:
                outf.write(buf + '\n')
                buf = ""
        else:
            outf.write(line)