
Removing line breaks in a CSV file

I have a CSV file in which each line begins with "@" and the fields within a line are separated by ";". One of the fields, the text field (wrapped as ""[ ... ]""), contains line breaks that cause errors when I import the file into Excel or Access: the text after each line break is treated as a separate row that does not follow the structure of the table.

@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO!
la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras.
+info: co/plHcfSIfn8]""; 0
@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0

Can anyone help with this, using a Python script or any other solution?

As output, I would like to have these lines:

@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; ""[OJO! la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras. +info: co/plHcfSIfn8]""; 0
@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; ""[Porque nunca dejamos de amar]""; 0

Any help? I have a CSV file (54 MB) in which many lines contain such line breaks, while other lines are fine.

You should share your expected output as well.

Anyway, I suggest you first clean the file by removing the newline characters inside the records; then you can read it as CSV. One possible solution (I believe someone will suggest something better :-) ):

Clean the file (on Linux):

sed ':a;N;$!ba;s/\n/ /g' input_file | sed -E 's/ (@[0-9]+;)/\n\1/g' > output_file
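
If sed is not available (on Windows, for example), roughly the same clean-up can be sketched in Python. This is only a sketch under one assumption taken from the sample rows: every real record starts with "@", some digits and a ";". The names join_broken_records and clean_file are made up for this example. It reads the file line by line, so a 54 MB file should not be a problem.

```python
import re

# Assumption (from the sample rows): a record starts with "@", digits, then ";".
RECORD_START = re.compile(r"@\d+;")


def join_broken_records(lines):
    """Yield one string per record, folding continuation lines into the record before them."""
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if current is None or RECORD_START.match(line):
            # A new record begins: emit the finished one first.
            if current is not None:
                yield current
            current = line
        elif line.strip():
            # Continuation of the text field: glue it on with a single space.
            current += " " + line
    if current is not None:
        yield current


def clean_file(src, dst, encoding="utf-8"):
    """Write a copy of src to dst with one record per line (hypothetical helper)."""
    with open(src, encoding=encoding) as fin, open(dst, "w", encoding=encoding) as fout:
        for record in join_broken_records(fin):
            fout.write(record + "\n")


# Usage, with placeholder file names:
# clean_file("input_file", "output_file")
```

Unlike a plain " @" replacement, a mention such as @cinencuentro inside the text does not start a new record here, because it is not followed by digits and ";".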

Read the file as CSV (you can use any other method):

import pandas as pd
df = pd.read_csv('output_file', delimiter=';', header=None)
df.to_csv('your_csv_file_name', index=False)
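
Before importing into Excel or Access, it may be worth checking that every cleaned row really has the same number of fields. Here is a small sketch with the standard csv module; field_count_histogram is a hypothetical helper name, and the expectation of 7 fields per row is only inferred from the two sample rows in the question. Quotes are treated as ordinary characters (csv.QUOTE_NONE), since the text field in the sample is wrapped in doubled quotes rather than in regular CSV quoting.

```python
import csv
from collections import Counter


def field_count_histogram(lines, delimiter=";"):
    """Return a Counter mapping 'number of fields in a row' to 'number of such rows'."""
    reader = csv.reader(
        lines,
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,   # keep the "" around the text field as literal characters
        skipinitialspace=True,    # drop the space that follows each ";"
    )
    # Empty rows (blank lines) are skipped.
    return Counter(len(row) for row in reader if row)


# Usage, with a placeholder file name:
# with open("output_file", encoding="utf-8", newline="") as f:
#     print(field_count_histogram(f))
```

If the histogram has a single key, all rows have the same shape; any other keys point to rows that are still broken.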

Let's see if it helps you :-)

You can search for line breaks that are followed by a line that doesn't start with "@" and a numeric ID, like this: \r?\n+(?!@\d+;)

The following was generated from this regex101 demo. It replaces such line breaks with a space; you can change the replacement to whatever you like.

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"\r?\n+(?!@\d+;)"

test_str = ("@4627289301; Lima, Peru; 490; 835551022915420161; Sat Feb 25 18:04:22 +0000 2017; \"\"[OJO!\n"
    "la premiacin de los #Oscar, nuestros amigos de @cinencuentro revisan las categoras.\n"
    "+info: co/plHcfSIfn8]\"\"; 0\n"
    "@624974422; None; 114; 835551038581137416; Sat Feb 25 18:04:26 +0000 2017; \"\"[Porque nunca dejamos de amar]\"\"; 0")

subst = " "

# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
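
To apply the same regex to a real file rather than a test string, something along these lines might work. It is a sketch that assumes the file is UTF-8 and small enough to read into memory in one go (54 MB should be fine on most machines); remove_inner_line_breaks and clean_csv are names invented for this example. The line ending at the very end of the file is stripped first, because the lookahead would otherwise turn it into a trailing space.

```python
import re
from pathlib import Path

# Same pattern as above: a line break NOT followed by "@", digits and ";".
LINE_BREAK_INSIDE_RECORD = re.compile(r"\r?\n+(?!@\d+;)")


def remove_inner_line_breaks(text, replacement=" "):
    """Replace line breaks that sit inside a record; keep the ones between records."""
    body = text.rstrip("\r\n")  # the final newline is not inside a record
    return LINE_BREAK_INSIDE_RECORD.sub(replacement, body)


def clean_csv(src, dst, encoding="utf-8"):
    """Read src, join broken records, write the result to dst (hypothetical helper)."""
    text = Path(src).read_text(encoding=encoding)
    Path(dst).write_text(remove_inner_line_breaks(text) + "\n", encoding=encoding)


# Usage, with placeholder file names:
# clean_csv("input_file", "output_file")
```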
