Reading a delimited file where one of the fields can be split over multiple lines (or not)

I have a delimited file that's causing me a bit of grief. It's pipe-delimited with 6 fields, but field 4 can be split over several lines or be empty. I need a way to remove the newlines from field 4.

Here's what I've got:

import csv

# header is constant
# fieldone|fieldtwo|three|four|five|six

content = """"asfdd|b|c|defg
ijklmnopque2
|record|sadfe

1324|b|c|defg
ijklmnopqu
dafdsasfde2asdf
dsfdsf
dsfadfadse2fdsase2
asdfasdfasfe2
|record|afasde

3243243|b|c|defg
ijklmnopque2
|record|adf

startrecord4|b|c||record|adf
"""

def extract():
    x = []
    y = []
    x = content.split('|')
    for item in x:
        if (len(item) > 4):
            y.append(item.replace('\n', '').replace('\r', ' '))
        else:
            y.append(item)
    print(y)


if __name__ == '__main__':
    extract()

This runs, but the problem is that it outputs everything in one row. I still need it to output individual records (4 in this case) without the embedded newlines, and I'm not sure how. Can I read the whole file with pandas.read_csv? Is there a better solution?

The header is constant across all records.

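Would it work for you to replace every double newline (the blank line between records) with a placeholder, remove the remaining single newlines, and then put a single newline back at each placeholder position? This relies on records being separated by a blank line, as in your sample, and on the placeholder not occurring in the data.

You can try: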

sth_unique = '#%@#'
c = content.replace('\n\n', sth_unique).replace('\n', '').replace(sth_unique, '\n')
print(c)

#"asfdd|b|c|defgijklmnopque2|record|sadfe
#1324|b|c|defgijklmnopqudafdsasfde2asdfdsfdsfdsfadfadse2fdsase2asdfasdfasfe2|record|afasde
#3243243|b|c|defgijklmnopque2|record|adf
#startrecord4|b|c||record|adf
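
If you then want the individual fields rather than one string per record, here is a minimal sketch of one way to go from the raw text to rows. It is only an illustration under the same assumption as above (records are separated by a blank line); the name parse_records, its n_fields parameter, and the sample data are made up for this example. Quoting is switched off because the sample in the question contains a stray double quote that the csv module would otherwise treat as the start of a quoted field.

```python
import csv
import re

# Hypothetical sample in the same shape as the question's data:
# six pipe-delimited fields, field 4 may span several lines or be empty.
sample = """a1|b|c|defg
ijkl
|record|x

a2|b|c||record|y
"""


def parse_records(text, n_fields=6):
    """Return one list of fields per record.

    Assumes records are separated by at least one blank line.
    """
    # Split into records on blank lines; \r?\n tolerates Windows line endings.
    blocks = re.split(r'(?:\r?\n){2,}', text.strip())
    # Drop the newlines left inside each record (the ones inside field 4).
    lines = [re.sub(r'[\r\n]+', '', block) for block in blocks]
    # QUOTE_NONE: treat double quotes as ordinary characters.
    rows = list(csv.reader(lines, delimiter='|', quoting=csv.QUOTE_NONE))
    for row in rows:
        if len(row) != n_fields:
            raise ValueError(f'expected {n_fields} fields, got {len(row)}: {row}')
    return rows


rows = parse_records(sample)
for row in rows:
    print(row)
# ['a1', 'b', 'c', 'defgijkl', 'record', 'x']
# ['a2', 'b', 'c', '', 'record', 'y']
```

As for pandas: once the newlines inside field 4 are gone, the cleaned string c from above can probably be wrapped in io.StringIO and passed to pandas.read_csv with sep='|' and header=None, though I have not tried that against your real file.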
