
How to split single line csv file into multiple lines in Python

Could you please help me to figure out how to split a large single line csv file into rows with Python?

Sample File:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.02022-07-23

Note: If I open the file in Excel, the column names are displayed once. In a text editor, they show twice as above.

Required output

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0

Thank you!

Screenshot of actual file

Each group of values starts with a date/time sequence. Define a regular expression that matches it, find the offset of every match, and then slice the string at those offsets:

s = 'Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0'

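import re

# Each row starts with a timestamp such as "2022-07-23 03:00:00".
m = re.finditer(r'[0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}', s)
# Slice boundaries: start of the string, start of each timestamp, end of the string.
offsets = [0] + [m_.start(0) for m_ in m] + [len(s)]

for o in range(len(offsets)-1):
    print(s[offsets[o]:offsets[o+1]])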

Output:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.0
2022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.0
2022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.0
2022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.0
2022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.0
2022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0

Note:

The repeated column headers were removed from the source string for the purpose of this example. I have no idea why the input repeats the column headers; that must be an error on the part of whoever generates the file.
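If the duplicated header has to be handled in code rather than removed by hand, the following is a minimal sketch. It assumes the header block is exactly the same text written twice and that every row starts with a `YYYY-MM-DD HH:MM:SS` timestamp. The names `TIMESTAMP` and `split_rows` are made up for illustration. Splitting on a zero-width lookahead requires Python 3.7 or later.

```python
import re

# Zero-width lookahead: matches the position just before each timestamp,
# so splitting here does not consume any characters of the row.
TIMESTAMP = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def split_rows(text):
    """Return [header, row1, row2, ...] from a single-line CSV string."""
    parts = TIMESTAMP.split(text)
    header, rows = parts[0], parts[1:]

    # Assumption: a duplicated header is the same string repeated twice.
    half = len(header) // 2
    if len(header) % 2 == 0 and header[:half] == header[half:]:
        header = header[:half]

    return [header] + rows


if __name__ == "__main__":
    sample = "A,BA,B2022-07-23 03:00:00,1.0,2.02022-07-23 03:00:05,3.0,4.0"
    for line in split_rows(sample):
        print(line)
```

A header that is repeated with different spacing or with separator characters in between would not be detected by this check and would need its own rule.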

For anyone interested or searching for a similar situation, below is how I solved the issue. It may not be the cleanest or most elegant approach, but it did the trick for me.

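import csv

# file_name is the path to the single-line source file.
with open(file_name, "r", newline="") as file:
    for line in file:
        # ")D" and ")2" are markers specific to my file.
        split1 = line.split(")D")
        split2 = split1[1].split(")2")
        # Note: splitting on ".02022-" drops those characters from each row.
        split3 = split2[1].split(".02022-")

        # Open the output once and write every row in a single call.
        with open("data.csv", "w", newline="") as f:
            w = csv.writer(f, delimiter=",")
            w.writerows([x.split(",") for x in split3])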
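The split above depends on markers specific to that one file, and splitting on ".02022-" removes those characters from the rows. As a hedged alternative, here is a rough end-to-end sketch that combines the timestamp-offset idea from the other answer with `csv.writer`. It assumes every row starts with a `YYYY-MM-DD HH:MM:SS` timestamp and that anything before the first timestamp can be discarded. The names `ROW_START`, `rewrite_as_rows` and the optional `header` argument are hypothetical, not part of either answer.

```python
import csv
import re

# Assumption: every data row begins with a timestamp in this format.
ROW_START = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def rewrite_as_rows(src_path, dst_path, header=None):
    """Read a single-line CSV and write one row per timestamp.

    Text before the first timestamp is skipped; pass `header` (a list of
    column names) to write a header row explicitly. Returns the row count.
    """
    with open(src_path, "r", newline="") as src:
        text = src.read().strip()

    # Offsets where each row starts, plus the end of the text.
    starts = [m.start() for m in ROW_START.finditer(text)]
    bounds = starts + [len(text)]

    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        if header is not None:
            writer.writerow(header)
        # Slice between consecutive offsets so no characters are lost.
        for begin, end in zip(bounds, bounds[1:]):
            writer.writerow(text[begin:end].split(","))

    return len(starts)
```

It could be called as `rewrite_as_rows("input.csv", "data.csv", header=["Col1", "Col2"])`, where both file names are placeholders. Reading the whole file into memory is fine for moderately large files; a very large input would need a chunked approach instead.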
