Could you please help me figure out how to split a large single-line CSV file into rows with Python?
Sample File:
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.02022-07-23
Note: If I open the file in Excel, the column names are displayed once. In a text editor, they show twice as above.
Required output
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
Thank you!
Each group of values starts with a date/time sequence. Define a regular expression that matches it, find the offsets at which the pattern occurs, and then slice the string at those offsets:
import re

s = 'Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0'
# Raw string so that \s reaches the regex engine unchanged; each match marks the start of a row.
m = re.finditer(r'[0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2}', s)
# Row boundaries: start of the string, start of every timestamp, end of the string.
offsets = [0] + [m_.start(0) for m_ in m] + [len(s)]
for o in range(len(offsets)-1):
    print(s[offsets[o]:offsets[o+1]])
Output:
Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.0
2022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.0
2022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.0
2022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.0
2022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.0
2022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0
Note:
I removed the repeated column headers from the source string for the purpose of this example. I have no idea why the input repeats them; that is presumably an error on the part of whoever generates the file.
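If the doubled header has to be handled in code rather than removed by hand, the same idea can be sketched with re.split and a zero-width lookahead, which keeps each timestamp attached to its row. The names ROW_START and split_rows are hypothetical, and the header check assumes the duplicate is exactly the same text written twice back to back, as in the sample file; treat this as a sketch under those assumptions rather than a general solution.

```python
import re

# Assumed row marker: a timestamp such as "2022-07-23 03:00:00".
# The lookahead matches the position just before it without consuming it.
ROW_START = re.compile(r"(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})")


def split_rows(text):
    """Split a single-line CSV string into (header, list_of_row_strings)."""
    # Splitting on an empty (zero-width) match requires Python 3.7 or later.
    header, *rows = ROW_START.split(text)
    # Assumption: a doubled header is the same text repeated twice in a row.
    half = len(header) // 2
    if len(header) % 2 == 0 and header[:half] == header[half:]:
        header = header[:half]
    return header, rows
```

Like the offset-based version, this relies on the timestamp pattern not appearing by accident where one row's last value runs into the next row's date.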
For those interested or searching for a situation like this, below is how I solved the issue. Maybe not the cleanest or most elegant, but it did the trick for me.
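import csv

file_name = "input.csv"  # placeholder path; replace with your single-line source file

with open(file_name, "r", newline="") as file:
    for line in file:
        # The ")D" and ")2" markers are specific to my file.
        split1 = line.split(")D")
        split2 = split1[1].split(")2")
        # Caveat: str.split drops the separator, so each row loses its trailing ".0" and its leading "2022-".
        split3 = split2[1].split(".02022-")
        # Open the output once and write all rows in a single call.
        with open("data.csv", "w", newline="") as f:
            w = csv.writer(f, delimiter=",")
            w.writerows([x.split(",") for x in split3])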
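One thing to watch with split(".02022-") is that str.split discards the separator, so the last column of each row loses its ".0" and the following date loses its "2022-". A minimal sketch of putting those pieces back is shown here; split_keep and write_rows are hypothetical helper names, and the tail and head defaults simply mirror the separator used above, so they only fit data where every row ends in ".0" and every date starts with "2022-".

```python
import csv
import io


def split_keep(text, tail=".0", head="2022-"):
    """Split on tail+head, then give each half back to the row it came from."""
    parts = text.split(tail + head)
    last = len(parts) - 1
    rows = []
    for i, part in enumerate(parts):
        if i > 0:
            part = head + part   # restore the start of the date
        if i < last:
            part = part + tail   # restore the end of the last column
        rows.append(part)
    return rows


def write_rows(rows, out):
    """Write comma-separated row strings to an already open text file object."""
    csv.writer(out).writerows(row.split(",") for row in rows)


# Small demonstration with an in-memory buffer instead of a real file.
demo = io.StringIO()
write_rows(split_keep("2022-07-23 03:00:00,1.02022-07-23 03:00:05,2.0"), demo)
print(demo.getvalue())
```

For a real file, the buffer would be replaced by something like open("data.csv", "w", newline=""), matching the output file used above.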