
How to split single line csv file into multiple lines in Python

Can you help me figure out how to split a large single-line csv file into separate rows with Python?

Sample file:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.02022-07-23

Note: if I open the file in Excel, the column names are shown once. In a text editor they appear twice, as above.

Desired output:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0

Thanks!

Screenshot of the actual file

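Each group of values begins with a date/time sequence. Define a regular expression that matches those, find the offsets at which the pattern occurs, and then slice the string as follows: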

import re

s = 'Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col82022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.02022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.02022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.02022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.02022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.02022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.02022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0'

# Find where each timestamp (YYYY-MM-DD HH:MM:SS) starts in the string.
m = re.finditer(r'([0-9]{4}-[0-9]{2}-[0-9]{2}\s[0-9]{2}:[0-9]{2}:[0-9]{2})', s)
offsets = [0] + [m_.start(0) for m_ in m] + [len(s)]

# Each row is the slice between two consecutive offsets.
for o in range(len(offsets)-1):
    print(s[offsets[o]:offsets[o+1]])

Output:

Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col8
2022-07-23 03:00:00,101.346,4378.85,1106,37949,8737.0,11490.0,11412.0
2022-07-23 03:00:05,15.706,3765.808,274,30575,20486.0,151905.0,150725.0
2022-07-23 03:00:10,71.937,4507.922,845,39332,11654.0,31340.0,30925.0
2022-07-23 03:00:15,82.942,4246.146,937,36611,9177.0,3840.0,3974.0
2022-07-23 03:00:20,29.969,4122.618,408,33957,7657.0,3685.0,3733.0
2022-07-23 03:00:25,12.656,3630.578,190,29440,3671.0,2656.0,2663.0
2022-07-23 03:00:30,8.692,3240.102,108,26290,2576.0,2358.0,2359.0

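Notes:

For the purposes of this example, the duplicated column headers were removed from the source string. I don't know why the input repeats the column headers; it is presumably a mistake by whoever generated the file.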
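If the duplicated header has to be handled in code rather than removed by hand, something along these lines might work. This is only a sketch: the names `TIMESTAMP` and `split_records` are made up for illustration, the sample string is shortened to three columns, and it assumes the doubled header is two identical halves back to back. It uses a zero-width lookahead with `re.split`, which needs Python 3.7 or later.

```python
import re

# Zero-width lookahead: split *before* each timestamp without consuming it.
TIMESTAMP = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def split_records(text):
    """Return the header followed by the data rows of a single-line CSV string."""
    chunks = [c for c in TIMESTAMP.split(text.strip()) if c]
    header, rows = chunks[0], chunks[1:]
    # Assumption: a doubled header is two identical halves back to back.
    half = len(header) // 2
    if len(header) % 2 == 0 and header[:half] == header[half:]:
        header = header[:half]
    return [header] + rows


# Shortened, hypothetical sample in the same shape as the question's file.
sample = (
    "Col1,Col2,Col3Col1,Col2,Col3"
    "2022-07-23 03:00:00,101.346,11412.0"
    "2022-07-23 03:00:05,15.706,150725.0"
)

for line in split_records(sample):
    print(line)
```

Splitting on a lookahead keeps the timestamp attached to the row it starts, so no offsets need to be tracked.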

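For those who are interested or looking at the same situation, here is how I solved the problem. It may not be the cleanest or most elegant solution, but it works for me.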

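import csv

file_name = "input.csv"  # placeholder path; point this at your own file

# Open the output once, so rows are not overwritten on every iteration.
with open(file_name, "r", newline="") as file, open("data.csv", "w", newline="") as f:
    w = csv.writer(f, delimiter=",")
    for line in file:
        # Keep the text after the ")D" and ")2" markers, then split the records.
        split1 = line.split(")D")
        split2 = split1[1].split(")2")
        # Note: split() removes the ".02022-" text it splits on.
        split3 = split2[1].split(".02022-")
        w.writerows([x.split(",") for x in split3])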
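One thing to watch with `str.split(".02022-")` is that the separator is removed, so each row loses its trailing `.0` and the following row loses its leading `2022-`. A possible variation that keeps every character is sketched below. It rests on assumptions: the last value of each row ends in `.0`, every timestamp starts with `2022-`, and the names `BOUNDARY` and `rows_from_line` are hypothetical. An `io.StringIO` stands in for the output file so the example runs on its own.

```python
import csv
import io
import re

# Zero-width boundary: right after a trailing ".0" and right before "2022-".
BOUNDARY = re.compile(r"(?<=\.0)(?=2022-)")


def rows_from_line(data):
    """Split one long line into CSV rows without dropping any characters."""
    return [chunk.split(",") for chunk in BOUNDARY.split(data)]


# Shortened, hypothetical sample with the header already stripped.
data = (
    "2022-07-23 03:00:00,101.346,11412.0"
    "2022-07-23 03:00:05,15.706,150725.0"
)

out = io.StringIO()  # stand-in for open("data.csv", "w", newline="")
csv.writer(out).writerows(rows_from_line(data))
print(out.getvalue())
```

Because the pattern matches a position and not text, nothing is consumed by the split.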
