
Python - Column in CSV file contains multiple delimiters and results

I have quite a large CSV file with several ordinary columns (no embedded delimiters) and one column whose results use three delimiters.

The main delimiter is ";", which separates days of results.

The second delimiter is ":", which separates the results within a day (I am only using 2 results out of a possible 6).

The third delimiter is "/", which separates the result date from the calendar value of the result.

I want to avoid looping through the "X&Y" column as much as possible, because each cell contains many delimited results and there are a lot of rows.
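To make the nesting of the three delimiters concrete, here is a minimal plain-Python sketch that parses a single cell. The helper name `parse_cell` is hypothetical, and it assumes (based on the sample data) that X and Y are always the last two ":"-separated fields. It loops, so it only illustrates the format and is not meant as the fast solution for a large file.

```python
def parse_cell(cell):
    """Parse one X&Y cell into (date, calendar_value, x, y) tuples."""
    rows = []
    for day in cell.split(";"):              # ";" separates days of results
        fields = day.split(":")              # ":" separates the results within a day
        date, calendar_value = fields[0].split("/")  # "/" separates date and calendar value
        x, y = fields[-2], fields[-1]        # assumption: X and Y are the last two fields
        rows.append((date, calendar_value, x, y))
    return rows


cell = "20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6"
parsed = parse_cell(cell)
for row in parsed:
    print(row)
```

All values come back as strings; any conversion to dates or numbers would be a separate step.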

Col1  Col2  X&Y
A     B     20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6
AA    BB    20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66

I want to see:

Col1  Col2  Date      CalendarValue  X   Y
A     B     20200331  1D             1   2
A     B     20200401  2D             3   4
A     B     20200402  3D             5   6
AA    BB    20210330  1Y             11  22
AA    BB    20220330  2Y             33  44
AA    BB    20230330  3Y             55  66

import pandas as pd
df = pd.DataFrame({'Col1':['A','AA'], 'Col2':['B', 'BB'], 'Col3':['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6','20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

Here is a solution you can try: split each cell on the main delimiter (;), then use explode to turn every list element into its own row. After that, use str.extract with a regex of named groups to pull out the fields, and finally concat the frames to get the resulting frame.

import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

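# split each cell on the main delimiter, so every row holds a list of per-day strings
df['Col3'] = df['Col3'].str.split(";")

# extract features from the string; the named groups become the output column names,
# and ":+" consumes the run of empty fields between CalendarValue and X
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")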

pd.concat([
    df.drop(columns='Col3'),
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)

Out[*]:

  Col1 Col2      Date CalendarValue   X   Y
0    A    B  20200331            1D   1   2
0    A    B  20200401            2D   3   4
0    A    B  20200402            3D   5   6
1   AA   BB  20210330            1Y  11  22
1   AA   BB  20220330            2Y  33  44
1   AA   BB  20230330            3Y  55  66
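Note that the extracted columns are all strings and the index repeats (0, 0, 0, 1, 1, 1). As a follow-up sketch, not part of the original answer, the variant below swaps concat for DataFrame.join to attach Col1 and Col2, resets the index, and converts the types. The stricter pattern and the choice of datetime and int dtypes are assumptions based on the sample data only.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

# Assumption: dates are always 8 digits and X/Y are the last two ":"-separated fields
pattern = r"(?P<Date>\d{8})/(?P<CalendarValue>\w+):+(?P<X>[^:]+):(?P<Y>[^:]+)$"

# One row per day of results; explode() keeps the original row label on each piece
parts = df['Col3'].str.split(";").explode().str.extract(pattern)

# join() aligns on the index, so Col1/Col2 repeat for every exploded row
result = df.drop(columns='Col3').join(parts).reset_index(drop=True)

# Convert from strings to proper dtypes (assumes X and Y are always integers)
result['Date'] = pd.to_datetime(result['Date'], format='%Y%m%d')
result[['X', 'Y']] = result[['X', 'Y']].astype(int)

print(result)
```

If X or Y can be empty or non-numeric in the real file, pd.to_numeric with errors='coerce' may be a safer conversion than astype(int).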

