Python - Column in a CSV file contains multiple delimiters and results

Question

I have quite a large CSV file with several plain columns (no embedded delimiters) and one column whose results use three delimiters.

The main delimiter is ";", which separates days of results.

The second delimiter is ":", which separates the results within a day (I am only using 2 results out of a possible 6).

The third delimiter is "/", which separates the result date from the calendar value of the result.
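
To make the three levels concrete, here is a minimal sketch of how a single cell breaks down using plain Python string methods. `parse_cell` is a hypothetical helper, and it assumes X and Y are always the last two ":"-separated fields, as in the sample data. It loops over the days, so it only illustrates the layout and is not a suggestion for processing the whole file.

```python
def parse_cell(cell):
    """Split one cell into (date, calendar_value, x, y) tuples, one per day."""
    rows = []
    for day in cell.split(";"):            # level 1: one chunk per day
        fields = day.split(":")            # level 2: results within the day
        date, calendar_value = fields[0].split("/")  # level 3: date / calendar value
        x, y = fields[-2], fields[-1]      # assumed: X and Y are the last two fields
        rows.append((date, calendar_value, x, y))
    return rows


cell = "20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6"
print(parse_cell(cell))
# [('20200331', '1D', '1', '2'), ('20200401', '2D', '3', '4'), ('20200402', '3D', '5', '6')]
```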

I want to avoid looping through the "X&Y" column as much as possible, because the column itself contains many delimited results and there are a lot of rows.

Col1	Col2	X&Y
A	B	20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6
AA	BB	20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66

I want to see:

Col1	Col2	Date	CalendarValue	X	Y
A	B	20200331	1D	1	2
A	B	20200401	2D	3	4
A	B	20200402	3D	5	6
AA	BB	20210330	1Y	11	22
AA	BB	20220330	2Y	33	44
AA	BB	20230330	3Y	55	66

import pandas as pd
df = pd.DataFrame({'Col1':['A','AA'], 'Col2':['B', 'BB'], 'Col3':['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6','20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

Answer 1

Here is a solution you can try: split on the main delimiter (;), then use explode to turn each day into its own row. After that, use extract with a regex to pull out the fields, and finally concat the frames to get the resulting frame.

import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

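# split each cell on ";" so every row holds a list of per-day strings
df['Col3'] = df['Col3'].str.split(";")

# extract features from each per-day string; the named groups become column names
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")

# explode the lists into one row per day (the original index is repeated),
# extract the fields, and attach them to the remaining columns by index
pd.concat([
    df.drop(columns='Col3'),
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)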

Out[*]:

  Col1 Col2      Date CalendarValue   X   Y
0    A    B  20200331            1D   1   2
0    A    B  20200401            2D   3   4
0    A    B  20200402            3D   5   6
1   AA   BB  20210330            1Y  11  22
1   AA   BB  20220330            2Y  33  44
1   AA   BB  20230330            3Y  55  66

Regex Demo
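
The pattern from the answer can also be checked locally with the standard `re` module. The sketch below applies it to one per-day entry and then to an unsplit string, to show why the split on ";" has to come first: the `X` group is greedy, so on an unsplit cell it runs on to the last ":" in the string. The variable names `match` and `unsplit` are only for illustration.

```python
import re

# Same pattern as in the answer above.
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")

# One per-day entry, i.e. what each row looks like after split(";") and explode().
match = extract_.match("20200331/1D::::1:2")
print(match.groupdict())
# {'Date': '20200331', 'CalendarValue': '1D', 'X': '1', 'Y': '2'}

# Without the split, the greedy X group swallows the following day as well.
unsplit = extract_.match("20200331/1D::::1:2;20200401/2D::::3:4")
print(unsplit.group("X"))
# 1:2;20200401/2D::::3
```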

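As an alternative to the regex, the same reshaping can be sketched with string splits only. This is not the approach from the answer above, just a variant under two stated assumptions: X and Y are always the last two ":"-separated fields (as in the sample data), and pandas is at least version 1.1, which `explode(..., ignore_index=True)` requires. The names `out`, `fields`, and `date_cal` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'AA'],
    'Col2': ['B', 'BB'],
    'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
             '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66'],
})

# One row per day: split on ';' and explode; Col1 and Col2 are repeated
# automatically, and ignore_index=True gives the result a fresh unique index.
out = df.assign(Col3=df['Col3'].str.split(';')).explode('Col3', ignore_index=True)

# Split each per-day string on ':' into separate columns.
fields = out['Col3'].str.split(':', expand=True)

# The first field is "Date/CalendarValue".
date_cal = fields[0].str.split('/', expand=True)
out['Date'] = date_cal[0]
out['CalendarValue'] = date_cal[1]

# Assumed: X and Y are the last two fields; convert them to numbers.
out['X'] = pd.to_numeric(fields.iloc[:, -2])
out['Y'] = pd.to_numeric(fields.iloc[:, -1])

out = out.drop(columns='Col3')
print(out)
```

Unlike the regex version, this leaves X and Y as numeric columns rather than strings, which may matter if the values are used in calculations later.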
Answer 1
0 Accepted 2021-07-23 06:37:32
