Python - Column in CSV file contains multiple delimiters and results
I have quite a large CSV file with several ordinary columns (no embedded delimiters) and one column whose results use three delimiters.
The main delimiter is ";", which separates days of results.
The second delimiter is ":", which separates results per day (I am only using 2 results out of a possible 6).
The third delimiter is "/", which separates the result date and the calendar value of the result.
I want to avoid looping through the "X&Y" column as much as possible, because each cell contains many delimited results and there are a lot of rows.
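To make the nesting of the three delimiters concrete, here is a plain-Python sketch that parses a single cell from the sample data. It loops, so it only illustrates the format and is not the approach to use on a large file. It assumes, as in the sample rows, that X and Y are the last two ":"-separated fields.

```python
# One cell from the sample "X&Y" column
cell = "20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6"

records = []
for day in cell.split(";"):                       # ";" separates days
    fields = day.split(":")                       # ":" separates the per-day fields
    date, calendar_value = fields[0].split("/")   # "/" separates date and calendar value
    x, y = fields[-2], fields[-1]                 # assumption: X and Y are the last two fields
    records.append((date, calendar_value, x, y))

print(records)
```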
Col1 | Col2 | X&Y |
---|---|---|
A | B | 20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6 |
AA | BB | 20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66 |
I want to see:
Col1 | Col2 | Date | CalendarValue | X | Y |
---|---|---|---|---|---|
A | B | 20200331 | 1D | 1 | 2 |
A | B | 20200401 | 2D | 3 | 4 |
A | B | 20200402 | 3D | 5 | 6 |
AA | BB | 20210330 | 1Y | 11 | 22 |
AA | BB | 20220330 | 2Y | 33 | 44 |
AA | BB | 20230330 | 3Y | 55 | 66 |
import pandas as pd
df = pd.DataFrame({'Col1':['A','AA'], 'Col2':['B', 'BB'], 'Col3':['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6','20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})
Here is a solution you can try out: split each cell on the main delimiter (;), then use explode to turn every list element into its own row. Follow that with str.extract to pull the fields out with a regular expression, and finally concat the frames to get the resulting frame.
import pandas as pd
import re
df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})
# split each cell on ";" so every row holds a list of per-day strings
df['Col3'] = df['Col3'].str.split(";")
# extract features from the string; the named groups become the column names
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")
# explode yields one row per day and repeats the original index,
# which concat then uses to line the rows up with Col1 and Col2
pd.concat([
    df.drop(columns='Col3'),
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)
Out[*]:
Col1 Col2 Date CalendarValue X Y
0 A B 20200331 1D 1 2
0 A B 20200401 2D 3 4
0 A B 20200402 3D 5 6
1 AA BB 20210330 1Y 11 22
1 AA BB 20220330 2Y 33 44
1 AA BB 20230330 3Y 55 66
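The result above keeps the repeated index (0, 0, 0, 1, 1, 1) and leaves X and Y as strings. One possible variation, sketched here under a few assumptions, uses DataFrame.join instead of concat to align on the index, resets the index, and converts X and Y to integers. The names `pattern`, `parts` and `result` are only illustrative, and the integer conversion assumes every X and Y value is a whole number, as in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["A", "AA"],
    "Col2": ["B", "BB"],
    "Col3": [
        "20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6",
        "20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66",
    ],
})

# Same regular expression as above, passed as a plain string
pattern = r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)"

# One row per day; the exploded Series repeats the original row label
parts = df["Col3"].str.split(";").explode().str.extract(pattern)

# join aligns on the index, so each original row is repeated once per day
result = df.drop(columns="Col3").join(parts).reset_index(drop=True)

# Assumption: X and Y are always whole numbers
result[["X", "Y"]] = result[["X", "Y"]].astype(int)

print(result)
```

If the file really is large, it may be worth timing this against the concat version on a sample, since both do the same split, explode and extract work.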