
Python - Column in CSV file contains multiple delimiters and results

I have quite a large CSV file with several plain columns (no embedded delimiters) and one column whose results are packed together using three delimiters.

The main delimiter is ";", which separates days of results.

The second delimiter is ":", which separates the results within a day (I am only using 2 results out of a possible 6).

The third delimiter is "/", which separates the result day and the calendar value of the result.
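
To make the nesting of the three delimiters concrete, here is a minimal plain-Python sketch that takes one cell apart. The helper name `parse_cell` is hypothetical, and it assumes the two results in use are always the last two ":"-separated fields, as in the sample data.

```python
def parse_cell(cell):
    """Split one cell into (date, calendar_value, x, y) tuples, one per day."""
    rows = []
    for day in cell.split(";"):          # ";" separates the days
        fields = day.split(":")          # ":" separates the results within a day
        # "/" separates the result date from its calendar value
        date, calendar_value = fields[0].split("/")
        # Assumption: the two results in use are the last two fields
        x, y = fields[-2], fields[-1]
        rows.append((date, calendar_value, x, y))
    return rows


parsed = parse_cell("20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6")
print(parsed)
```

This loops over a single cell only, to illustrate the layout; it is not meant as the row-by-row solution for the whole file.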

I want to avoid looping through the "X&Y" column as much as possible as the column itself contains many delimited results, and there are a lot of rows.

Col1 Col2 X&Y
A B 20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6
AA BB 20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66

I want to see:

Col1 Col2 Date CalendarValue X Y
A B 20200331 1D 1 2
A B 20200401 2D 3 4
A B 20200402 3D 5 6
AA BB 20210330 1Y 11 22
AA BB 20220330 2Y 33 44
AA BB 20230330 3Y 55 66

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

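Here is a solution you can try: split the column on the main delimiter (";"), then use explode to turn each day into its own row. After that, str.extract with a regular expression pulls the individual fields out of each entry, and concat joins them back onto the remaining columns to give the resulting frame.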

import pandas as pd
import re

df = pd.DataFrame({'Col1': ['A', 'AA'], 'Col2': ['B', 'BB'],
                   'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
                            '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66']})

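# split each cell on the main delimiter, giving one list of day entries per row
df['Col3'] = df['Col3'].str.split(";")

# named groups become the column names of the extracted frame
extract_ = re.compile(r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>.+):(?P<Y>.+)")

pd.concat([
    # the original columns, without the packed column
    df.drop(columns='Col3'),
    # one row per day entry, then one column per named group
    df['Col3'].explode().str.extract(extract_, expand=True)
], axis=1)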

Out[*]:

  Col1 Col2      Date CalendarValue   X   Y
0    A    B  20200331            1D   1   2
0    A    B  20200401            2D   3   4
0    A    B  20200402            3D   5   6
1   AA   BB  20210330            1Y  11  22
1   AA   BB  20220330            2Y  33  44
1   AA   BB  20230330            3Y  55  66
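
Note that the output above keeps the original row labels (0, 0, 0, 1, 1, 1) and that the extracted values are strings. If a fresh index and typed columns are wanted, the following is one possible sketch. It assumes Date is always in YYYYMMDD form and that X and Y are always integers, which holds for the sample data but is not confirmed for the full file. It also uses join instead of concat and `[^:]+` instead of `.+` for X and Y; both are substitutions made for this sketch, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A', 'AA'],
    'Col2': ['B', 'BB'],
    'Col3': ['20200331/1D::::1:2;20200401/2D::::3:4;20200402/3D::::5:6',
             '20210330/1Y::::11:22;20220330/2Y::::33:44;20230330/3Y::::55:66'],
})

# Same named groups as above; X and Y are restricted to non-colon characters
pattern = r"(?P<Date>\w+)/(?P<CalendarValue>\w+):+(?P<X>[^:]+):(?P<Y>[^:]+)"

# One row per day entry, one column per named group (index repeats per source row)
parts = df['Col3'].str.split(';').explode().str.extract(pattern)

# Join on the index so every extracted row picks up its Col1/Col2 values,
# then replace the repeated labels with a fresh 0..n-1 index
result = df.drop(columns='Col3').join(parts).reset_index(drop=True)

# Assumption: dates are YYYYMMDD and X/Y are whole numbers
result['Date'] = pd.to_datetime(result['Date'], format='%Y%m%d')
result = result.astype({'X': int, 'Y': int})

print(result)
```

With the index reset, each row has a unique label, which makes later filtering or merging on the result less error-prone.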



 