Replace certain values based on pattern and extract substring in pandas

I have a pandas DataFrame whose col1 contains dates in various formats:

 col1
 Q2 '20
 Q1 '21
 May '20
 June '20
 25/05/2020
 Q4 '20+Q1 '21
 Q2 '21+Q3 '21
 Q4 '21+Q1 '22

I want to replace certain values in col1 that match a pattern. For values that contain two quarters joined by "+", I want to return a season name followed by the first year contained in the pattern. All other values should stay as they are.

For example:

1) Q4 '20+Q1 '21 should be 'Winter 20'

2) Q2 '21+Q3 '21 should be 'Summer 21'

3) Q4 '21+Q1 '22 should be 'Winter 21'

Desired output:

col1
Q2 '20
Q1 '21
May '20
June '20
25/05/2020
Winter 20
Summer 21
Winter 21

I have tried a few methods such as replace, split and extract, but none of them solved the problem. A dictionary of literal values would not help, because the df is quite big and contains many variants of Q4 'XX+Q1 'XX and Q2 'XX+Q3 'XX.

You could do it by matching multiple patterns, one for each season:

import pandas as pd

df = pd.DataFrame({'col1': [
"Q2 '20",
"Q1 '21",
"May '20",
"June '20",
"25/05/2020",
"Q4 '20+Q1 '21",
"Q2 '21+Q3 '21",
"Q4 '21+Q1 '22"]})

seasons = {
r"Q4 '(\d*)\+Q1 .*": r'Winter \1',
r"Q1 '(\d*)\+Q2 .*": r'Spring \1',
r"Q2 '(\d*)\+Q3 .*": r'Summer \1',
r"Q3 '(\d*)\+Q4 .*": r'Autumn \1'
}

df.col1.replace(seasons, regex=True)

0        Q2 '20
1        Q1 '21
2       May '20
3      June '20
4    25/05/2020
5     Winter 20
6     Summer 21
7     Winter 21
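
To see what each pattern does to a single value, the mapping can be tried on plain strings with the standard `re` module. This is only a rough sketch of the per-element behaviour, not how pandas implements `replace`; the helper name `to_season` and the `samples` list are made up for illustration.

```python
import re

# Same pattern-to-replacement mapping as in the answer above.
seasons = {
    r"Q4 '(\d*)\+Q1 .*": r"Winter \1",
    r"Q1 '(\d*)\+Q2 .*": r"Spring \1",
    r"Q2 '(\d*)\+Q3 .*": r"Summer \1",
    r"Q3 '(\d*)\+Q4 .*": r"Autumn \1",
}

def to_season(value):
    # Hypothetical helper: try every pattern in turn on one string.
    # \1 in the replacement is the first captured year; the trailing .*
    # swallows the second quarter and year so they are dropped.
    for pattern, replacement in seasons.items():
        value = re.sub(pattern, replacement, value)
    return value

samples = ["Q2 '20", "25/05/2020", "Q4 '20+Q1 '21", "Q2 '21+Q3 '21", "Q4 '21+Q1 '22"]
results = [to_season(s) for s in samples]
print(results)
```

Values without a "+" between two quarters match none of the patterns, so they pass through unchanged.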

Or another version, which I think is more efficient because it matches only one regex. It relies on global variables, though, so I am not sure which version is better.

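import re

# Map each pair of consecutive quarters to its season.
seasons = {
'Q4Q1': 'Winter',
'Q1Q2': 'Spring',
'Q2Q3': 'Summer',
'Q3Q4': 'Autumn'
}
pattern = re.compile(r"(Q\d) '(\d*)\+(Q\d) .*")

def change_to_season(row):
    match = pattern.match(row)
    if match:
        # Unknown quarter pairs (e.g. Q1+Q3) are returned unchanged
        # instead of raising a KeyError.
        season = seasons.get(match.group(1) + match.group(3))
        if season is None:
            return row
        year = match.group(2)  # first year in the pattern
        return season + ' ' + year
    else:
        return row

df.col1.apply(change_to_season)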
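
A possible variation on the single-regex idea, sketched here with the standard library only: pass a function as the replacement to `re.sub`, so the lookup happens inside the substitution. The names `SEASONS`, `PAIR`, `season_repl` and `to_season` are hypothetical, and the stricter pattern (quarters limited to Q1 through Q4, both years required) is an assumption about the data.

```python
import re

# Hypothetical lookup keyed by (first quarter, second quarter).
SEASONS = {
    ("Q4", "Q1"): "Winter",
    ("Q1", "Q2"): "Spring",
    ("Q2", "Q3"): "Summer",
    ("Q3", "Q4"): "Autumn",
}

# Assumed format: Qn 'YY+Qm 'YY, where only the first year is captured.
PAIR = re.compile(r"(Q[1-4]) '(\d+)\+(Q[1-4]) '\d+")

def season_repl(match):
    # Called once per match; returns the text that replaces it.
    season = SEASONS.get((match.group(1), match.group(3)))
    if season is None:
        # Quarter pair with no season defined: keep the original text.
        return match.group(0)
    return f"{season} {match.group(2)}"

def to_season(value):
    # Strings that do not contain the pattern come back unchanged.
    return PAIR.sub(season_repl, value)

print(to_season("Q4 '20+Q1 '21"))
```

With a DataFrame, `to_season` could be passed to `df.col1.apply`. `Series.str.replace` also accepts a compiled pattern and a callable when `regex=True`, so `df.col1.str.replace(PAIR, season_repl, regex=True)` should be an equivalent option.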

# Sample data: copy the lines between the triple quotes to the clipboard.
'''
col1
Q2 '20
Q1 '21
May '20
June '20
25/05/2020
Q4 '20+Q1 '21
Q2 '21+Q3 '21
Q4 '21+Q1 '22
'''

import pandas as pd

df = pd.read_clipboard(sep="!")

print(df)

Output:

           col1
0         Q2 '20
1         Q1 '21
2        May '20
3       June '20
4     25/05/2020
5  Q4 '20+Q1 '21
6  Q2 '21+Q3 '21
7  Q4 '21+Q1 '22


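import re

# Compile once instead of on every call.
regex = re.compile(r"(Q[1-4]) '(\d+)\+(Q[1-4]) '(\d+)")

def regex_filter(val):
    # Splitting on the pattern leaves the captured groups
    # (quarter, year, quarter, year) for a matching value,
    # or the untouched string for anything else.
    result = [part for part in regex.split(val) if part]
    # Only the two pairs from the question are handled here.
    if 'Q3' in result:
        # result[1] is the first year, which is what the question asks for.
        result = 'Summer ' + result[1]
    elif 'Q1' in result:
        result = 'Winter ' + result[1]
    else:
        result = ''.join(result)

    return result

df['col1'] = df['col1'].apply(regex_filter)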



print(df)

Output:

         col1
0      Q2 '20
1      Q1 '21
2     May '20
3    June '20
4  25/05/2020
5   Winter 20
6   Summer 21
7   Winter 21
