Using a regular expression to remove unwanted text from data and applying it to a whole DataFrame

Question

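I have a table that contains dates in several formats. It also contains some unwanted text that I want to remove so that I can process the date strings.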

Data :

sr.no.           col_1                                col_2
1              'xper may 2022 - nov 2022'          'derh 06/2022 - 07/2022 ubj'
2              'sp@ 2021 - 2022'                   'zpt May 2022 - December 2022'

Expected Output :

sr.no.           col_1                                col_2
1              'may 2022 - nov 2022'               '06/2022 - 07/2022'
2              '2021 - 2022'                       'May 2022 - December 2022'

import re

def keep_valid_characters(string):
    return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)

I tried to strip the unwanted text with the pattern above, but I am stuck. Is there another approach?

Answer 1

For complex cases, you can try building the pattern from several smaller strings, like this:

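# Alternation of month names, abbreviated or full; kept separate so the main pattern stays readable.
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
# A range is: optional month + digits/slashes, a dash, then the same again.
# The whitespace after a month sits inside the optional group so a match never starts with a space.
pat = rf"(?i)((?:(?:{months})\s*)?[\d/]+\s*-\s*(?:(?:{months})\s*)?[\d/]+)"

# Extract the first match from every cell of both columns; cells without a match become NaN.
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])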
print(df)

Prints:

   sr.no.                col_1                     col_2
0       1  may 2022 - nov 2022         06/2022 - 07/2022
1       2          2021 - 2022  May 2022 - December 2022
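
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. The DataFrame is rebuilt from the sample rows in the question; the names `MONTHS`, `DATE_PART`, `DATE_RANGE` and the helper `extract_date_ranges` are made up for this illustration and are not part of pandas. It assumes each cell holds at most one `<date> - <date>` range and that the two sides are separated by a hyphen.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "sr.no.": [1, 2],
        "col_1": ["xper may 2022 - nov 2022", "sp@ 2021 - 2022"],
        "col_2": ["derh 06/2022 - 07/2022 ubj", "zpt May 2022 - December 2022"],
    }
)

MONTHS = (
    r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
)
# One side of the range: an optional month name followed by digits and slashes.
DATE_PART = rf"(?:(?:{MONTHS})\s*)?[\d/]+"
# Full range "<date> - <date>", case-insensitive, captured as a single group.
DATE_RANGE = rf"(?i)({DATE_PART}\s*-\s*{DATE_PART})"


def extract_date_ranges(frame, columns):
    """Return a copy of `frame` with each listed column reduced to its first date range."""
    out = frame.copy()
    for col in columns:
        # expand=False returns a Series for a single capture group; no match gives NaN.
        out[col] = out[col].str.extract(DATE_RANGE, expand=False)
    return out


cleaned = extract_date_ranges(df, ["col_1", "col_2"])
print(cleaned)
```

Extracting the part you want to keep tends to be easier than listing everything you want to delete with `re.sub`, because the junk around the dates can be arbitrary while the date range itself has a predictable shape. Cells that contain no range come back as NaN, so you may want to check for those before parsing the dates further.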

Solution 1
0 Accepted 2023-01-11 14:20:12