[英]Dropping unnecessary text from data using regex and applying it entire dataframe
我有一个表格,其中有多种格式的日期。 它还有一些我想删除的不需要的文本,以便我可以处理这个日期字符串
Data :
sr.no. col_1 col_2
1 'xper may 2022 - nov 2022' 'derh 06/2022 - 07/2022 ubj'
2 'sp@ 2021 - 2022' 'zpt May 2022 - December 2022'
Expected Output :
sr.no. col_1 col_2
1 'may 2022 - nov 2022' '06/2022 - 07/2022'
2 '2021 - 2022' 'May 2022 - December 2022'
def keep_valid_characters(string):
return re.sub(r'(?i)\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|jun(e)?|jul(y)?|aug(ust)?|sep(tember)?|oct(ober)?|nov(ember)?|dec(ember)?)\b|[^a-z0-9/-]', '', string)
我正在使用上述模式下降但卡住了。 任何其他方法。
在复杂的情况下,您可以尝试将模式构造拆分为多个字符串,如下所示:
months = r"jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?"
pat = rf"(?i)((?:{months})?\s*[\d/]+\s*-\s*(?:{months})?\s*[\d/]+)"
df[["col_1", "col_2"]] = df[["col_1", "col_2"]].transform(lambda x: x.str.extract(pat)[0])
print(df)
印刷:
sr.no. col_1 col_2
0 1 may 2022 - nov 2022 06/2022 - 07/2022
1 2 2021 - 2022 May 2022 - December 2022
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.