Extracting a substring with a regular expression in Python

Question

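I searched previous posts but could not find one that exactly matches what I am looking for, so here goes.

I am trying to parse strings in a dataframe and capture a certain substring (the year) when a match is found. The format can vary a lot. I came up with a not-so-elegant way to do it, but I would like to know whether there is a better one.

The strings can look like this: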

Random Text 31.12.2020
1.1. -31.12.2020
010120-311220
31.12.2020
1.1.2020-31.12.2020 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words

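I am looking for the year, and currently find it by locating the last date and taking its year. The current regex is .+3112(\d{2,4})|.+31\.12\.(\d{2,4}), which returns 20 in group 1 for 010120-311220 and 2020 in group 2 for 1.1.2020-31.12.2020 -.

The issue is that I cannot know beforehand which group the match will fall into: in the first example group 2 does not take part in the match, and in the second example group 1 returns None when using re.match(regexPattern, stringOfInterest). So I cannot access the value by naively calling .group(1) on the match object, because sometimes the value is in .group(2).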
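
To make the problem concrete, here is a minimal sketch that runs the pattern above on two of the sample strings and prints which group captured the year. The variable names are illustrative only and are not part of the original code.

```python
import re

# The pattern from the question: one capture group per date format.
pattern = re.compile(r'.+3112(\d{2,4})|.+31\.12\.(\d{2,4})')

compact = pattern.match('010120-311220')
dotted = pattern.match('1.1.2020-31.12.2020 -')

# Only the group belonging to the alternative that matched is filled in;
# the other one is None.
print(compact.groups())  # ('20', None)
print(dotted.groups())   # (None, '2020')
```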

The best approach I have come up with so far is to name the groups with (?P<groupName>\d{2,4}) and check for None:

import re

def getYear(stringOfInterest):
    # Raw string so that \d and \. reach the regex engine unchanged.
    regexPattern = r'(^|.+)3112(?P<firstMatchType>\d{2,4})|(^|.+)31\.12\.(?P<secondMatchType>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        matchDict = matchObject.groupdict()
        # Only one of the two named groups is set, depending on the date format.
        if matchDict['firstMatchType'] is not None:
            return matchDict['firstMatchType']
        else:
            return matchDict['secondMatchType']
    return None

df['year'] = df['text'].apply(getYear)

While this works, it intuitively feels like a clumsy way to do it. Any ideas?

Answer 1

It looks like all of your years are in the 21st century. In that case, all you need is

df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

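Details:

.* - any zero or more characters other than line break characters, as many as possible
31\.?12\.? - 31, an optional . character, 12, and another optional . character
(?:\d{2})? - an optional sequence of two digits
(\d{2}) - Group 1: the last two digits of the year.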
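
The breakdown above can be checked with the standard re module alone. The following is a small sketch on three of the sample strings (the names samples and years are illustrative). For 010120-311220 the optional (?:\d{2})? first takes the 20, then gives it back so that Group 1 can match; for the dotted format it consumes the century digits.

```python
import re

# Same pattern as in the answer, applied with the standard re module.
pattern = re.compile(r'.*31\.?12\.?(?:\d{2})?(\d{2})')

samples = ['010120-311220', '31.12.2020', '1.1.2019 - 31.12.2019']

# Group 1 always holds the last two digits of the year, so the century
# can simply be prepended.
years = ['20' + pattern.search(s).group(1) for s in samples]
print(years)  # ['2020', '2020', '2019']
```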

See a Pandas test:

import pandas as pd
df = pd.DataFrame({'text': ['Random Text 31.12.2020','1.1. -31.12.2020','010120-311220','31.12.2020','1.1.2020-31.12.2020 -','1.1.2019 - 31.12.2019','1.1. . . 31.12.2019 -','1.1.2019 - -31.12.2019','010120-311220 other random words']})
df['year'] = '20' + df['text'].str.extract(r'.*31\.?12\.?(?:\d{2})?(\d{2})', expand=False)

Output:

>>> df
                               text  year
0            Random Text 31.12.2020  2020
1                  1.1. -31.12.2020  2020
2                     010120-311220  2020
3                        31.12.2020  2020
4             1.1.2020-31.12.2020 -  2020
5             1.1.2019 - 31.12.2019  2019
6             1.1. . . 31.12.2019 -  2019
7            1.1.2019 - -31.12.2019  2019
8  010120-311220 other random words  2020

Answer 2

We can try using re.findall here on your list of inputs, with a regex alternation that covers both variants:

import re

inp = ["Random Text 31.12.2020", "1.1. -31.12.2020", "010120-311220", "31.12.2020", "1.1.2020-31.12.2020 -", "1.1.2019 - 31.12.2019", "1.1. . . 31.12.2019 -", "1.1.2019 - -31.12.2019", "010120-311220 other random words"]
output = [re.findall(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})', x)[-1] for x in inp]
output = [x[0] if x[0] else x[1] for x in output]
print(output)  # ['2020', '2020', '20', '2020', '2020', '2019', '2019', '2019', '20']

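The strategy here is to match either of the two date variants and keep the last match for each input. We then use a list comprehension to pick the non-empty value. Note that there are two capture groups, and only one of them will match for any given date.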
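
As a sketch of the same strategy for a single string, the helper below (last_year is a hypothetical name, not part of the answer) also guards against inputs with no date at all, where indexing the findall result with [-1] would raise an IndexError.

```python
import re

PATTERN = re.compile(r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})')

def last_year(text):
    """Return the year from the last date in text, or None if no date is found."""
    matches = PATTERN.findall(text)
    if not matches:
        return None
    # Each match is a (four_digit, two_digit) tuple with one empty string.
    four_digit, two_digit = matches[-1]
    return four_digit or two_digit

print(PATTERN.findall('1.1.2020-31.12.2020 -'))  # [('2020', ''), ('2020', '')]
print(last_year('010120-311220'))                # 20
print(last_year('no date here'))                 # None
```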

Answer 3

Your regex can be factored considerably by grouping only the alternation at the start of the date; this removes the need to check two groups:

regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'

After extracting the group, it can be normalized to a proper four-digit year:

if matchObject is not None:
    return ('20' + matchObject.group('year'))[-4:]
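
The slicing step can be seen in isolation in the short sketch below. It assumes, as the answer does, that every year is in the 2000s and that the captured text has either two or four digits; normalize_year is a hypothetical helper name.

```python
def normalize_year(year):
    # Prepend the century and keep the last four characters, so '20'
    # becomes '2020' while '2019' stays '2019'.
    return ('20' + year)[-4:]

print(normalize_year('20'))    # 2020
print(normalize_year('2019'))  # 2019
```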

Altogether, we get:

import re

def getYear(stringOfInterest):
    regexPattern = r'(?:^|.+)(?:3112|31\.12\.)(?P<year>\d{2,4})'
    matchObject = re.match(regexPattern, stringOfInterest)
    if matchObject is not None:
        return ('20' + matchObject.group('year'))[-4:]
    return None

df['year'] = df['text'].apply(getYear)

Answer 4

Here is my approach to your problem; maybe it will be useful.


import re
string = '''
Random Text 31.12.2020
1.1. -31.12.2022
010120-311220
31.12.2020
1.1.2020-31.12.2018 -
1.1.2019 - 31.12.2019
1.1. . . 31.12.2019 -
1.1.2019 - -31.12.2019
010120-311220 other random words'''
pattern = r'\d{1,2}\.\d{1,2}\.(\d{4})|\d{4}(\d{2})'
matches = re.findall(pattern, string)
print("1) ", matches)

# flatten the list of (four_digit, two_digit) tuples into a single list
match_array = [i for sub in matches for i in sub]
print(match_array)

# drop the empty strings left by the capture group that did not match
res = [element for element in match_array if element.strip()]
print(res)
