Python: parsing a multi-line string to extract character and digit substrings

Question

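This is a follow-up to an earlier question of mine. I now understand the problem more clearly and need some further advice :)

I have a string, produced by a machine learning algorithm, that generally has the following structure:

at the beginning and at the end, there can be some lines that contain no characters other than whitespace;
in the middle, there should be 2 lines, each containing a name (just the surname, or first name and surname, or the initial of the first name plus the surname...), followed by some numbers and (sometimes) other characters mixed in between the numbers;
one of the names is generally preceded by a special non-alphanumeric character (>, >>, @, ...).

Something like this: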

Connery  3 5 7 @  4
>> R. Moore 4 5 67| 5 [

I need to extract the 2 names and the numbers, and check whether one of the lines starts with a special character, so my output should be:

name_01 = 'Connery'
digits_01 = [3, 5, 7, 4]
name_02 = 'R. Moore'
digits_02 = [4, 5, 67, 5]
selected_line = 2 (anything indicating that it's the second line)

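In the linked original question, I was advised to use:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # \w+ grabs every run of letters, digits and underscores
    matches = re.findall(r'\w+', line)
    print(matches)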

which produces a result very close to what I want:

['Connery', '3', '5', '7', '4']
['R', 'Moore', '4', '5', '67', '5']

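But I need the first two strings on the second line ('R', 'Moore') to be grouped together (basically, everything before the numbers start should be one string). It also skips the detection of the special character. Should I fix this output somehow, or can I tackle the problem in a completely different way?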

Answer 1

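I'm not sure which characters you want to keep or remove, but the following should work for the example:

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
for line in lines:
    # first alternative: a name made of letters, dots and inner whitespace;
    # second alternative: any other word token (here, the numbers)
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+', line)
    print(matches)

Output: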

['Connery', '3', '5', '7', '4']
['R. Moore', '4', '5', '67', '5']

Note: I allowed a-z (lower and upper case) and dots, with optional whitespace in between: [a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.], but you should adapt this to your real needs.
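To get from these match lists to the name and the integer list the question asks for, one option is to treat purely numeric tokens as numbers and the first remaining token as the name. This is only a minimal sketch, assuming each line carries a single name that comes before the numbers; `split_name_and_digits` and `TOKEN_RE` are hypothetical names, not part of the answer above:

```python
import re

# Same pattern as in the answer: multi-word names (letters, dots, spaces) or plain word tokens
TOKEN_RE = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+')

def split_name_and_digits(line):
    """Hypothetical helper: return (name, digits) for one line, or None if nothing matched."""
    tokens = TOKEN_RE.findall(line)
    if not tokens:
        return None
    # Assumption: tokens made only of decimal digits are numbers,
    # and the first token that is not a number is the name
    names = [t for t in tokens if not t.isdecimal()]
    digits = [int(t) for t in tokens if t.isdecimal()]
    return (names[0] if names else None), digits

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
for line in inp.split('\n'):
    print(split_name_and_digits(line))
# ('Connery', [3, 5, 7, 4])
# ('R. Moore', [4, 5, 67, 5])
```

Whitespace-only lines produce no tokens, so the helper returns None for them and the caller can skip those.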

Answer 2

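This will also include the special characters (bear in mind that they are hard-coded, so you have to add any missing characters to the [>@]+ part of the regex):

import re

# `lines` is the list built with inp.split('\n') in the earlier snippets
for line in lines:
    matches = re.findall(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+', line)
    print(matches)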
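Once the marker characters show up as tokens, the selected line can be found by checking whether the first token of a line consists only of marker characters. A rough sketch, assuming the marker set is just `>` and `@` as in the regex above; `is_selected` and `MARKERS` are hypothetical names:

```python
import re

MARKERS = '>@'  # assumption: extend this together with the [>@]+ part of the pattern
TOKEN_RE = re.compile(r'(?:[a-zA-Z.][a-zA-Z.\s]+[a-zA-Z.])|\w+|[>@]+')

def is_selected(line):
    """Hypothetical helper: True if the first token on the line is a marker such as '>>'."""
    tokens = TOKEN_RE.findall(line)
    return bool(tokens) and all(ch in MARKERS for ch in tokens[0])

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''
lines = inp.split('\n')
# 1-based numbers of the lines that start with a marker
selected_lines = [i for i, line in enumerate(lines, start=1) if is_selected(line)]
print(selected_lines)  # [2]
```

Only the first token is inspected, so the stray @ between the numbers on the first line does not mark that line as selected.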

Answer 3

This is best done in several steps.

import re

inp = '''Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 ['''

# get the whitespace at start and end out
lines = inp.strip().split('\n')
for line in lines:
    # for each line, identify the selection mark, the name, and the mess at the end
    # assuming names can't have numbers in them
    match = re.match(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$', line.strip())
    if match:
        selected_raw, name, numbers_raw = match.groups()
        # now parse the unprocessed bits
        selected = selected_raw is not None
        numbers = re.findall(r'\d+', numbers_raw)
        print(selected, name, numbers)

# output
False Connery ['3', '5', '7', '4']
True R. Moore ['4', '5', '67', '5']
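The same steps can be wrapped in a function that returns integers and reports which line carries the marker, which is closer to the output format requested in the question. This is a sketch under the same assumption (names contain no digits); `parse_block` and `LINE_RE` are hypothetical names, and the sample input adds blank lines at both ends to mimic the described structure:

```python
import re

# Same pattern as above: optional marker, name without digits, trailing non-letter part
LINE_RE = re.compile(r'^(\W+)?([^\d]+?)\s*([^a-zA-Z]+)$')

def parse_block(text):
    """Hypothetical helper: return a list of (name, digits, selected), one per parsable line."""
    results = []
    for line in text.strip().split('\n'):
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and lines without the "marker? name numbers" shape
        selected_raw, name, numbers_raw = match.groups()
        digits = [int(n) for n in re.findall(r'\d+', numbers_raw)]
        results.append((name, digits, selected_raw is not None))
    return results

inp = '''

Connery  3 5 7 @  4
    >> R. Moore 4 5 67| 5 [

'''
rows = parse_block(inp)
# 1-based position of the first row whose line started with a marker, or None
selected_line = next((i for i, row in enumerate(rows, start=1) if row[2]), None)
print(rows)           # [('Connery', [3, 5, 7, 4], False), ('R. Moore', [4, 5, 67, 5], True)]
print(selected_line)  # 2
```

Returning a list rather than fixed variables such as name_01 and name_02 keeps the helper usable when the input unexpectedly contains fewer or more than two name lines.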

3 answers

Answer 1
0 2021-10-26 12:24:10

Answer 2
0 2021-10-26 12:28:30

Answer 3
0 2021-10-26 12:32:42