Splitting strings that contain multiple substrings

Question

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain several of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows a substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the "multiple" condition in my code:

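import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    # Start a new group whenever the string contains one of the titles.
    # This only detects a title; it does not split inside the string.
    if substrings.search(i):
        splitted.append([])
    splitted[-1].append(i)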

Exception: when the last character is a period (e.g. Prof.), split after the second word following the substring.

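Update: names is more complex than I thought.

It starts with the title-like pattern that has already been answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'),
then a second string pattern appears ('Mister Kelly, AWS'),
and finally a third string pattern runs to the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary').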

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

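Sometimes Secretary is followed by different specifications. I don't care about these roles that sometimes follow Secretary, up to the point where the next name appears. They can be discarded. Of course, 'Secretary' itself should be stored in updated_output.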

I created a (hopefully exhaustive) list, specifications, of what can come after Secretary. Here is a representation of the list: specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I use the specifications list to account for the third pattern?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

Answer 1

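You want to split at a word boundary immediately before one of these three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of the titles: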

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
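
The lookahead matters because it is zero-width: the split happens at the boundary without consuming the title. As a small illustration (same s as above; splitting on empty matches needs Python 3.7 or later), compare it with a pattern that consumes the titles. The names titles, consumed and kept are only used for this comparison:

```python
import re

s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
titles = r"Vice president|affiliate|acquaintance"

# Consuming pattern: the matched titles are removed from the result.
consumed = re.split(titles, s, flags=re.I)

# Zero-width lookahead: the split point sits just before each title,
# so every title stays attached to the name that follows it.
kept = re.split(rf"\b(?={titles})", s, flags=re.I)

print(consumed)
print(kept)
```

In both cases the leading empty string comes from the match at position 0.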

Then you can trim the pieces and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

With a list of input strings, just apply this processing to every string:

import re

def get_names(s):
    # Split just before each title, then trim and drop empty pieces.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    return [x for i in v if (x := i.strip())]


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

This gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
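
For the updated question, one possible extension of the same idea is sketched below. It assumes that every entry starts either with one of the role titles or with an honorific (Dr., Mister, Miss), and that an honorific directly after a role title belongs to that role (as in 'Vice president Dr. John'). The roles and honorifics lists and the split_entries helper are illustrative names, not part of the original answer. Under these assumptions the specifications list is not needed, because whatever follows Secretary simply stays with its entry until the next split point.

```python
import re

# Assumed vocabularies; extend them to match the real data.
roles = ['Vice president', 'Affiliate', 'Acquaintance']
honorifics = ['Dr.', 'Mister', 'Miss']

role_alt = '|'.join(map(re.escape, roles))
hon_alt = '|'.join(map(re.escape, honorifics))
# One fixed-width negative lookbehind per role: "not right after '<role> '".
not_after_role = ''.join(f'(?<!{re.escape(r)} )' for r in roles)

splitter = re.compile(
    rf'\b(?=(?:{role_alt}))'                  # split before a role title
    rf'|\b{not_after_role}(?=(?:{hon_alt}))'  # or before a free-standing honorific
)

def split_entries(s):
    # Split at the zero-width positions, then trim and drop empty pieces.
    return [x for part in splitter.split(s) if (x := part.strip())]

names = [
    'Acquaintance Muller',
    'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
    'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary',
]

updated = [entry for n in names for entry in split_entries(n)]
print(updated)
```

The chained lookbehinds are used because Python's re module only accepts fixed-width lookbehind patterns, so each role title gets its own. Entries that start with neither a role nor an honorific would not be separated by this sketch.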

Answer 2

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1::]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
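
The printed pieces still carry trailing spaces, and the match is case-sensitive. A minimal variation of the same finditer approach, assuming the capitalised titles from the question and adding re.IGNORECASE plus a strip() (both are additions for illustration, not part of the answer above), might look like this:

```python
import re

names = [
    "Acquaintance Muller",
    "Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose",
]
substrings = ["Vice president", "Affiliate", "Acquaintance"]

# re.IGNORECASE is an added assumption so that "affiliate" and "Affiliate" both match.
r = re.compile("|".join(map(re.escape, substrings)), flags=re.IGNORECASE)

out = []
for n in names:
    # Slice boundaries: start of string, every title start, end of string.
    bounds = sorted({0, len(n), *(m.start() for m in r.finditer(n))})
    for a, b in zip(bounds, bounds[1:]):
        piece = n[a:b].strip()  # drop the whitespace left at the slice edges
        if piece:
            out.append(piece)

print(out)
```

Collecting the boundaries in a set removes the need for the separate checks on an empty match list and on a match at position 0.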
