
Split strings that contain more than one substring

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows a substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the "more than one" condition in my code:

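import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    # A compiled pattern cannot be tested with `in`; use .search() instead.
    if substrings.search(i):
        splitted.append([])
    # This only groups whole strings; it does not split inside a string.
    splitted[-1].append(i)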

Exception: when the last character is a dot (e.g. Prof.), split after the second word following the substring.


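Update: names is more complex than I thought

  1. The title-like pattern, which has already been answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose')
  2. Until a second string pattern appears ('Mister Kelly, AWS')
  3. Until a third string pattern ends ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary')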

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

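Sometimes Secretary is followed by different specifications. I don't care about the roles that sometimes follow Secretary before the next name appears. They can be discarded. Of course, 'Secretary' itself should be stored in updated_output.

I created a (hopefully exhaustive) list, specifications, of what can follow Secretary. This is the list:

specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I use the specifications list to account for the third pattern?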

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

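You want to split at a word boundary just before one of these three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of the titles: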

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']

You can then strip whitespace and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
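To see why the lookahead matters, here is a small comparison sketch. The variable names `titles`, `consumed` and `kept` are illustrative only. A pattern that consumes the titles removes them from the result, whereas the zero-width lookahead only marks the split position. Splitting on an empty match like this relies on Python 3.7 or later.

```python
import re

s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
titles = r"Vice president|affiliate|acquaintance"

# Consuming pattern: the matched titles are the separators, so they disappear.
consumed = re.split(rf"\b(?:{titles})", s, flags=re.I)

# Lookahead pattern: the match is empty, so the titles stay in the pieces.
kept = re.split(rf"\b(?={titles})", s, flags=re.I)

print(consumed)  # ['', ' Johnson ', ' Peterson ', ' Dr. Rose']
print(kept)      # ['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
```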

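With a list of input strings, apply the same split-and-strip step to each string:

import re

def get_names(s):
    # Split at each word boundary that is immediately followed by a title.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    # Strip whitespace and drop empty pieces (the := operator needs Python 3.8+).
    return [x for i in v if (x := i.strip())]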


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

This gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
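The updated question (names that mix role titles, honorifics and trailing roles such as Secretary) can probably be approached with the same lookahead split. What follows is only a sketch under stated assumptions: the `HONORIFICS` list, the helper names `_alternation` and `split_entries`, and the `drop_specifications` flag are hypothetical and do not come from the original answers. The idea is to split before every role title or honorific, then glue a bare role title back onto the name that follows it. In this sketch the specifications list is not needed to find the split points; it is used only to optionally trim what follows Secretary.

```python
import re

# Assumed vocabulary (likely not exhaustive). A role title may be followed
# by an honorific, so a role title on its own never ends an entry.
ROLE_TITLES = ["Vice president", "Affiliate", "Acquaintance"]
HONORIFICS = ["Dr.", "Mister", "Miss"]
SPECIFICATIONS = ["", " of State", " for Relations", " for the Interior", " for the Environment"]


def _alternation(words):
    # Longest alternatives first so a longer one is tried before its prefix.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


# Zero-width split point before any role title or honorific.
SPLIT_RE = re.compile(rf"\b(?=(?:{_alternation(ROLE_TITLES + HONORIFICS)})(?!\w))")
# A trailing "Secretary<specification>", used only when specifications are dropped.
SPEC_RE = re.compile(rf"(Secretary)(?:{_alternation(s for s in SPECIFICATIONS if s)})$")


def split_entries(text, drop_specifications=False):
    pieces = [p.strip() for p in SPLIT_RE.split(text) if p.strip()]
    entries, pending = [], ""
    for piece in pieces:
        if piece in ROLE_TITLES:
            # A bare role title belongs to the name that follows it.
            pending += piece + " "
            continue
        entry = pending + piece
        pending = ""
        if drop_specifications:
            # Keep "Secretary" but remove a known specification after it.
            entry = SPEC_RE.sub(r"\1", entry)
        entries.append(entry)
    if pending:
        # A role title at the very end with no name after it.
        entries.append(pending.strip())
    return entries


names = [
    'Acquaintance Muller',
    'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
    'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
    'Mister Kelly, AWS',
    'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary',
]

updated = [entry for n in names for entry in split_entries(n)]
print(updated)
```

The sketch only returns text that is present in the input, so the last string yields 'Dr. Birker, Secretary' (the input carries no specification for that entry). Calling `split_entries(n, drop_specifications=True)` turns 'Miss Berg, Secretary for Relations' into 'Miss Berg, Secretary'. Matching is case-sensitive here; any honorific missing from `HONORIFICS` would not start a new entry.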

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1::]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
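The slices above keep the trailing space that precedes the next title. A minimal sketch of the same `finditer` approach that strips each slice is shown below; `split_at_matches` is a hypothetical helper name, and `re.IGNORECASE` is an added assumption so that capitalised titles match as well.

```python
import re


def split_at_matches(text, pattern):
    # Start offsets of every title match, plus 0 and len(text) as outer bounds.
    starts = [m.start() for m in pattern.finditer(text)]
    bounds = sorted({0, *starts, len(text)})
    # Slice between consecutive bounds; strip each piece and skip empty ones.
    return [piece for a, b in zip(bounds, bounds[1:]) if (piece := text[a:b].strip())]


substrings = ["Vice president", "affiliate", "acquaintance"]
pattern = re.compile("|".join(map(re.escape, substrings)), flags=re.IGNORECASE)

result = split_at_matches(
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose", pattern
)
print(result)  # ['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
```

Using a set for the bounds avoids a duplicate 0 when the string already starts with a title, so no special case is needed for that situation.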
