
Split strings that contain more than one substring

I have a list of strings, names:

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

I want to split the strings that contain more than one of the following substrings:

substrings = ['Vice president', 'Affiliate', 'Acquaintance']

More precisely, I want to split after the last character of the word that follows the substring:

desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']

I don't know how to implement the 'more than one' condition in my code:

import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
for i in names:
    if substrings in i:  # fails: a compiled pattern cannot be tested with `in`
        splitted.append([])
    splitted[-1].append(item)  # `item` is never defined

Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.


Update: names is more complex than I thought and follows these patterns:

  1. the title-like pattern already answered correctly ('Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose')
  2. until a second pattern of strings follows ('Mister Kelly, AWS')
  3. until a third pattern of strings follows until the end ('Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary')

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary until the next name occurs; they can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.

I created a (hopefully exhaustive) list, specifications, of the text that can follow Secretary. Here is a representation of the list:

specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

Updated question: how can I account for the third pattern using the specifications list?

updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']

You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:

>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']

Then, you can trim the pieces and discard the empty results:

>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']

With a list of input strings, simply apply this treatment to all of them:

import re

def get_names(s):
    # Split before each title (case-insensitive), then drop empty pieces.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    return [x for i in v if (x := i.strip())]


names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

output = []
for n in names:
    output.extend(get_names(n))

Which gives:

output = ['Acquaintance Muller',
 'Vice president Johnson',
 'Affiliate Peterson',
 'Acquaintance Dr. Rose']
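
The rule as worded in the question (take one word after the title, or two when the first one ends in a period) can also be written as a match instead of a split. The following is only a sketch under that reading; the pattern and the names `titles`, `entry` and `result` are illustrative assumptions, not part of the answer above.

```python
import re

names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = ['Vice president', 'Affiliate', 'Acquaintance']

# Alternation of the titles, escaped so they are matched literally.
titles = "|".join(map(re.escape, substrings))

# A title, whitespace, optionally one word ending in "." (e.g. "Dr."),
# then exactly one more word.
entry = re.compile(rf"(?:{titles})\s+(?:\S+\.\s+)?\S+")

# The pattern has no capturing groups, so findall returns whole matches.
result = [m for n in names for m in entry.findall(n)]
print(result)
```

Unlike the split-based version, this drops any text that does not start with one of the titles, so it only fits input where every entry begins with a known title.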

Try:

import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    for a, b in zip(starts, starts[1::]):
        out.append(n[a:b])

print(out)

Prints:

['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
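
For the updated names list in the question, the same split-before-a-marker idea might be extended along the following lines. This is a rough sketch, not a tested general solution: it assumes that every entry starts with one of the three titles, with 'Mister' or 'Miss', or with a 'Dr.' that does not directly follow a title, and that Python 3.7+ is used (re.split on zero-width matches). The names `honorifics`, `boundary`, `trailing_spec` and `split_entries` are hypothetical.

```python
import re

names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
         'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
         'Mister Kelly, AWS',
         'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']

titles = ['Vice president', 'Affiliate', 'Acquaintance']
honorifics = ['Mister', 'Miss']  # assumption: read off the sample data
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']

starters = "|".join(map(re.escape, titles + honorifics))
# "Dr." opens a new entry only when it does not directly follow a title.
# Lookbehinds must be fixed-width in Python, hence one per title.
not_after_title = "".join(f"(?<!{re.escape(t)} )" for t in titles)
boundary = re.compile(rf"\b(?=(?:{starters})\b)|{not_after_title}\b(?=Dr\.)")

# Optional clean-up: remove a specification that trails "Secretary".
spec_alternatives = "|".join(re.escape(s) for s in specifications if s)
trailing_spec = re.compile(rf"(?<=Secretary)(?:{spec_alternatives})$")

def split_entries(s, drop_specifications=False):
    # Split at every entry boundary, trim, and discard empty pieces.
    parts = [p.strip() for p in boundary.split(s)]
    parts = [p for p in parts if p]
    if drop_specifications:
        parts = [trailing_spec.sub("", p) for p in parts]
    return parts

result = [e for n in names for e in split_entries(n)]
print(result)
```

With `drop_specifications=True`, 'Miss Berg, Secretary for Relations' becomes 'Miss Berg, Secretary'. The sketch never adds text that is missing from the input, so it yields 'Dr. Birker, Secretary' rather than the 'Dr. Birker, Secretary of State' shown in updated_output.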
