Split strings that contain more than one substring
I have a list of strings, names:
names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
I want to split the strings that contain more than one of the following substrings:
substrings = ['Vice president', 'Affiliate', 'Acquaintance']
More precisely, I want to split after the last character of the word that follows the substring:
desired_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose']
I don't know how to implement the 'more than one' condition in my code:
import re

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
substrings = re.compile(r'Vice\spresident|Affiliate|Acquaintance')
splitted = []
# Attempt only: a compiled pattern cannot be used with `in`, and `item` is undefined.
for i in names:
    if substrings in i:
        splitted.append([])
    splitted[-1].append(item)
Exception: when that last character is a period (e.g. Prof.), split after the second word following the substring.
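The 'more than one' condition from the attempt above can be checked by counting regex matches instead of using `in`. This is a minimal sketch that assumes the three titles from substrings; the helper name count_titles is hypothetical and not part of the original code.

```python
import re

names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']

# Same alternation as in the attempt above.
pattern = re.compile(r'Vice\spresident|Affiliate|Acquaintance')


def count_titles(s):
    """Return how many title occurrences s contains (hypothetical helper)."""
    return len(pattern.findall(s))


# Strings with more than one title are the ones that need splitting.
needs_split = [s for s in names if count_titles(s) > 1]
print(needs_split)
```

Counting only identifies which strings need work; the splitting itself still has to happen separately.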
Update: names is more complex than I thought and follows these patterns:
1) 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose'
2) 'Mister Kelly, AWS'
3) 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary'
names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose', 'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, Secretary for Relations Dr. Jakob, Secretary']
Sometimes Secretary is followed by varying specifications. I don't care about the characters that sometimes follow Secretary until the next name occurs. They can be dropped. Of course, 'Secretary' itself should be stored as in updated_output.
I created a (hopefully exhaustive) list, specifications, of the text that can follow Secretary. Here is the list:
specifications = ['', ' of State', ' for Relations', ' for the Interior', ' for the Environment']
Updated question: how can I account for the third pattern using the specifications list?
updated_output = ['Acquaintance Muller', 'Vice president Johnson', 'Affiliate Peterson', 'Acquaintance Dr. Rose', 'Vice president Dr. John', 'Mister Schmid, PRT', 'Miss Robertson, FDU', 'Mister Kelly, AWS', 'Dr. Birker, Secretary of State', 'Dr. Dews, Member', 'Miss Berg, Secretary for Relations', 'Dr. Jakob, Secretary']
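One possible way to approach the updated input is sketched below. It rests on several assumptions that are not stated in the question: that 'Mister' and 'Miss' are the only additional personal titles, that 'Dr.' starts a new entry unless one of the three original titles comes directly before it, and that the text after Secretary should be dropped as described (the drop_specifications flag is there because updated_output keeps some of that text). The names person_titles, split_entries and drop_specifications are hypothetical.

```python
import re

role_titles = ['Vice president', 'Affiliate', 'Acquaintance']
# Assumption: these personal titles are read off the sample data only.
person_titles = ['Mister', 'Miss']
specifications = ['', ' of State', ' for Relations',
                  ' for the Interior', ' for the Environment']

# Split before any role or personal title ...
starts = "|".join(map(re.escape, role_titles + person_titles))
# ... and before "Dr." unless a role title comes directly before it.
not_after_role = "".join(rf"(?<!{re.escape(t)} )" for t in role_titles)
splitter = re.compile(rf"\b(?=(?:{starts})\b)|{not_after_role}\b(?=Dr\.)")

# Matches "Secretary" plus one non-empty specification at the end of an entry.
spec_alternation = "|".join(re.escape(s) for s in specifications if s)
spec_tail = re.compile(rf"(Secretary)(?:{spec_alternation})$")


def split_entries(names, drop_specifications=True):
    """Split every string into entries; optionally trim text after Secretary."""
    out = []
    for n in names:
        for piece in splitter.split(n):
            piece = piece.strip()
            if not piece:
                continue
            if drop_specifications:
                piece = spec_tail.sub(r"\1", piece)
            out.append(piece)
    return out


names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose',
         'Vice president Dr. John Mister Schmid, PRT Miss Robertson, FDU',
         'Mister Kelly, AWS',
         'Dr. Birker, Secretary Dr. Dews, Member Miss Berg, '
         'Secretary for Relations Dr. Jakob, Secretary']
print(split_entries(names))
```

The chained negative lookbehinds are used because Python's re module requires each lookbehind to have a fixed width. Splitting on a zero-width pattern needs Python 3.7 or later.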
You want to split at the word boundary just before one of those three titles, so you can look for a word boundary \b followed by a positive lookahead (?=...) for one of those titles:
>>> import re
>>> s = 'Vice president Johnson affiliate Peterson acquaintance Dr. Rose'
>>> v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
>>> v
['', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
Then, you can trim and discard the empty results:
>>> v = [x for i in v if (x := i.strip())]
>>> v
['Vice president Johnson', 'affiliate Peterson', 'acquaintance Dr. Rose']
With a list of input strings, simply apply this treatment to all of them:
import re

def get_names(s):
    # Split at each word boundary that is directly followed by a title.
    v = re.split(r"\b(?=Vice president|affiliate|acquaintance)", s, flags=re.I)
    # Strip whitespace and drop empty pieces.
    return [x for i in v if (x := i.strip())]

names = ['Acquaintance Muller', 'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
output = []
for n in names:
    output.extend(get_names(n))
Which gives:
output = ['Acquaintance Muller',
'Vice president Johnson',
'Affiliate Peterson',
'Acquaintance Dr. Rose']
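If the titles are kept in a list, as substrings is in the question, the lookahead pattern could be built from that list instead of being written by hand. The following is a minimal sketch of that idea; build_splitter and split_names are hypothetical names, and splitting on a zero-width match requires Python 3.7 or later.

```python
import re

substrings = ['Vice president', 'Affiliate', 'Acquaintance']


def build_splitter(titles):
    """Compile a pattern matching the empty position just before any title."""
    alternation = "|".join(map(re.escape, titles))
    return re.compile(rf"\b(?={alternation})", flags=re.I)


def split_names(names, titles):
    """Split each string before every title and drop empty pieces."""
    splitter = build_splitter(titles)
    out = []
    for n in names:
        out.extend(p.strip() for p in splitter.split(n) if p.strip())
    return out


names = ['Acquaintance Muller',
         'Vice president Johnson Affiliate Peterson Acquaintance Dr. Rose']
print(split_names(names, substrings))
```

Escaping each title with re.escape keeps characters such as a period from being read as regex syntax if the list changes later.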
Try:
import re

names = [
    "acquaintance Muller",
    "Vice president Johnson affiliate Peterson acquaintance Dr. Rose",
]
substrings = ["Vice president", "affiliate", "acquaintance"]

# One alternation that matches any of the substrings literally.
r = re.compile("|".join(map(re.escape, substrings)))

out = []
for n in names:
    # Offsets at which a substring starts.
    starts = [i.start() for i in r.finditer(n)]

    if not starts:
        out.append(n)
        continue

    # Keep any text that precedes the first match.
    if starts[0] != 0:
        starts = [0, *starts]

    starts.append(len(n))
    # Slice between consecutive offsets.
    for a, b in zip(starts, starts[1:]):
        out.append(n[a:b])

print(out)
Prints:
['acquaintance Muller', 'Vice president Johnson ', 'affiliate Peterson ', 'acquaintance Dr. Rose']
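The slicing step pairs each start offset with the next one. Here is a small sketch of that pairing on a single string, with .strip() added as one way to remove the trailing spaces visible in the printed result; the pattern is written out by hand here only to keep the example short.

```python
import re

s = "Vice president Johnson affiliate Peterson acquaintance Dr. Rose"
r = re.compile("Vice president|affiliate|acquaintance")

# Offsets where a title begins, plus the end of the string as a final bound.
starts = [m.start() for m in r.finditer(s)]
starts.append(len(s))

# Pair every offset with its successor and slice between them.
pieces = [s[a:b].strip() for a, b in zip(starts, starts[1:])]
print(pieces)
```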