Extract all substrings between any permutations of two list items

I have two lists of strings and a text, and I want to find the substrings in the text that lie between an item from the first list and an item from the second list. Any combination is allowed, but the start delimiter must come from list1 and the end delimiter from list2.

text = []
text.append('The cars names are BMW, BENZ and FORD; continue')
text.append('The phones name are Iphone and Samsung.')
list1 = ['car', 'phone', 'PC']
list2 = [';', '.']

list1 and list2 give six (3 * 2) possible start/end pairs.
The desired result is:

output = ['s names are BMW, BENZ and FORD', 's name are Iphone and Samsung']

You can use itertools.product to generate all possible (start, end) pairs.
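As a quick illustration of what itertools.product yields for the two lists in the question (a minimal sketch; the variable name `pairs` is only for this example):

```python
from itertools import product

list1 = ['car', 'phone', 'PC']
list2 = [';', '.']

# product() pairs every item of list1 with every item of list2,
# always in (list1 item, list2 item) order.
pairs = list(product(list1, list2))
print(len(pairs))  # 6
print(pairs)
# [('car', ';'), ('car', '.'), ('phone', ';'), ('phone', '.'), ('PC', ';'), ('PC', '.')]
```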

We can then build a regex that captures the group between these bounds. The bounds must be escaped, because they might contain characters (like '.') that have special meaning in a regex.
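To see why the escaping matters, here is a small sketch comparing an unescaped and an escaped end delimiter on one of the sentences from the question (the names `unescaped` and `escaped` are only for this example):

```python
import re

sentence = 'The phones name are Iphone and Samsung.'

# Unescaped, '.' means "any character", so the lazy group (.*?) can stop
# right away: the '.' simply consumes the 's' after 'phone'.
unescaped = re.search('phone' + r'(.*?)' + '.', sentence)
print(repr(unescaped.group(1)))  # ''

# Escaped, '.' only matches a literal period, so the group runs up to it.
escaped = re.search(re.escape('phone') + r'(.*?)' + re.escape('.'), sentence)
print(repr(escaped.group(1)))  # 's name are Iphone and Samsung'
```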

We then just have to apply each regex to each sentence of your text.

import re
from itertools import product

text = []
text.append('The cars names are BMW, BENZ and FORD; continue')
text.append('The phones name are Iphone and Samsung.')

list1 = ['car', 'phone', 'PC']
list2 = [';', '.']

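# Every (start, end) pair: 3 * 2 = 6 combinations.
bounds = product(list1, list2)
out = []
for start, end in bounds:
    # Escape both bounds so characters like '.' are matched literally;
    # the lazy group (.*?) stops at the first occurrence of the end bound.
    regex = re.compile(re.escape(start) + r'(.*?)' + re.escape(end))
    for sentence in text:
        m = regex.search(sentence)  # first match only, per sentence
        if m:
            out.append(m.group(1))
print(out)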
# ['s names are BMW, BENZ and FORD', 's name are Iphone and Samsung']
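
A possible variation, not part of the answer above, is to fold all start items and all end items into a single regex with alternation, so each sentence is scanned only once. This is a sketch under the assumption that you want every non-overlapping match per sentence. Note that the results then come out in sentence order rather than in (start, end) pair order, which happens to be the same for this input. The names `starts`, `ends`, `pattern` and `matches` are only for this example.

```python
import re

text = [
    'The cars names are BMW, BENZ and FORD; continue',
    'The phones name are Iphone and Samsung.',
]
list1 = ['car', 'phone', 'PC']
list2 = [';', '.']

# One alternation per list, with every item escaped.
starts = '|'.join(re.escape(s) for s in list1)
ends = '|'.join(re.escape(e) for e in list2)

# (?:...) groups the alternatives without capturing them,
# so group(1) is still the text between the bounds.
pattern = re.compile(f'(?:{starts})(.*?)(?:{ends})')

matches = [m.group(1) for sentence in text for m in pattern.finditer(sentence)]
print(matches)
# ['s names are BMW, BENZ and FORD', 's name are Iphone and Samsung']
```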
