
Double list comprehension for occurrences of a string in a list of strings

I have two lists of lists:

text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]

I need to split the text, making word combinations present in phrases a single string:

result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

I tried using zip() but it iterates through the lists consecutively, while I need to check each and every list. I also tried the find() method, but in this example it would also find every letter 'u' and make it a string (for the word 'you' it yields 'yo', 'u'). I wish replace() worked when replacing a string with a list too, because it would let me do something like:

for line in text:
        line = line.replace('this is', ['this is'])

But after trying everything, I still haven't found anything that works for me in this situation. Can you help me with that?

Clarified with original poster:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen',

the result should be:

['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']

not:

['pack my box with', 'five dozen', 'liquor jugs']

Each text and phrase is converted to a Python list of words, ['this', 'is', 'an', 'example'], which prevents 'u' being matched inside a word.
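
A minimal sketch of why whole-word tokenizing matters here:

```python
text = 'oh you know u'
print('u' in text)     # substring search: True, it also matches inside 'you'
words = text.split()   # ['oh', 'you', 'know', 'u']
print('u' in words)    # whole-word check: True only for the standalone word 'u'
print('yo' in words)   # False; partial words never match
```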

All possible subphrases of the text are generated by compile_subphrases(). Longer phrases (more words) are generated first so they are matched before shorter ones: 'five dozen jugs' would always be matched in preference to 'five dozen' or 'five'.

Phrase and subphrase are compared using list slices, roughly like this:

    text = ['five', 'dozen', 'liquor', 'jugs']
    phrase = ['liquor', 'jugs']
    if text[2:4] == phrase:
        print('matched')

Using this method for comparing phrases, the script walks through the original text, rewriting it with the phrases picked out.

from itertools import chain

texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Results:

$ python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

For testing this and other answers with larger datasets, try this at the start of the code. It generates hundreds of variations of a single long sentence to simulate hundreds of texts.

from itertools import chain, combinations
import random

#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
])
variations = [
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
]
texts = random.choices(variations, k=500)
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]

Try this out.

import re

def filter_phrases(phrases):
    # Drop any phrase that appears (as whole words) inside a longer
    # phrase, so only the longest of the overlapping phrases survives.
    phrase_l = sorted(phrases, key=len)

    for i, v in enumerate(phrase_l):
        for j in phrase_l[i + 1:]:
            if re.search(rf'\b{re.escape(v)}\b', j):
                phrases.remove(v)
                break  # v is already removed; don't remove it twice

    return phrases


text = [
    ['hello this is me'], 
    ['oh you know u'],
    ['a quick brown fox jumps over the lazy dog']
]
phrases = [
    ['this is', 'u'], 
    ['oh you', 'me'],
    ['fox', 'brown fox']
]

# Flatten the `text` and `phrases` list
text = [
    line for l in text 
    for line in l
]
phrases = {
    phrase for l in phrases 
    for phrase in l
}

# If you're quite sure that your phrase
# list doesn't have any overlapping 
# zones, then I strongly recommend 
# against using this `filter_phrases()` 
# function.
phrases = filter_phrases(phrases)

result = []

for line in text:
    # This is the pattern to match the
    # 'space' before the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_1 = '|'.join([
        f'(?={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    # This is the pattern to match the
    # 'space' after the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_2 = '|'.join([
        f'(?<={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    
    # Now, we combine the both patterns
    # `l_phrase_1` and `l_phrase_2` to
    # create our master regex. 
    result.append(re.split(
        rf'\s(?:{l_phrase_1})|(?:{l_phrase_2})\s', 
        line
    ))
    
print(result)

# OUTPUT (PRETTY FORM)
#
# [
#     ['hello', 'this is', 'me'], 
#     ['oh you', 'know', 'u'], 
#     ['a quick', 'brown fox', 'jumps over the lazy dog']
# ]

Here, I've used re.split to split before or after each phrase in the text.
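
In isolation, that lookaround split behaves roughly like this (a minimal sketch using one hard-coded phrase):

```python
import re

line = 'hello this is me'
phrase = 'this is'
# Lookahead (?=...) matches the space *before* the phrase;
# lookbehind (?<=...) matches the space *after* it. The phrase
# itself is not consumed, so it survives the split.
parts = re.split(rf'\s(?={phrase})|(?<={phrase})\s', line)
print(parts)   # ['hello', 'this is', 'me']
```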

This uses Python's best-in-class list slicing. phrase[::2] creates a list slice consisting of the 0th, 2nd, 4th, 6th... elements of a list. This is the basis of the following solution.

For each phrase, a | symbol is put on either side of found phrases. The following shows 'this is' being marked in 'hello this is me':

'hello this is me' -> 'hello|this is|me'

When the text is split on |:

['hello', 'this is', 'me']

the even-numbered elements [::2] are non-matches, the odd elements [1::2] are the matched phrases:

                   0         1       2
unmatched:     ['hello',            'me']
matched:                 'this is',       
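
A minimal sketch of that even/odd slicing:

```python
segmented = 'hello|this is|me'.split('|')
unmatched = segmented[::2]    # even indexes: the non-matched text
matched = segmented[1::2]     # odd indexes: the matched phrases
print(unmatched)              # ['hello', 'me']
print(matched)                # ['this is']
```
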

If there are different numbers of matched and unmatched elements in the segment, the gaps are filled with empty strings using zip_longest so that there is always a balanced pair of unmatched and matched text:

                   0         1       2     3
unmatched:     ['hello',            'me',     ]
matched:                 'this is',        ''  
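
Sketched with the example above, the padding and re-merge look like this:

```python
from itertools import zip_longest

unmatched = ['hello', 'me']
matched = ['this is']
# zip_longest pads the shorter list with '' so every unmatched
# segment pairs with a matched one; flattening the pairs restores
# the alternating even/odd layout.
pairs = zip_longest(unmatched, matched, fillvalue='')
merged = [segment for pair in pairs for segment in pair]
print(merged)   # ['hello', 'this is', 'me', '']
```
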

For each phrase, the previously unmatched (even-numbered) elements of the text are scanned; the phrase (if found) is delimited with |, and the results are merged back into the segmented text.

The matched and unmatched segments are merged back into the segmented text using zip() followed by flatten(), taking care to maintain the even (unmatched) and odd (matched) indexes of new and existing text segments. The newly-matched phrases are merged back in as odd-numbered elements, so they will not be scanned again for embedded phrases. This prevents conflict between phrases with similar wording like "this is" and "this".

flatten() is used everywhere. It finds sub-lists embedded in a larger list and flattens their contents down into the main list:

['outer list 1', ['inner list 1', 'inner list 2'], 'outer list 2']

becomes:

['outer list 1', 'inner list 1', 'inner list 2', 'outer list 2']
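
A self-contained sketch of that one-level flattening (mirroring the flatten() helper the full script defines):

```python
def flatten(list_of_lists):
    # One-level flatten: splice embedded lists/tuples into the main
    # list, leaving plain strings untouched.
    flat = []
    for el in list_of_lists:
        if isinstance(el, (list, tuple)):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

print(flatten(['outer list 1', ['inner list 1', 'inner list 2'], 'outer list 2']))
# ['outer list 1', 'inner list 1', 'inner list 2', 'outer list 2']
```
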

This is useful for collecting phrases from multiple embedded lists, as well as merging split or zipped sublists back into the segmented text:

[['the quick brown fox says', ''], ['hello', 'this is', 'me', '']] ->

['the quick brown fox says', '', 'hello', 'this is', 'me', ''] ->

                   0                        1       2        3          4     5
unmatched:     ['the quick brown fox says',         'hello',            'me',    ]
matched:                                    '',              'this is',       '',

At the very end, the elements that are empty strings, which were there just for even-odd alignment, can be removed:

['the quick brown fox says', '', 'hello', 'this is', '', 'me', ''] ->
['the quick brown fox says', 'hello', 'this is', 'me']

from itertools import zip_longest

texts = [['hello this is me'], ['oh you know u'],
         ['the quick brown fox says hello this is me']]
phrases_to_match = [['this is', 'u'], ['oh you', 'you', 'me']]

def flatten(string_list):
    flat = []
    for el in string_list:
        if isinstance(el, (list, tuple)):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

phrases_to_match = flatten(phrases_to_match)
# longer phrases are given priority to avoid problems with overlapping
phrases_to_match.sort(key=lambda phrase: -len(phrase.split()))
segmented_texts = []
for text in flatten(texts):
    segmented_text = text.split('|')
    for phrase in phrases_to_match:
        new_segments = segmented_text[::2]
        delimited_phrase = f'|{phrase}|'
        for match in [f' {phrase} ', f' {phrase}', f'{phrase} ']:
            new_segments = [
                segment.replace(match, delimited_phrase)
                for segment
                in new_segments
            ]
        new_segments = flatten([segment.split('|') for segment in new_segments])
        segmented_text = new_segments if len(segmented_text) == 1 else \
            flatten(zip_longest(new_segments, segmented_text[1::2], fillvalue=''))
    segmented_text = [segment for segment in segmented_text if segment.strip()]
    # option 1: unmatched text is split into words
    segmented_text = flatten([
        segment if segment in phrases_to_match else segment.split()
        for segment
        in segmented_text
    ])
    segmented_texts.append(segmented_text)
print(segmented_texts)

Results:

[['hello', 'this is', 'me'], ['oh you', 'know', 'u'],
 ['the', 'quick', 'brown', 'fox', 'says', 'hello', 'this is', 'me']]

Notice that the phrase 'oh you' has taken precedence over the subset phrase 'you' and there is no conflict.

This is a quasi-complete answer. Something to get you started:

ASSUMPTIONS: looking at your example, I see no reason why the phrases must remain split into sublists, since your 2nd text splits on 'u', which appears in the first sublist of phrases.

Prep

Flatten the phrases "list-of-lists" into a single list. I've seen this elsewhere; an example:

flatten = lambda t: [item for sublist in t for item in sublist if item != '']

main code:

My strategy is to look at each item in the texts list (at the beginning it will just be a single item) and attempt to split it on a phrase in phrases. If a split is found, a change occurs (which I mark with a flag to keep track); I substitute the split-up counterpart for that item, then flatten (so it's all a single list again). Then, if a change occurred, I start looping over from the beginning (starting over because there's no way to tell whether something later in the phrases list could also be split earlier).

import re

# flatten also drops the empty strings that re.split leaves at endpoints
flatten = lambda t: [item for sublist in t for item in sublist if item != '']

text = [['hello this is me'], ['oh you know u']]
phrases = ['this is', 'u', 'oh you', 'me']

output = []
for t in text:
    t_copy = t[:]
    changed = True
    while changed:  # restart from the beginning whenever a split occurred
        changed = False
        for i, tc in enumerate(t_copy):
            if tc in phrases:
                continue  # already an extracted phrase; don't split it again
            for p in phrases:
                # \b...\b keeps 'u' from splitting inside 'you'
                found = [f.strip() for f in re.split(rf'\b({re.escape(p)})\b', tc)
                         if f.strip()]
                if found != [tc]:
                    t_copy[i] = found
                    t_copy = flatten(t_copy)
                    changed = True
                    break
            if changed:
                break
    output.append(t_copy)
print(output)

comments:

I modified the flatten function to remove empty entries. I noticed that if you split on something that occurs at an endpoint, an empty entry is added: ('I love u' split on 'u' gives ['I love ', 'u', '']).
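
A quick demonstration of that endpoint behaviour and the cleanup:

```python
import re

# Splitting on a capturing group keeps the delimiter; a match at the
# end of the string leaves a trailing empty entry.
parts = re.split('(u)', 'I love u')
print(parts)        # ['I love ', 'u', '']

# Stripping and filtering removes the padding and the empty entry.
cleaned = [p.strip() for p in parts if p.strip()]
print(cleaned)      # ['I love', 'u']
```
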
