
Double list comprehension for occurrences of a string in a list of strings

I have two lists of lists:

text = [['hello this is me'], ['oh you know u']]
phrases = [['this is', 'u'], ['oh you', 'me']]

我需要拆分文本,使短語中出現的單詞組合成為單個字符串:

result = [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]

I tried using zip(), but it iterates over the lists consecutively, while I need to check every list. I also tried a find() method, but from this example it also finds all the letters 'u' and turns them into strings (so in the word 'you' it becomes 'yo', 'u'). I wish replace() also worked when replacing a string with a list, because it would let me do something like this:

for line in text:
    line = line.replace('this is', ['this is'])

But having tried everything, I still haven't found anything that works in this case. Can you help me with this?
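For illustration, a minimal sketch of why character-level matching fails here: plain string operations such as find() and replace() match the 'u' inside 'you':

>>> 'oh you know u'.find('u')      # finds the 'u' inside 'you', not the standalone word
5
>>> 'oh you know u'.replace('u', '|u|')
'oh yo|u| know |u|'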

Clarified with the original poster:

Given the text 'pack my box with five dozen liquor jugs' and the phrase 'five dozen',

the result should be:

['pack', 'my', 'box', 'with', 'five dozen', 'liquor', 'jugs']

not:

['pack my box with', 'five dozen', 'liquor jugs']

Each text and phrase is converted into a Python list of words such as ['this', 'is', 'an', 'example'], which prevents 'u' being matched inside a word.
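For example, a word-level membership test only matches whole words:

>>> words = 'oh you know u'.split()
>>> words
['oh', 'you', 'know', 'u']
>>> 'u' in words      # the standalone word 'u' matches; the 'u' inside 'you' does not
True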

All possible subphrases of the text are generated by compile_subphrases(). Longer phrases (more words) are generated first so that they are matched before shorter ones: 'five dozen jugs' is always matched in preference to 'five dozen' or 'five'.

Phrases and subphrases are compared using list slicing, roughly like this:

    text = ['five', 'dozen', 'liquor', 'jugs']
    phrase = ['liquor', 'jugs']
    if text[2:4] == phrase:
        print('matched')

Using this method of comparing phrases, the script walks through the original text, rewriting it with the phrases picked out.

texts = [['hello this is me'], ['oh you know u']]
phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
from itertools import chain

def flatten(list_of_lists):
    return list(chain(*list_of_lists))

def compile_subphrases(text, minwords=1, include_self=True):
    words = text.split()
    text_length = len(words)
    max_phrase_length = text_length if include_self else text_length - 1
    # NOTE: longest phrases first
    for phrase_length in range(max_phrase_length + 1, minwords - 1, -1):
        n_length_phrases = (' '.join(words[r:r + phrase_length])
                            for r in range(text_length - phrase_length + 1))
        yield from n_length_phrases
        
def match_sublist(mainlist, sublist, i):
    if i + len(sublist) > len(mainlist):
        return False
    return sublist == mainlist[i:i + len(sublist)]

phrases_to_match = list(flatten(phrases_to_match))
texts = list(flatten(texts))
results = []
for raw_text in texts:
    print(f"Raw text: '{raw_text}'")
    matched_phrases = [
        subphrase.split()
        for subphrase
        in compile_subphrases(raw_text)
        if subphrase in phrases_to_match
    ]
    phrasal_text = []
    index = 0
    text_words = raw_text.split()
    while index < len(text_words):
        for matched_phrase in matched_phrases:
            if match_sublist(text_words, matched_phrase, index):
                phrasal_text.append(' '.join(matched_phrase))
                index += len(matched_phrase)
                break
        else:
            phrasal_text.append(text_words[index])
            index += 1
    results.append(phrasal_text)
print(f'Phrases to match: {phrases_to_match}')
print(f"Results: {results}")

Results:

$python3 main.py
Raw text: 'hello this is me'
Raw text: 'oh you know u'
Phrases to match: ['this is', 'u', 'oh you', 'me']
Results: [['hello', 'this is', 'me'], ['oh you', 'know', 'u']]
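A quick way to see the longest-phrases-first ordering, assuming the compile_subphrases() defined above is in scope:

>>> list(compile_subphrases('pack my box'))
['pack my box', 'pack my', 'my box', 'pack', 'my', 'box']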

To test this answer and the others against a larger dataset, try this at the start of the code. It generates variations of a single long sentence and samples 500 of them to simulate 500 texts.

from itertools import chain, combinations
import random

#texts = [['hello this is me'], ['oh you know u']]
theme = ' '.join([
    'pack my box with five dozen liquor jugs said',
    'the quick brown fox as he jumped over the lazy dog'
])
variations = list([
    ' '.join(combination)
    for combination
    in combinations(theme.split(), 5)
])
texts = random.choices(variations, k=500)
#phrases_to_match = [['this is', 'u'], ['oh you', 'me']]
phrases_to_match = [
    ['pack my box', 'quick brown', 'the quick', 'brown fox'],
    ['jumped over', 'lazy dog'],
    ['five dozen', 'liquor', 'jugs']
]

Try this.

import re

def filter_phrases(phrases):
    phrase_l = sorted(phrases, key=len)
    
    for i, v in enumerate(phrase_l):
        for j in phrase_l[i + 1:]:
            if re.search(rf'\b{v}\b', j):
                phrases.remove(v)
                break
    
    return phrases


text = [
    ['hello this is me'], 
    ['oh you know u'],
    ['a quick brown fox jumps over the lazy dog']
]
phrases = [
    ['this is', 'u'], 
    ['oh you', 'me'],
    ['fox', 'brown fox']
]

# Flatten the `text` and `phrases` list
text = [
    line for l in text 
    for line in l
]
phrases = {
    phrase for l in phrases 
    for phrase in l
}

# If you're quite sure that your phrase
# list doesn't have any overlapping 
# zones, then I strongly recommend 
# against using this `filter_phrases()` 
# function.
phrases = filter_phrases(phrases)

result = []

for line in text:
    # This is the pattern to match the
    # 'space' before the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_1 = '|'.join([
        f'(?={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    # This is the pattern to match the
    # 'space' after the phrases 
    # in the line on which the split
    # is to be done.
    l_phrase_2 = '|'.join([
        f'(?<={phrase})' for phrase in phrases
        if re.search(rf'\b{phrase}\b', line)
    ])
    
    # Now, we combine the both patterns
    # `l_phrase_1` and `l_phrase_2` to
    # create our master regex. 
    result.append(re.split(
        rf'\s(?:{l_phrase_1})|(?:{l_phrase_2})\s', 
        line
    ))
    
print(result)

# OUTPUT (PRETTY FORM)
#
# [
#     ['hello', 'this is', 'me'], 
#     ['oh you', 'know', 'u'], 
#     ['a quick', 'brown fox', 'jumps over the lazy dog']
# ]

Here, I'm using re.split to split the line on the space before or after the phrases found in the text.
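A minimal sketch of that idea for a single phrase: a lookahead consumes the space before the phrase and a lookbehind consumes the space after it, so the phrase itself survives the split intact:

>>> import re
>>> re.split(r'\s(?=this is)|(?<=this is)\s', 'hello this is me')
['hello', 'this is', 'me']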

This uses Python's first-class list slicing. phrase[::2] creates a list slice made up of the 0th, 2nd, 4th, 6th... elements of a list. This is the basis of the following solution.

For each phrase, a | symbol is placed on either side of the found phrase. The following shows 'this is' being marked in 'hello this is me':

'hello this is me' -> 'hello|this is|me'

When the text is split on |:

['hello', 'this is', 'me']

The even-numbered elements [::2] are the unmatched text and the odd-numbered elements [1::2] are the matched phrases:

                   0         1       2
unmatched:     ['hello',            'me']
matched:                 'this is',       
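In code, the even/odd split looks like this:

>>> segments = 'hello|this is|me'.split('|')
>>> segments[::2]     # unmatched text
['hello', 'me']
>>> segments[1::2]    # matched phrases
['this is']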

If there are different numbers of matched and unmatched elements in the segments, zip_longest is used to fill the gaps with empty strings, so that there is always a balanced pair of unmatched and matched text:

                   0         1       2     3
unmatched:     ['hello',            'me',     ]
matched:                 'this is',        ''  
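For example, zip_longest pads the shorter side with empty strings:

>>> from itertools import zip_longest
>>> list(zip_longest(['hello', 'me'], ['this is'], fillvalue=''))
[('hello', 'this is'), ('me', '')]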

For each phrase, the previously unmatched (even-numbered) elements of the text are scanned, the phrase (if found) is delimited with |, and the result is merged back into the segmented text.

The matched and unmatched segments are merged back into the segmented text using zip() followed by flatten(), taking care to maintain the even (unmatched) and odd (matched) indexing of the new and existing text segments. The newly matched phrases are merged back in as odd-numbered elements, so they are never scanned again for embedded phrases. This prevents clashes between phrases with similar wording, like 'this is' and 'this'.

flatten() is used throughout. It finds sublists embedded in a larger list and flattens their contents into the main list:

['outer list 1', ['inner list 1', 'inner list 2'], 'outer list 2']

becomes:

['outer list 1', 'inner list 1', 'inner list 2', 'outer list 2']

This is useful for collecting phrases from multiple embedded lists, and for merging split or zipped sublists back into the segmented text:

[['the quick brown fox says', ''], ['hello', 'this is', 'me', '']] ->

['the quick brown fox says', '', 'hello', 'this is', 'me', ''] ->

                   0                        1       2        3          4     5
unmatched:     ['the quick brown fox says',         'hello',            'me',    ]
matched:                                    '',              'this is',       '',

Finally, the empty-string elements that exist only for even-odd alignment can be removed:

['the quick brown fox says', '', 'hello', 'this is', '', 'me', ''] ->
['the quick brown fox says', 'hello', 'this is', 'me']

texts = [['hello this is me'], ['oh you know u'],
         ['the quick brown fox says hello this is me']]
phrases_to_match = [['this is', 'u'], ['oh you', 'you', 'me']]
from itertools import zip_longest

def flatten(string_list):
    flat = []
    for el in string_list:
        if isinstance(el, list) or isinstance(el, tuple):
            flat.extend(el)
        else:
            flat.append(el)
    return flat

phrases_to_match = flatten(phrases_to_match)
# longer phrases are given priority to avoid problems with overlapping
phrases_to_match.sort(key=lambda phrase: -len(phrase.split()))
segmented_texts = []
for text in flatten(texts):
    segmented_text = text.split('|')
    for phrase in phrases_to_match:
        new_segments = segmented_text[::2]
        delimited_phrase = f'|{phrase}|'
        for match in [f' {phrase} ', f' {phrase}', f'{phrase} ']:
            new_segments = [
                segment.replace(match, delimited_phrase)
                for segment
                in new_segments
            ]
        new_segments = flatten([segment.split('|') for segment in new_segments])
        segmented_text = new_segments if len(segmented_text) == 1 else \
            flatten(zip_longest(new_segments, segmented_text[1::2], fillvalue=''))
    segmented_text = [segment for segment in segmented_text if segment.strip()]
    # option 1: unmatched text is split into words
    segmented_text = flatten([
        segment if segment in phrases_to_match else segment.split()
        for segment
        in segmented_text
    ])
    segmented_texts.append(segmented_text)
print(segmented_texts)

Results:

[['hello', 'this is', 'me'], ['oh you', 'know', 'u'],
 ['the', 'quick', 'brown', 'fox', 'says', 'hello', 'this is', 'me']]

Note that the phrase 'oh you' takes priority over its sub-phrase 'you', and there is no clash.
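That priority comes from sorting the flattened phrases by word count, longest first; sorted() and list.sort() are stable, so ties keep their original order:

>>> phrases = ['this is', 'u', 'oh you', 'you', 'me']
>>> sorted(phrases, key=lambda phrase: -len(phrase.split()))
['this is', 'oh you', 'u', 'you', 'me']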

Here is a quasi-complete answer. Something to get you started:

Assumption: looking at your example, I can't see why the phrases have to stay grouped in their sub-lists, since your second text gets split on the 'u' that appears in the first list item of 'phrases'.

Preparation:

Flatten the 'list-of-lists' of phrases into a single list. I've seen an example of this elsewhere:

flatten = lambda t: [item for sublist in t for item in sublist if item != '']

Main code:

My strategy is to look at each item in the text list (to begin with it is just one item) and try to split it on one of the phrases. If a split is found, a change has happened (which I track with a flag), so I replace that item with its split counterpart and then flatten (so it is all one list). Then, if a change happened, the loop starts over from the beginning (it restarts because there is no way to tell whether something later in the 'phrases' list could also have been split earlier).

import re

flatten = lambda t: [item for sublist in t for item in sublist if item != '']

text = [['hello this is me'], ['oh you know u']]
phrases = ['this is', 'u', 'oh you', 'me']

output = []
for t in text:
    t_copy = list(t)
    changed = True
    while changed:  # rescan from the start until a full pass makes no change
        changed = False
        for i, tc in enumerate(t_copy):
            if tc in phrases:  # already an extracted phrase, don't split it again
                continue
            for p in phrases:
                # each item is a string and re.split returns a list,
                # so wrap the string in a list to "compare apples to apples"
                before = [tc]
                found = re.split(f'({p})', tc)
                found = [f.strip() for f in found]
                if found != before:
                    t_copy[i] = found
                    t_copy = flatten(t_copy)  # flatten to avoid nested lists (also drops empty entries)
                    changed = True
                    break
            if changed:
                break
    output.append(t_copy)
print(output)

Notes:

I modified the flatten function to remove empty entries. I noticed that if you split on something that occurs at an endpoint, an empty entry gets added (splitting 'I love u' on 'u' gives ['I love', 'u', '']).
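For example, the raw re.split() output before any stripping:

>>> import re
>>> re.split('(u)', 'I love u')
['I love ', 'u', '']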
