
python - in a list of strings, find all patterns with a minimum of n consecutive tokens that occur in at least y entries

I am trying to write a function that, given a list of strings, identifies all patterns of at least n consecutive tokens that occur in at least y entries.

For example:

list = ["Hello my name is foobar","Hello my favorite food is pizza","Hello my favorite food will never be broccoli","No my name is not barfoo", "Yes my name is foobar"]

Then

function(list, n=3, y=3)
["my name is"]

function(list, n=3, y=2)
["my name is", "my favorite food"]

I would like to use this function with extremely large lists. I was planning to do this the brute-force way with multiple nested loops, but that would be extremely slow. Are there more efficient ways to do this kind of task?

Here's a quick function to do this. Each sentence is broken into n-grams of n_tokens tokens. Wrapping set() around the ngrams ensures that only distinct n-grams are kept, so an n-gram that occurs several times in one sentence is not double-counted later. itertools chains the word_grams from all sentences together, and Counter counts the number of sentences in which each n-gram occurs. Finally, the counts in gram_occur are checked against min_occur, and the n-grams that meet the threshold are returned as a list of strings.

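from collections import Counter
import itertools

from nltk import ngrams


def count_ngrams(l, n_tokens, min_occur):
    # One set of distinct n-grams per sentence, so repeats within a sentence count once.
    word_grams = [set(ngrams(s.split(), n_tokens)) for s in l]

    # Number of sentences in which each n-gram appears.
    gram_occur = Counter(itertools.chain.from_iterable(word_grams))

    # Keep the n-grams found in at least min_occur sentences.
    return [" ".join(words) for words, count in gram_occur.items() if count >= min_occur]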
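If installing nltk is not an option, the same idea can be sketched with the standard library alone. This is a minimal sketch, not the answer's own code: the names word_ngrams and count_ngrams_stdlib are hypothetical, and tokens are assumed to be whitespace-separated words, as in the function above.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return an iterator over every run of n consecutive tokens, as tuples."""
    # zip() stops at the shortest slice, so a sentence with fewer than n tokens yields nothing.
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams_stdlib(sentences, n_tokens, min_occur):
    """Hypothetical stdlib-only variant: n-grams present in at least min_occur sentences."""
    gram_occur = Counter()
    for sentence in sentences:
        # set() makes an n-gram repeated inside one sentence count only once.
        gram_occur.update(set(word_ngrams(sentence.split(), n_tokens)))
    return [" ".join(gram) for gram, count in gram_occur.items() if count >= min_occur]


sentences = [
    "Hello my name is foobar",
    "Hello my favorite food is pizza",
    "Hello my favorite food will never be broccoli",
    "No my name is not barfoo",
    "Yes my name is foobar",
]

print(count_ngrams_stdlib(sentences, 3, 3))
# ['my name is']
print(sorted(count_ngrams_stdlib(sentences, 3, 2)))
# ['Hello my favorite', 'my favorite food', 'my name is', 'name is foobar']
```

Note that this approach counts n-grams of exactly n_tokens tokens, so with a threshold of 2 it also reports "Hello my favorite" and "name is foobar", which the question's expected output does not list. Longer shared patterns would need a separate pass for each larger n.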

