Find matching phrases and words in a string python
Using python, what would be the most efficient way to extract common phrases or words from two given strings?
For example,
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
Would return:
["a","time","there","was a very","called Jack"]
How would one go about doing this efficiently? (In my case I would need to do this over thousands of 1000-word documents.)
You can split each string, then intersect the resulting sets:
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))
Result:
{'a', 'very', 'Jack', 'time', 'was', 'called'}
Note this only matches individual words. You have to be more specific about what you would consider a "phrase". Longest consecutive matching substring? That could get more complicated.
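If "phrase" does mean the longest consecutive matching runs of words, the standard library's difflib can already do this at the word level. A minimal sketch (the helper name common_word_runs is mine, not from the question):

```python
from difflib import SequenceMatcher

def common_word_runs(a, b):
    """Return the maximal runs of consecutive words shared by a and b."""
    wa, wb = a.split(), b.split()
    sm = SequenceMatcher(None, wa, wb)
    # get_matching_blocks() ends with a dummy block of size 0, hence the filter
    return [" ".join(wa[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size > 0]

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(common_word_runs(string1, string2))
# ['a', 'time', 'was a very', 'called Jack']
```

SequenceMatcher greedily takes the longest matching block first and recurses on both sides, so shorter runs like "a" and "time" come out alongside "was a very" and "called Jack".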
In natural language processing, you usually extract common patterns and sequences from sentences using n-grams. In Python, you can use the excellent NLTK module for that.
For counting and finding the most common, you can use collections.Counter.
Here's an example for 2-grams:
from nltk.util import ngrams
from collections import Counter
from itertools import chain
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
n = 2
ngrams1 = ngrams(string1.split(" "), n)
ngrams2 = ngrams(string2.split(" "), n)
counter = Counter(chain(ngrams1, ngrams2))  # count occurrences of each n-gram
print([k for k, v in counter.items() if v > 1])  # print all n-grams that come up more than once
Output:
[('called', 'Jack'), ('was', 'a'), ('a', 'very')]
Output with n=3:
[('was', 'a', 'very')]
Output with n=1 (printing k[0] instead of k, to get bare words rather than 1-tuples):
['Jack', 'a', 'was', 'time', 'called', 'very']
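If you would rather avoid the NLTK dependency, the same idea works with a plain zip-based n-gram helper. A sketch under that assumption (the function names are mine, not from the answer), collecting the phrases of every length up to max_n that occur in both strings:

```python
def ngrams(words, n):
    """Yield consecutive n-word tuples from a list of words."""
    return zip(*(words[i:] for i in range(n)))

def shared_phrases(s1, s2, max_n=3):
    """Return every phrase of 1..max_n words that occurs in both strings."""
    w1, w2 = s1.split(), s2.split()
    result = []
    for n in range(1, max_n + 1):
        # n-grams present in both strings, regardless of how often
        common = set(ngrams(w1, n)) & set(ngrams(w2, n))
        result.extend(" ".join(g) for g in common)
    return result

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(sorted(shared_phrases(string1, string2)))
```

Intersecting per-string n-gram sets also avoids a subtle issue with the Counter approach: a phrase repeated twice within one string would pass the v > 1 test without appearing in the other string at all.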
This is a classic dynamic programming problem. All you need to do is build a suffix tree for string1, with words instead of letters (which is the usual formulation). Here is an illustrative example of a suffix tree.
Build the suffix tree for s1, then insert all suffixes of string2 into it one by one, marking the nodes that the suffixes of s2 reach. The path label of every node shared by s1 and s2 is then a common substring. This algorithm is succinctly explained in this lecture note.
For two strings of lengths n and m, the suffix tree construction takes O(max(n,m)), and all the matching substrings (in your case, words or phrases) can be found in O(#matches).
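If a full suffix tree feels heavyweight, the dynamic-programming recurrence this answer alludes to can also be written directly as an O(n*m) table over words. A minimal sketch (the helper name and the maximality rule are my own choices, not from the answer):

```python
def maximal_common_runs(s1, s2):
    """Return the maximal runs of consecutive words common to s1 and s2."""
    a, b = s1.split(), s2.split()
    n, m = len(a), len(b)
    # dp[i][j] = length of the common word run ending at a[i-1] and b[j-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    runs = set()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = dp[i][j]
            # keep a run only if it cannot be extended one more word to the right
            if k and (i == n or j == m or dp[i + 1][j + 1] == 0):
                runs.add(" ".join(a[i - k:i]))
    return runs

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(maximal_common_runs(string1, string2))
# the set {'a', 'time', 'a very', 'was a very', 'called Jack'} (order may vary)
```

This is quadratic rather than linear, so for thousands of 1000-word documents the suffix-tree route scales better, but the table version is far easier to get right.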
A couple of years later, but I tried it this way using Counter:
Input:
from collections import Counter
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
string1 += ' ' + string2
string1 = string1.split()
count = Counter(string1)
tag_count = []
for n, c in count.most_common(10):
dics = {'tag': n, 'count': c}
tag_count.append(dics)
Output:
[{'tag': 'a', 'count': 4},
{'tag': 'very', 'count': 3},
{'tag': 'time', 'count': 2},
{'tag': 'was', 'count': 2},
{'tag': 'called', 'count': 2},
{'tag': 'Jack', 'count': 2},
{'tag': 'once', 'count': 1},
{'tag': 'upon', 'count': 1},
{'tag': 'there', 'count': 1},
{'tag': 'large', 'count': 1}]
Hopefully, it would be useful for someone :)