Find matching phrases and words in a string python
Using python, what would be the most efficient way to extract common phrases or words from two given strings?
For example,
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
Would return:
["a","time","there","was a very","called Jack"]
How would one go about doing this efficiently? (In my case I would need to do this over thousands of 1000-word documents.)
You can split each string, then intersect the resulting sets:
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
set(string1.split()).intersection(set(string2.split()))
Result:
{'a', 'very', 'Jack', 'time', 'was', 'called'}
Note this only matches individual words. You have to be more specific about what you would consider a "phrase". Longest consecutive matching substring? That could get more complicated.
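If "phrase" does mean the longest consecutive matching runs of words, the standard library's difflib can already do this at the word level. A minimal sketch (the helper name common_word_runs is mine, not from the question):

```python
from difflib import SequenceMatcher

def common_word_runs(a, b):
    """Return the maximal runs of consecutive words shared by a and b."""
    wa, wb = a.split(), b.split()
    sm = SequenceMatcher(None, wa, wb)
    # get_matching_blocks() ends with a dummy block of size 0, hence the filter
    return [" ".join(wa[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size > 0]

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(common_word_runs(string1, string2))
# ['a', 'time', 'was a very', 'called Jack']
```

SequenceMatcher greedily takes the longest matching block first and recurses on both sides, so shorter runs like "a" and "time" come out alongside "was a very" and "called Jack".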
In natural language processing, you usually extract common patterns and sequences from sentences using n-grams. In Python, you can use the excellent NLTK module for that.
For counting and finding the most common, you can use collections.Counter.
Here's an example for 2-grams:
from nltk.util import ngrams
from collections import Counter
from itertools import chain
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
n = 2
ngrams1 = ngrams(string1.split(" "), n)
ngrams2 = ngrams(string2.split(" "), n)
counter = Counter(chain(ngrams1, ngrams2))  # count occurrences of each n-gram
print([k for k, v in counter.items() if v > 1])  # print all n-grams that come up more than once
Output:
[('called', 'Jack'), ('was', 'a'), ('a', 'very')]
Output with n=3:
[('was', 'a', 'very')]
Output with n=1 (printing k[0] instead of k, to get bare words rather than 1-tuples):
['Jack', 'a', 'was', 'time', 'called', 'very']
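If you would rather avoid the NLTK dependency, the same idea works with a plain zip-based n-gram helper. A sketch under that assumption (the function names are mine, not from the answer), collecting the phrases of every length up to max_n that occur in both strings:

```python
def ngrams(words, n):
    """Yield consecutive n-word tuples from a list of words."""
    return zip(*(words[i:] for i in range(n)))

def shared_phrases(s1, s2, max_n=3):
    """Return every phrase of 1..max_n words that occurs in both strings."""
    w1, w2 = s1.split(), s2.split()
    result = []
    for n in range(1, max_n + 1):
        # n-grams present in both strings, regardless of how often
        common = set(ngrams(w1, n)) & set(ngrams(w2, n))
        result.extend(" ".join(g) for g in common)
    return result

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(sorted(shared_phrases(string1, string2)))
```

Intersecting per-string n-gram sets also avoids a subtle issue with the Counter approach: a phrase repeated twice within one string would pass the v > 1 test without appearing in the other string at all.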
This is a classic dynamic programming problem. All you need to do is build a suffix tree for string1, with words instead of letters (which is the usual formulation). Here is an illustrative example of a suffix tree.
Build the suffix tree for s1, then insert all suffixes of string2 into it one by one, marking the nodes that the suffixes of s2 reach. The path label of every node shared by s1 and s2 is then a common substring. This algorithm is succinctly explained in this lecture note.
For two strings of lengths n and m, the suffix tree construction takes O(max(n,m)), and all the matching substrings (in your case, words or phrases) can be found in O(#matches).
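If a full suffix tree feels heavyweight, the dynamic-programming recurrence this answer alludes to can also be written directly as an O(n*m) table over words. A minimal sketch (the helper name and the maximality rule are my own choices, not from the answer):

```python
def maximal_common_runs(s1, s2):
    """Return the maximal runs of consecutive words common to s1 and s2."""
    a, b = s1.split(), s2.split()
    n, m = len(a), len(b)
    # dp[i][j] = length of the common word run ending at a[i-1] and b[j-1]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
    runs = set()
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k = dp[i][j]
            # keep a run only if it cannot be extended one more word to the right
            if k and (i == n or j == m or dp[i + 1][j + 1] == 0):
                runs.add(" ".join(a[i - k:i]))
    return runs

string1 = "once upon a time there was a very large giant called Jack"
string2 = "a very long time ago was a very brave young man called Jack"
print(maximal_common_runs(string1, string2))
# the set {'a', 'time', 'a very', 'was a very', 'called Jack'} (order may vary)
```

This is quadratic rather than linear, so for thousands of 1000-word documents the suffix-tree route scales better, but the table version is far easier to get right.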
A couple of years later, but I tried it this way using Counter:
Input:
from collections import Counter
string1="once upon a time there was a very large giant called Jack"
string2="a very long time ago was a very brave young man called Jack"
string1 += ' ' + string2
string1 = string1.split()
count = Counter(string1)
tag_count = []
for n, c in count.most_common(10):
dics = {'tag': n, 'count': c}
tag_count.append(dics)
Output:
[{'tag': 'a', 'count': 4},
{'tag': 'very', 'count': 3},
{'tag': 'time', 'count': 2},
{'tag': 'was', 'count': 2},
{'tag': 'called', 'count': 2},
{'tag': 'Jack', 'count': 2},
{'tag': 'once', 'count': 1},
{'tag': 'upon', 'count': 1},
{'tag': 'there', 'count': 1},
{'tag': 'large', 'count': 1}]
Hopefully, it would be useful for someone :)