[英]How to remove similar keywords from a list in python?
我正在尝试从关键字列表中删除类似的关键字:
keywords=['ps4 pro deals','ps4 pro deals London']
所以我只需要删除类似的“ps4 pro 交易”。 我尝试使用 Leveshtein 距离进行相似性检查的代码:
similar_tags = []
to_be_removed = []
for word1 in keywords:
for word2 in keywords:
if .5 < Levenshtein.token_sort_ratio(word1, word2)< 1 :
to_be_removed.append(word1)
for word in to_be_removed:
if word in keywords:
keywords.remove(word)
此代码删除了两个关键字而不是相似的关键字。
考虑以下简单示例:
words = ['A','B']
for w1 in words:
for w2 in words:
print(w1,w2)
Output:
A A
A B
B A
B B
请注意,有AB
和BA
。 如果AB
确实符合标准,那么BA
也符合标准(因为输入元素的 Levenshtein 距离顺序无关紧要),首先导致添加A
以删除列表,其次导致添加B
以删除列表,因此A
和B
都被删除。
您可以使用以下构造,其中w2
在words
中始终位于w1
之后:
words = ['A','B','C']
for inx, w1 in enumerate(words, 1):
for w2 in words[inx:]:
print(w1,w2)
Output:
A B
A C
B C
解释:对于 word 中的每一个 w1,我都会对单词进行切片,其中包含超出它的元素。 我使用enumerate
来获取需要跳过多少元素的信息,然后切片以跳过它们。
From https://gist.github.com/alvations/a4a6e0cc24d2fd9aff86 , there is an implementation of "Approximate dictionary matching" algorithm described in http://www.aclweb.org/anthology/C10-1096
Input: V : collection of strings
Input: x: query string
Input: α: threshold for the similarity
Output: Y: list of strings similar to the query
1. X ← string to feature(x);
2. Y ←[];
3. for l ← min y(|X|, α) to max y(|X|, α) do
4. τ ← min overlap(|X|, l, α);
5. R ← overlapjoin(X, τ , V , l);
6. foreach r ∈ R do append r to Y;
7. end
8. return Y;
和代码:
from math import sqrt
from collections import defaultdict
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
def get_V(y, n):
"""
Extract a ngram feature matrix for string comparison for a specified n order.
To extract the vectorized feature matrix, we can use the DIY way of
(i) extracting ngrams, (ii) then creating a numpy array for the matrix,
(iii) then fit new queries to the matrix. But since scikit-learn already
have those functions optimized for speed, we should use them.
"""
ngram_vectorizer = CountVectorizer(analyzer='char', ngram_range=(n-1, n), min_df=1)
V = ngram_vectorizer.fit_transform(y)
return V, ngram_vectorizer
def get_V_by_size(V):
"""
Get an "indexed" matrix V, where the key is the size and the values are the
the indices of V that has the size corresponding to the key.
This is a optimization trick. For this to scale, we might have to query a
proper dataframe/database if our V is very very large, otherwise, if it fits
on RAM, then we can simply prune the sequential queries by restricting the
size.
"""
V_by_size = defaultdict(list)
for i, y_vec in enumerate(V):
size_y = np.sum(y_vec.toarray())
V_by_size[size_y].append(i)
return V_by_size
def min_max_y(size_X, alpha):
"""
Computes the size threshold of Y given the size of X and alpha.
"""
return int(alpha * alpha * size_X), int(size_X / (alpha * alpha))
def overlapjoin(vec_x, tau, sub_V, l):
"""
Find the no. of overlap ngrams between the query *vec_x* and the possible
subset of V.
"""
for i, _y in sub_V:
# Possibly this can be optimized by checking through only the
# non-zeros in the sparse matrix but there might be some caveats
# when synchronizing non-zeros of *vec_x* and *_y*.
num_overlaps = sum([1 for x_fi, y_fi in zip(vec_x, _y)
if x_fi&y_fi > 0 and x_fi == y_fi])
if num_overlaps > tau:
yield i
def approx_dict_matching(x, y, V=None, vectorizer=None, V_by_size=None,
n=3, alpha=0.7):
"""
The "approximate dictionary matching" algorithm as described in
http://www.aclweb.org/anthology/C10-1096
:param x: The query word.
:type x: str
:param y: A list of words in the vocabulary.
:type y: str
:param n: The order of ngrams, e.g. n=3 uses trigrams, n=2 uses bigrams
:type n: int
:param alpha: A parameter to specify length threshold of similar words.
:type alpha: float
"""
# Use for optimizing V such that we only instantiate V once outside of the
# approx_dict_matching() function.
if V == vectorizer == V_by_size == None:
V, vectorizer = get_V(y, n)
# Optimization trick:
# Pre-compute a dictionary to index the size of the non-zero values in V.
V_by_size = get_V_by_size(V)
# Note that sklearn assumes a list of queries, thus the [x] .
# Step 1. X ← string to feature(x);
vec_x = vectorizer.transform([x]).toarray()[0]
# print (V.todense()) # Show feature matrix V in human-readable format.
# print (vectorizer.transform([x]).toarray()[0]) # Show vector for query string, x.
# Find the size of X.
size_X = sum(vec_x)
# Get the min and max size of y, given the alpha tolerance paramter.
min_y, max_y = min_max_y(size_X, alpha)
# Step 2. Y ←[];
output = set()
# Step3. for l ← min y(|X|, α) to max y(|X|, α) do
for l in range(min_y, max_y):
# Step 4: τ ← min overlap(|X|, l, α);
tau = alpha * sqrt(size_X * l)
# A subset of V where the words are of size l.
sub_V_indices = V_by_size.get(l)
if sub_V_indices:
sub_V = [(i, V[i].toarray()[0]) for i in sub_V_indices]
# Step 5: R ← overlapjoin(X, τ , V , l);
R = (list(overlapjoin(vec_x, tau, sub_V, l)))
# Step 6: foreach r ∈ R do append r to Y;
output.update(R)
return set([y[i] for i in output])
并使用,也许是这样的:
kw1 = 'ps4 pro deals'
kw2_list = 'ps4 pro deals London'.split()
print (approx_dict_matching(kw1,kw2,n=3, alpha=0.1))
[出去]:
{'ps4', 'deals', 'pro'}
similar_tags = filter(lambda x: [x for i in keywords if x in i and x < i], keywords)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.