
Latent Semantic Analysis results

I'm following a tutorial for LSA and having switched the example to a different list of strings, I'm not sure the code is working as expected.

When I use the example input given in the tutorial, it produces sensible answers. However, when I use my own inputs, I get very strange results.

For comparison, here are the results for the example input:

[screenshot: similarity results for the tutorial's example input]

When I use my own examples, this is the result. It's also worth noting that I don't seem to be getting consistent results:

[screenshots: two runs on my own inputs, showing inconsistent results]

Any help in figuring out why I'm getting these results would be greatly appreciated :)

Here's the code:

# Import all of the scikit-learn stuff
from __future__ import print_function
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer
from sklearn import metrics
from sklearn.cluster import KMeans, MiniBatchKMeans
import pandas as pd
import numpy as np
import warnings
# Suppress deprecation warnings from the pandas library
warnings.filterwarnings("ignore", category=DeprecationWarning,
                        module="pandas", lineno=570)


example = ["Coffee brewed by expressing or forcing a small amount of 
nearly boiling water under pressure through finely ground coffee 
beans.", 
"An espresso-based coffee drink consisting of espresso with 
microfoam (steamed milk with small, fine bubbles with a glossy or 
velvety consistency)", 
"American fast-food dish, consisting of french fries covered in 
cheese with the possible addition of various other toppings", 
"Pounded and breaded chicken is topped with sweet honey, salty 
dill pickles, and vinegar-y iceberg slaw, then served upon crispy 
challah toast.", 
"A layered, flaky texture, similar to a puff pastry."]

'''
example = ["Machine learning is super fun",
"Python is super, super cool",
"Statistics is cool, too",
"Data science is fun",
"Python is great for machine learning",
"I like football",
"Football is great to watch"]
'''

vectorizer = CountVectorizer(min_df = 1, stop_words = 'english')
dtm = vectorizer.fit_transform(example)
pd.DataFrame(dtm.toarray(),index=example,columns=vectorizer.get_feature_names()).head(10)

# Get words that correspond to each column
vectorizer.get_feature_names()

# Fit LSA. Use algorithm='randomized' for large datasets
lsa = TruncatedSVD(2, algorithm = 'arpack')
dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)

pd.DataFrame(lsa.components_,index = ["component_1","component_2"],columns = vectorizer.get_feature_names())

pd.DataFrame(dtm_lsa, index = example, columns = ["component_1","component_2"])

xs = [w[0] for w in dtm_lsa]
ys = [w[1] for w in dtm_lsa]
xs, ys

# Plot scatter plot of points
%pylab inline
import matplotlib.pyplot as plt
figure()
plt.scatter(xs,ys)
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
show()

#Plot scatter plot of points with vectors
%pylab inline
import matplotlib.pyplot as plt
plt.figure()
ax = plt.gca()
ax.quiver(0,0,xs,ys,angles='xy',scale_units='xy',scale=1, linewidth = .01)
ax.set_xlim([-1,1])
ax.set_ylim([-1,1])
xlabel('First principal component')
ylabel('Second principal component')
title('Plot of points against LSA principal components')
plt.draw()
plt.show()

# Compute document similarity using LSA components
similarity = np.asarray(np.asmatrix(dtm_lsa) * np.asmatrix(dtm_lsa).T)
pd.DataFrame(similarity,index=example, columns=example).head(10)

The problem looks like it's due to a combination of the small number of examples you're using and the normalisation step. Because TruncatedSVD maps your count vectors to lots of very small numbers and one comparatively large number, when you normalise these you see some strange behaviour. You can see this by looking at a scatter plot of your data.

dtm_lsa = lsa.fit_transform(dtm.astype(float))
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
    ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()

[plot: unnormalised LSA components]
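To see the imbalance numerically rather than visually, a quick check like the following can help (a sketch, assuming lsa and dtm_lsa from the snippet above are still in scope, and that the installed scikit-learn is recent enough to expose singular_values_):

# Sketch: inspect the magnitudes behind the scatter plot above
# (assumes lsa and dtm_lsa come from the snippet above)
print(lsa.singular_values_)           # size of each retained component
print(lsa.explained_variance_ratio_)  # share of variance captured by each component
print(dtm_lsa)                        # raw 2-D coordinates before normalisation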

I would say that this plot represents your data, as the two coffee examples are out of the way to the right (hard to say much else with a small number of examples). However, when you normalise the data:

dtm_lsa = lsa.fit_transform(dtm.astype(float))
dtm_lsa = Normalizer(copy=False).fit_transform(dtm_lsa)
fig, ax = plt.subplots()
for i in range(dtm_lsa.shape[0]):
    ax.scatter(dtm_lsa[i, 0], dtm_lsa[i, 1], label=f'{i+1}')
ax.legend()

[plot: normalised LSA components]

This pushes some points on top of each other, which will give you similarities of 1. The issue will almost certainly disappear the more variance there is, i.e. the more new samples you add.
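As a side note, normalising the rows and then taking dot products is just cosine similarity, so the similarity matrix from the question can also be computed directly with scikit-learn's cosine_similarity. A small sketch (assuming dtm_lsa, example and pd from the snippets above):

# Sketch: cosine similarity on the LSA coordinates
# (equivalent to the Normalizer + dot-product steps in the question,
#  since cosine similarity normalises each row internally)
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(dtm_lsa)
pd.DataFrame(similarity, index=example, columns=example).round(2)

The pairs of points that sit on top of each other in the normalised plot are exactly the entries that come out as 1 in this matrix.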
