Why are the signs of my topic weights changing from run to run?

I'm running the LSI program from Gensim's Topics and Transformations tutorial, and for some reason the signs of the topic weights keep switching from positive to negative and vice versa. For example, this is what I get when I print using the lines:

for doc, as_text in zip(corpus_lsi, documents):
    print(doc, as_text)

Run 1
[(0, 0.066007833960900791), (1, 0.52007033063618491), (2, -0.37649581219168904)]
[(0, 0.196675928591421), (1, 0.7609563167700063), (2, 0.5080674581001664)]
[(0, 0.089926399724459982), (1, 0.72418606267525132), (2, -0.408989731553764)]
[(0, 0.075858476521777865), (1, 0.63205515860034334), (2, -0.53935336057339001)]
[(0, 0.10150299184979866), (1, 0.57373084830029653), (2, 0.67093385852959075)]
[(0, 0.70321089393783254), (1, -0.1611518021402539), (2, -0.18266089635241448)]
[(0, 0.87747876731198449), (1, -0.16758906864658912), (2, -0.10880822642632856)]
[(0, 0.90986246868185872), (1, -0.14086553628718496), (2, 0.00087117874886860625)]
[(0, 0.61658253505692762), (1, 0.053929075663897361), (2, 0.25568697959599318)]

Run 2
[(0, 0.066007833960908563), (1, -0.52007033063618446), (2, -0.37649581219168959)]
[(0, 0.19667592859143226), (1, -0.76095631677000253), (2, 0.50806745810016629)]
[(0, 0.089926399724470751), (1, -0.72418606267525032), (2, -0.40898973155376284)]
[(0, 0.075858476521787177), (1, -0.63205515860034223), (2, -0.5393533605733889)]
[(0, 0.10150299184980684), (1, -0.57373084830029419), (2, 0.67093385852959098)]
[(0, 0.70321089393782976), (1, 0.16115180214026417), (2, -0.18266089635241456)]
[(0, 0.87747876731198149), (1, 0.16758906864660211), (2, -0.10880822642632891)]
[(0, 0.90986246868185627), (1, 0.14086553628719861), (2, 0.00087117874886795399)]
[(0, 0.61658253505692828), (1, -0.053929075663887563), (2, 0.25568697959599251)]

Run 3
[(0, 0.066007833960902929), (1, -0.52007033063618535), (2, 0.37649581219168821)]
[(0, 0.19667592859142491), (1, -0.76095631677000497), (2, -0.50806745810016662)]
[(0, 0.089926399724463771), (1, -0.7241860626752511), (2, 0.40898973155376317)]
[(0, 0.075858476521781085), (1, -0.63205515860034334), (2, 0.5393533605733889)]
[(0, 0.10150299184980124), (1, -0.57373084830029542), (2, -0.67093385852959064)]
[(0, 0.70321089393783143), (1, 0.16115180214025732), (2, 0.18266089635241564)]
[(0, 0.87747876731198304), (1, 0.16758906864659326), (2, 0.10880822642632952)]
[(0, 0.90986246868185761), (1, 0.1408655362871892), (2, -0.00087117874886778746)]
[(0, 0.61658253505692784), (1, -0.053929075663894419), (2, -0.25568697959599318)]

I am running Python 3.5.2 on a PC, coding in IntelliJ.

Has anyone encountered this problem, using the Gensim library or elsewhere?

Underneath, the LSI model is nothing but an implementation of fast truncated SVD. SVD computes singular vectors, which are eigenvectors of the matrix multiplied by its transpose, and these vectors correspond to the topics. However, an eigenvector remains an eigenvector after it is multiplied by -1. So the sign may flip depending on how the algorithm is implemented. In fact, this is the case with the SVD implementation in the popular library LAPACK, and even with the numpy implementation.

The sign really does not matter here, since an eigenvector multiplied by -1 is still an eigenvector.
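To make the sign ambiguity concrete, here is a minimal sketch using plain numpy rather than Gensim. The random matrix `A` is a stand-in for a document-term matrix and is purely an assumption for illustration. Flipping the sign of one pair of singular vectors leaves the reconstructed matrix unchanged, and it also leaves the cosine similarities between documents in topic space unchanged.

```python
import numpy as np

# Toy stand-in for a document-term matrix (9 documents, 12 terms); illustrative only.
rng = np.random.default_rng(0)
A = rng.random((9, 12))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Flip the sign of the second singular-vector pair (column 1 of U, row 1 of Vt).
signs = np.ones_like(s)
signs[1] = -1.0
U_flipped = U * signs               # scales column i of U by signs[i]
Vt_flipped = Vt * signs[:, None]    # scales row i of Vt by signs[i]

# Both factorizations reconstruct the same matrix.
same_matrix = np.allclose((U * s) @ Vt, (U_flipped * s) @ Vt_flipped)

# Document coordinates in a 3-topic space, before and after the flip.
k = 3
docs = U[:, :k] * s[:k]
docs_flipped = U_flipped[:, :k] * s[:k]

def cosine_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

same_similarities = np.allclose(cosine_matrix(docs), cosine_matrix(docs_flipped))
print(same_matrix, same_similarities)
```

The coordinates themselves differ in one column, but anything computed from inner products between documents is unaffected. This is why the flipped signs are harmless for similarity queries.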

There are a number of possibilities:

  1. The order of the topics can differ, or the topics or vocabulary can change between runs. If you run everything from scratch every time (including vocabulary generation, etc.), the eventual topics or the vocabulary may change between runs, which could explain the differences.
  2. The calculations are numerically unstable. This could happen if a value close to 0.0 gets rounded to either -0.0 or +0.0 (depending on the order of calculation, which can sometimes differ) and influences the sign of the result. This can be related to a generic numerical bug or to a combination of software and hardware that causes it.
  3. Some other reason not yet identified.
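One way to tell these possibilities apart from a plain sign flip is to compare two runs column by column. The sketch below uses the Run 1 and Run 3 values from the question, rounded to 8 decimals. The helper names `column_sign_flips` and `normalize_signs` are hypothetical and not part of Gensim. The sign convention used (largest-magnitude entry of each topic made positive) is only one possible choice.

```python
import numpy as np

# Topic weights from the question, rounded to 8 decimals (rows = documents, columns = topics).
run1 = np.array([
    [0.06600783,  0.52007033, -0.37649581],
    [0.19667593,  0.76095632,  0.50806746],
    [0.08992640,  0.72418606, -0.40898973],
    [0.07585848,  0.63205516, -0.53935336],
    [0.10150299,  0.57373085,  0.67093386],
    [0.70321089, -0.16115180, -0.18266090],
    [0.87747877, -0.16758907, -0.10880823],
    [0.90986247, -0.14086554,  0.00087118],
    [0.61658254,  0.05392908,  0.25568698],
])
run3 = np.array([
    [0.06600783, -0.52007033,  0.37649581],
    [0.19667593, -0.76095632, -0.50806746],
    [0.08992640, -0.72418606,  0.40898973],
    [0.07585848, -0.63205516,  0.53935336],
    [0.10150299, -0.57373085, -0.67093386],
    [0.70321089,  0.16115180,  0.18266090],
    [0.87747877,  0.16758907,  0.10880823],
    [0.90986247,  0.14086554, -0.00087118],
    [0.61658254, -0.05392908, -0.25568698],
])

def column_sign_flips(a, b, atol=1e-6):
    """Return per-column signs (+1/-1) with a ~= b * signs, or None if no such signs exist."""
    signs = np.sign(np.sum(a * b, axis=0))
    if np.any(signs == 0):
        return None
    return signs if np.allclose(a, b * signs, atol=atol) else None

def normalize_signs(x):
    """Flip each column so that its largest-magnitude entry is positive."""
    rows = np.argmax(np.abs(x), axis=0)
    signs = np.sign(x[rows, np.arange(x.shape[1])])
    return x * signs

signs = column_sign_flips(run1, run3)
print(signs)
print(np.allclose(normalize_signs(run1), normalize_signs(run3)))
```

If `column_sign_flips` returns an array of signs, the two runs contain the same topics in the same order and differ only by whole-topic sign flips. A `None` result would instead point toward reordered topics, a changed vocabulary, or some other difference.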
