
Plotting a 2D matrix in Python: code and most useful visualization

I have a very large matrix (10 x 55678) in NumPy matrix format. The rows of this matrix correspond to some "topics" and the columns correspond to words (the unique words from a text corpus). Each entry (i, j) in this matrix is a probability x, meaning that word j belongs to topic i with probability x. Since I am using IDs rather than the real words, and since the matrix is really large, I need to visualize it in some way. Which visualization do you suggest? A simple plot, or a more sophisticated and informative one? (I am asking because I don't know which types of visualization are useful.) If possible, can you give me an example that uses a NumPy matrix? Thanks.

The reason I asked this question is that I want a general view of the word-topic distributions in my corpus. Any other methods are welcome.

You could certainly use matplotlib's imshow or pcolor method to display the data, but as the comments have mentioned, it might be hard to interpret without zooming in on subsets of the data.

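import numpy as np
import matplotlib.pyplot as plt

# Random example with words as rows and topics as columns
# (the transpose of the layout described in the question).
a = np.random.normal(0.0, 0.5, size=(5000, 10))**2
a = a / np.sum(a, axis=1)[:, None]  # normalize each row to sum to 1

plt.pcolor(a)
plt.show()

[Figure: unsorted random example]

You could then sort the words by the cluster they most probably belong to:

maxvi = np.argsort(a, axis=1)  # per word: cluster indices in order of increasing probability
ii = np.argsort(maxvi[:, -1])  # order the words by the index of their most probable cluster

plt.pcolor(a[ii, :])
plt.show()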


Here the word index on the y-axis no longer equals the original ordering, since the rows have been sorted.

Another possibility is to use the networkx package to plot word clusters for each category, where the words with the highest probability are represented by nodes that are either larger or closer to the center of the graph, and words that have no membership in the category are ignored. This might be easier since you have a large number of words and a small number of categories.
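As a rough sketch of the networkx idea, the snippet below builds a star graph for a single topic and draws it. It assumes the matrix is laid out as in the question (topics as rows, words as columns); the function names topic_graph and draw_topic_graph, the top_n cutoff, and the random stand-in data are all made up for illustration. Node area scales with probability, and spring_layout is given the probabilities as edge weights, which tends to pull the more probable words toward the hub.

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt


def topic_graph(probs, topic, top_n=15):
    """Star graph linking one topic to its top_n most probable words.

    probs is assumed to have shape (n_topics, n_words).
    """
    row = probs[topic]
    top_words = np.argsort(row)[::-1][:top_n]  # word ids, most probable first
    g = nx.Graph()
    hub = f"topic {topic}"
    g.add_node(hub, prob=0.0)
    for w in top_words:
        # Edge weight = probability; heavier edges attract more strongly
        # in the spring layout used below.
        g.add_node(int(w), prob=float(row[w]))
        g.add_edge(hub, int(w), weight=float(row[w]))
    return g


def draw_topic_graph(g, ax=None):
    """Draw the graph with node area growing with word probability."""
    if ax is None:
        _, ax = plt.subplots()
    pos = nx.spring_layout(g, weight="weight", seed=0)
    node_probs = np.array([g.nodes[n]["prob"] for n in g.nodes])
    sizes = 200 + 2000 * node_probs / node_probs.max()
    nx.draw_networkx(g, pos, node_size=sizes, font_size=8, ax=ax)
    ax.set_axis_off()
    return ax


# Synthetic stand-in for the real topic-word matrix (10 topics x 500 words).
rng = np.random.default_rng(0)
probs = rng.random((10, 500)) ** 4
probs /= probs.sum(axis=0, keepdims=True)  # each word's topic probabilities sum to 1

g = topic_graph(probs, topic=0, top_n=15)
ax = draw_topic_graph(g)
```

Call plt.show() (or ax.figure.savefig(...)) to view the result. Since the nodes are word IDs, you would probably want to relabel them with the real words before drawing.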

Hopefully one of these suggestions is useful.

The key thing to consider is whether you have important structure along both dimensions of the matrix. If you do, then it's worth trying a colored matrix plot (e.g., imshow), but if your ten topics are basically independent, you're probably better off doing ten individual line or histogram plots. Both kinds of plot have advantages and disadvantages.
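For the "ten individual plots" option, a minimal sketch could look like the following. It again assumes topics are rows; the function name plot_topic_histograms, the 5-column grid, the bin count, and the random data are illustrative choices, not anything prescribed by the question.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_topic_histograms(probs, bins=50, n_cols=5):
    """One histogram of word probabilities per topic (one topic per row of probs)."""
    n_topics = probs.shape[0]
    n_rows = -(-n_topics // n_cols)  # ceiling division
    fig, axes = plt.subplots(n_rows, n_cols,
                             figsize=(3 * n_cols, 2.5 * n_rows),
                             sharex=True, sharey=True, squeeze=False)
    for i, ax in enumerate(axes.flat):
        if i >= n_topics:
            ax.set_visible(False)  # hide unused panels
            continue
        # Log-scaled counts, because most words have a tiny probability.
        ax.hist(probs[i], bins=bins, log=True)
        ax.set_title(f"topic {i}")
    fig.tight_layout()
    return fig, axes


# Synthetic stand-in for the real topic-word matrix (10 topics x 5000 words).
rng = np.random.default_rng(0)
probs = rng.random((10, 5000)) ** 4
probs /= probs.sum(axis=0, keepdims=True)

fig, axes = plot_topic_histograms(probs)
```

Call plt.show() to display the grid. Shared axes make the ten panels directly comparable, which is the main thing a color scale does poorly.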

In particular, in full matrix plots the z-axis color values are not very precise or quantitative, so it's difficult to see, for example, small ripples on a trend, or to assess rates of change quantitatively; that is a significant cost. They are also more difficult to pan and zoom, since one can get lost and end up not examining the entire plot, whereas panning along a 1D plot is trivial.

Also, of course, as others have mentioned, 50K points are too many to actually visualize, so you'll need to sort them, or otherwise reduce the number of values that you actually need to assess visually.
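One possible way to cut the number of values down, sketched here under the assumption that only the most probable words of each topic matter, is to keep the union of each topic's top k words and plot just that submatrix. The function name top_words_submatrix, the value k=20, and the random data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt


def top_words_submatrix(probs, k=20):
    """Return (word_ids, submatrix) for the union of each topic's k most probable words."""
    top = np.argsort(probs, axis=1)[:, -k:]  # (n_topics, k) word ids per topic
    word_ids = np.unique(top)                # union of those ids, duplicates removed
    sub = probs[:, word_ids]
    # Group the kept words by the topic in which they are most probable.
    order = np.argsort(sub.argmax(axis=0), kind="stable")
    return word_ids[order], sub[:, order]


# Synthetic stand-in for the real topic-word matrix (10 topics x 5000 words).
rng = np.random.default_rng(0)
probs = rng.random((10, 5000)) ** 4
probs /= probs.sum(axis=0, keepdims=True)

word_ids, sub = top_words_submatrix(probs, k=20)

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(sub, aspect="auto", interpolation="nearest")
ax.set_xlabel("selected words (grouped by dominant topic)")
ax.set_ylabel("topic")
fig.colorbar(im, ax=ax, label="probability")
```

Call plt.show() to display it. With at most 10 x 20 = 200 columns the image is small enough to read without zooming, and word_ids maps each column back to the original word ID.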

In practice, though, finding a good visualization technique for a given data set is not always trivial, and for large and complex data sets people try everything that has a chance of being helpful, and then choose what actually helps.
