
Visualise distances between texts

I'm working on a research project for school. I've written some text-mining software that analyzes legal texts in a collection and spits out a score indicating how similar they are. I ran the program to compare each text with every other text, and I have data like this (although with many more points):

codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582

Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates texts that are similar and a large value indicates texts that are dissimilar: the value can then be the distance between the points representing the texts on a graph. Taking the reciprocal of each score gives:

codeofhammurabi.txt crete.txt      1.75212
codeofhammurabi.txt iraqi.txt      0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt         0.7965
crete.txt iraqi.txt                3.0344
crete.txt magnacarta.txt           1.6955
crete.txt us.txt                   2.0329
iraqi.txt magnacarta.txt           1.1983
iraqi.txt us.txt                   0.7261
magnacarta.txt us.txt              0.9125
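
For reference, a minimal R sketch of that reciprocal inversion (the scores are typed in from the first table above, in the same order):

scores <- c(0.570737, 1.13475, 0.945746, 1.25546, 0.329545,
            0.589786, 0.491903, 0.834488, 1.37718, 1.09582)
round(1 / scores, 4)  # reproduces the distances listed above, up to rounding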

SHORT VERSION: Those values directly above are distances between points on a scatter plot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What's the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.

If you can even point me in a direction, I'll be infinitely grateful.

Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-dimensional mapping of the dissimilarities. Principal coordinates analysis (PCO) and non-metric multidimensional scaling (nMDS) are two such methods. I would suggest you plot the results of applying one or the other of these methods to your data. I provide examples of both below.

First, load in the data you supplied (without labels at this stage):

con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)

What you effectively have is the following distance matrix:

mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
  c("codeofhammurabi","crete","iraqi","magnacarta","us")

> mat
                codeofhammurabi  crete  iraqi magnacarta us
codeofhammurabi              NA     NA     NA         NA NA
crete                   1.75212     NA     NA         NA NA
iraqi                   0.88120 3.0344     NA         NA NA
magnacarta              1.05730 1.6955 1.1983         NA NA
us                      0.79650 2.0329 0.7261     0.9125 NA

R, in general, needs a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight to the "dist" object like this:

class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE

> vec
           codeofhammurabi   crete   iraqi magnacarta
crete              1.75212                           
iraqi              0.88120 3.03440                   
magnacarta         1.05730 1.69550 1.19830           
us                 0.79650 2.03290 0.72610    0.91250
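
For completeness, the as.dist() route mentioned above produces the same object; a minimal sketch, assuming mat has been built as shown earlier:

vec2 <- as.dist(mat)  # uses the lower triangle of mat; labels come from its rownames
vec2                  # should print the same dissimilarities as vec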

Now that we have an object of the right type, we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the vegan package as I am somewhat familiar with it...

require("vegan")

Principal coordinates

First I illustrate how to do principal coordinates analysis on your data using vegan.

pco <- capscale(vec ~ 1, add = TRUE)
pco

> pco
Call: capscale(formula = vec ~ 1, add = TRUE)

              Inertia Rank
Total           10.42     
Unconstrained   10.42    3
Inertia is squared Unknown distance (euclidified) 

Eigenvalues for unconstrained axes:
 MDS1  MDS2  MDS3 
7.648 1.672 1.098 

Constant added to distances: 0.7667353

The first PCO axis is by far the most important in explaining the between-text differences, as shown by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method:

plot(pco)

which produces

(image: PCO ordination plot)
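
If you want to quantify how dominant that first axis is, the eigenvalues can be extracted and expressed as proportions of total inertia; a small sketch using vegan's eigenvals() accessor (reporting proportions is just one option):

ev <- eigenvals(pco)
round(ev / sum(ev), 3)  # proportion of inertia captured by each PCO axis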

Non-metric multidimensional scaling

A non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in a Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above) and an iterative algorithm is required to provide a solution. Random starts are advised to assure yourself that the algorithm hasn't converged to a sub-optimal, locally optimal solution. Vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, see isoMDS in package MASS.

set.seed(42)
sol <- metaMDS(vec)

> sol

Call:
metaMDS(comm = vec) 

global Multidimensional Scaling using monoMDS

Data:     vec 
Distance: user supplied 

Dimensions: 2 
Stress:     0 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation 
Species: scores missing

With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be produced using the plot method:

plot(sol, type = "text", display = "sites")

which produces

(image: nMDS ordination plot)
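
As a quick check of how well the configuration preserves the original rank ordering, vegan's stressplot() draws a Shepard-type diagram for metaMDS objects; with this tiny data set the fit is essentially perfect:

stressplot(sol)  # observed dissimilarity versus ordination distance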

In both cases the distance on the plot between samples is the best 2-d approximation of their dissimilarity. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarity (3 dimensions are needed to represent all of the dissimilarities fully), whereas in the nMDS plot the distance between samples reflects the rank dissimilarity, not the actual dissimilarity between observations. But essentially, distances on the plot represent the computed dissimilarities: texts that are close together are most similar, and texts located far apart on the plot are the most dissimilar to one another.
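
If you want to see how much the 2-d PCO view distorts the original values, one option is to compute inter-point distances from the first two PCO axes and compare them with the original dissimilarities; a sketch using vegan's scores(), with the comparison left to inspection by eye:

pco_xy <- scores(pco, display = "sites", choices = 1:2)  # site scores on PCO axes 1 and 2
round(dist(pco_xy), 3)                                   # compare with the original distances in vec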

If the question is 'how can I do something like this guy did?' (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on Euclidean distances of document-topic posterior probabilities.

"This guy" is Matt Jockers, who is an innovative scholar in the digital humanities. “这个人”是Matt Jockers,他是数字人文学科的创新学者。 He has documented some of his methods on his blog and else where , etc. Jockers mostly works in R and shares some of his code . 他已经在他的博客上记录了他的一些方法,以及其他 地方等等 .Jockers主要在R分享他的一些代码 His basic work flow seems to be: 他的基本工作流程似乎是:

  1. break plain text into 1000-word chunks,
  2. remove stopwords (don't stem),
  3. do part-of-speech tagging and keep nouns only,
  4. build a topic model (using LDA),
  5. calculate Euclidean distances between documents based on topic proportions, subset the distances to keep only ones below a certain threshold, and then
  6. visualise with a force-directed graph

Here's a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:

#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()

Get data...

#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05 
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]

Clean and reshape...

#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
}, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation, whitespaces, words <3 characters long..
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, wordLengths=c(3,Inf), FUN = tm_reduce, tmFuns = funcs)

Part-of-speech tagging and subsetting of nouns...

#### Part-of-speech tagging to extract nouns only
library("openNLP")
library("NLP")
# function for POS tagging
tagPOS <-  function(x) {

  s <- NLP::as.String(x)
  ## Need sentence and word token annotations.

  a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
  a3 <- NLP::annotate(s,  openNLP::Maxent_POS_Tag_Annotator(), a2)

  ## Determine the distribution of POS tags for word tokens.
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

  ## Extract token/POS pairs (all of them): easy - not needed
  # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  return(unlist(POStags))
} 
# a  loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list",  length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
  corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
  print(i) # nice to see what we're up to
  gc()
}

# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[8] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"

Topic modelling with latent Dirichlet allocation...

#### topic modelling with LDA (Jockers uses the lda package and MALLET, maybe topicmodels also, I'm not sure. I'm most familiar with the topicmodels package, so here it is. Note that MALLET can be run from R: https://gist.github.com/benmarwick/4537873)
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix 
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model 
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells 
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma),  paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
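
Before computing distances it can be worth a quick sanity check of the topics themselves; the topicmodels package has a terms() accessor for this (a minimal sketch):

terms(JSS_TM, 5)  # the five highest-probability terms in each of the k topics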

Calculate Euclidean distances of one document from another, using the topic probabilities as the document's 'DNA'...

#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <-  as.matrix(daisy(JSS_topic_df, metric =  "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only 
# closely related documents and avoid a dense spaghetti diagram 
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0

Visualize using a force-directed graph...

#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
g <- as.undirected(graph.adjacency(JSS_topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1,  vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)

(image: igraph network diagram, Fruchterman-Reingold layout)

And if you want to use the Force Atlas 2 algorithm in Gephi, you simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:

# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml") 

Here's the Gephi plot with the Force Atlas 2 algorithm:

(image: Gephi network plot, Force Atlas 2 layout)

You could do a network graph using igraph. The Fruchterman-Reingold layout has a parameter to provide edge weights. Weights bigger than 1 result in more "attraction" along the edges; weights less than 1 do the opposite. In your example, crete.txt has the lowest distances, sits in the middle, and has shorter edges to the other vertices. In fact, it is closest to iraqi.txt. Note that you have to invert the data for E(g)$weight to get the correct distances.

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582")
par(mar=c(3,7,3.5,5), las=1)

library(igraph)
g <- graph.data.frame(data1, directed = FALSE)
E(g)$weight <- 1/data1[,3] # inverted: high weights = more attraction along the edges
l <- layout.fruchterman.reingold(g, weights=E(g)$weight)
plot(g, layout=l)

(image: force-directed graph of the five texts)

Are you making all pairwise comparisons? Depending on how you calculate the distance (similarity), I am not sure it is possible to make such a scatter plot. When you have only 3 text files to consider, the scatter plot is easy to make (a triangle with sides equal to the distances), but when you add the fourth point, you might not be able to place it in a location where its distances to the existing 3 points satisfy all the constraints.

But if you can do that, then you have a solution: just keep adding new points... I think... Or, if you don't need the distances on the scatter plot to be precise, you can simply draw a web and label the distances.
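
To make that concrete, here is a small sketch in base R that embeds the five texts in two dimensions with classical MDS (cmdscale) and then measures how far the drawn distances deviate from the supplied ones; a non-zero residual means no flat drawing can honour every distance exactly (the matrix is typed in from the inverted table in the question):

labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
d <- as.dist(matrix(c(0,       1.75212, 0.8812, 1.0573, 0.7965,
                      1.75212, 0,       3.0344, 1.6955, 2.0329,
                      0.8812,  3.0344,  0,      1.1983, 0.7261,
                      1.0573,  1.6955,  1.1983, 0,      0.9125,
                      0.7965,  2.0329,  0.7261, 0.9125, 0),
                    nrow = 5, dimnames = list(labs, labs)))
fit <- cmdscale(d, k = 2, eig = TRUE)  # classical (metric) MDS into 2-d
max(abs(dist(fit$points) - d))         # worst-case distortion across all pairs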

Here's a potential solution for Matlab:

You can arrange your data into a formal 5x5 similarity matrix S, where element S(i,j) represents the similarity (or dissimilarity) between document i and document j. Assuming your distance measure is an actual metric, you can apply multidimensional scaling to this matrix via mdscale(S,2).

This function will attempt to find a 5x2-dimensional representation of your data that preserves the similarities (or dissimilarities) between your classes found in the higher dimensions. You can then visualize this data as a scatter plot of 5 points.

You could also try mdscale(S,3) to project into a 5x3-dimensional matrix, which you can then visualize with plot3().

If you want circles representing the distances between points, this would work in R (I used the first table in your example):

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582")
par(mar=c(3,7,3.5,5), las=1)

symbols(data1[,1],data1[,2], circles=data1[,3], inches=0.55, bg="lightblue", xaxt="n", yaxt="n", ylab="")
axis(1, at=data1[,1],labels=data1[,1])
axis(2, at=data1[,2],labels=data1[,2])
text(data1[,1], data1[,2], round(data1[,3],2), cex=0.9)

(image: bubble chart of pairwise distances)

This Matlab snippet should work if you want to try a 3D bar view:

% Load data from file 'dist.dat', with values separated by spaces
fid = fopen('dist.dat');
data = textscan(                            ...
        fid,                   '%s%s%f', ...
        'Delimiter',           ' ',      ...
        'MultipleDelimsAsOne', true      ...
);
fclose(fid);

% Find all unique sources
text_bodies = unique(reshape([data{1:2}],[],1));

% Iterate through the records and complete the similarity matrix
N = numel(text_bodies);
similarity = NaN(N,N);
for k = 1:size(data{1},1)
        n1 = find(strcmp(data{1}{k}, text_bodies));
        n2 = find(strcmp(data{2}{k}, text_bodies));

        similarity(n1, n2) = data{3}(k); % Symmetrical part ignored
end;

% Display 3D bar chart
bar3(similarity);
