
word frequency scatterplot in R (words as labels)

I'm currently working on a paper comparing British MPs' roles in Parliament with their roles on Twitter. I have collected Twitter data (most importantly, the raw text) and parliamentary speeches from one MP, and I want to draw a scatterplot showing which words are common to both Twitter and Parliament (top right-hand corner) and which are not (bottom left-hand corner). So the x-axis is word frequency in Parliament, and the y-axis is word frequency on Twitter.

So far, I have done all the work on this paper in R. I have zero experience with R; until now I have only worked with Stata.

I tried adapting this code (http://is-r.tumblr.com/post/37975717466/text-analysis-made-too-easy-with-the-tm-package), but I just can't work it out. The main problem is that the person who wrote it uses a single text document and regular expressions to demarcate which text belongs on which axis. I, however, have two separate documents (which I have saved as .txt files, corpora, and term-document matrices), and each should correspond to one axis.

I'm sorry that a novice such as myself is bothering you with this, and I will devote more time this year to learning the basics of R so that I can solve problems like this by myself. However, this paper is due next Monday, and I simply can't do that much backtracking right now.

I would be really grateful if you could help me,

thanks very much,

Nik

EDIT: I'll put in the code that I've made. It's not quite in the right direction, but this way I can offer a proper example of what I'm dealing with.

I have tried implementing is.R()'s approach by putting the text in question in a CSV file, with a dummy variable classifying whether each entry is Twitter text or speech text. I followed the approach, and at the end I even got a scatterplot; however, it plots a number (I think the position of the word in the dataset?) rather than the word itself. I think the problem might be that R is handling every line in the CSV file as a separate text document.

# In Excel I built a csv dataset that contains all the text, one instance (a single tweet or speech) per line, with an added dummy variable that marks whether the text is a tweet ("istwitter", 1 = tweet).

comparison_watson.df <- read.csv(file="data/watson_combo.csv", stringsAsFactors = FALSE)

# now to make a text corpus out of the data frame

comparison_watson_corpus <- Corpus(DataframeSource(comparison_watson.df))
inspect(comparison_watson_corpus)

# now to make a term-document-matrix

comparison_watson_tdm <-TermDocumentMatrix(comparison_watson_corpus)
inspect(comparison_watson_tdm)

# convert the term-document matrix to a plain matrix (terms are rows, documents are columns)
comparison_watson_tdm <- as.matrix(comparison_watson_tdm)
sort(colSums(comparison_watson_tdm))    # total word count per document
table(colSums(comparison_watson_tdm))

termCountFrame_watson <- data.frame(Term = rownames(comparison_watson_tdm))
# documents are the columns of the matrix, so subset the columns by the dummy
# and sum across each row (term); colSums over subsetted rows mixes up the axes
termCountFrame_watson$twitter <- rowSums(comparison_watson_tdm[, comparison_watson.df$istwitter == 1])
termCountFrame_watson$speech  <- rowSums(comparison_watson_tdm[, comparison_watson.df$istwitter == 0])

head(termCountFrame_watson)

zp1 <- ggplot(termCountFrame_watson)
zp1 <- zp1 + geom_text(aes(x = twitter, y = speech, label = Term))
print(zp1)
Here is a self-contained example with one document per axis:

library(tm)

# two toy documents; their names become the axis variables
txts <- c(twitter = "bla bla bla blah blah blub",
          speech  = "bla bla bla bla bla bla blub blub")
corp <- Corpus(VectorSource(txts))
term.matrix <- TermDocumentMatrix(corp)   # terms are rows, documents are columns
term.matrix <- as.matrix(term.matrix)
colnames(term.matrix) <- names(txts)
term.matrix <- as.data.frame(term.matrix)

library(ggplot2)
# place each term at (frequency on twitter, frequency in speech)
ggplot(term.matrix, 
       aes_string(x = names(txts)[1], 
                  y = names(txts)[2], 
                  label = "rownames(term.matrix)")) + 
  geom_text()
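
To connect this to the CSV described in the question (one tweet or speech per row), one option is to collapse each group into a single document first, so the term-document matrix ends up with exactly two columns. A minimal sketch, assuming the data frame has a text column named "text" alongside the "istwitter" dummy (adjust the column names to your file):

df <- read.csv("data/watson_combo.csv", stringsAsFactors = FALSE)

# paste all tweets into one document and all speeches into another;
# "text" is an assumed column name
txts <- c(twitter = paste(df$text[df$istwitter == 1], collapse = " "),
          speech  = paste(df$text[df$istwitter == 0], collapse = " "))

corp <- Corpus(VectorSource(txts))
term.matrix <- as.matrix(TermDocumentMatrix(corp))
colnames(term.matrix) <- names(txts)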

You might also want to try out these two buddies:

library(wordcloud)
comparison.cloud(term.matrix)
commonality.cloud(term.matrix)
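
comparison.cloud() sizes each word by how far its frequency in one document deviates from its average across the documents, while commonality.cloud() shows only the words that appear in both. Both are documented to take a term-frequency matrix with words as rows and one column per document, so pass the matrix from before the as.data.frame() step.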


You are not posting a reproducible example, so I cannot give you code, only point you to resources. Text scraping and processing is a bit difficult in R, but there are many guides. Check this and this. In the last steps you can get word counts.
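As a flavour of those last steps: once the raw text sits in a character vector, a bare-bones word count needs only base R. A minimal sketch, skipping the cleaning (stopwords, punctuation) a real analysis would add:

txt  <- "bla bla bla blah blah blub"
# split on whitespace, tabulate, and sort by descending frequency
freq <- sort(table(strsplit(tolower(txt), "\\s+")[[1]]), decreasing = TRUE)
freq
#  bla blah blub 
#    3    2    1 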

In the example from One R Tip A Day you get the word list at d$word and the word frequencies at d$freq.
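
Given one such frequency table per source, merging them by word yields exactly the data the question's scatterplot needs. A sketch with hypothetical d_twitter and d_speech tables in the same word/freq format:

d_twitter <- data.frame(word = c("bla", "blah", "blub"), freq = c(3, 2, 1))
d_speech  <- data.frame(word = c("bla", "blub"), freq = c(6, 2))

# keep words that occur in only one source, counting them as 0 in the other
d <- merge(d_twitter, d_speech, by = "word", all = TRUE,
           suffixes = c(".twitter", ".speech"))
d[is.na(d)] <- 0

library(ggplot2)
ggplot(d, aes(x = freq.twitter, y = freq.speech, label = word)) +
  geom_text()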
