
Quanteda: How do I create a corpus and plot dispersion of words?

I have some data which looks like this:

  date      signs  horoscope                                                      newspaper   
  <chr>     <chr>  <chr>                                                          <chr>       
1 06-06-20~ ARIES  Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO    The greatest pressures are coming from all directions at once~ Indian Expr~

I would like to create a corpus from this data in which the horoscope texts are grouped into documents by newspaper and signs.

For example, all ARIES horoscopes in the newspaper Times of India should form one document, with the texts arranged chronologically (their index should be ordered by date).

Since I don't know how to group the text by newspaper and signs, I tried creating a separate corpus for each newspaper. This is what I tried:


library("dplyr")     # filter(), select(), %>%
library("quanteda")  # corpus(), docnames(), tokens(), kwic()

# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
  filter(newspaper == "Times of India") %>%
  select(-c("newspaper"))

# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")

# Create docids
docids <- paste(h_toi$signs)

# Use this as docnames
docnames(horo_corp_toi) <- docids

head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1"  "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1" 

But as you can see, the docnames for the corpus are "ARIES.1", "TAURUS.1" and so on. This is a problem because when I try to plot it using quanteda's textplot_xray(), thousands of documents are plotted instead of just 12 documents, one for each sign:

# Plot lexical dispersion of love in all signs 
kwic(tokens(horo_corp_toi), pattern = "love") %>%
    textplot_xray()

[image]

Instead, I would like to be able to do something like this:

[image]

I am not able to get this visualization because I don't know how to create and manipulate the corpus in the first place. How can I do this, and what am I doing wrong?

A sample dput() of the data is linked in the original post.

Since the question asks how to group by both sign and newspaper, let me answer that one first.

library("quanteda")
## Package version: 3.1.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textplots")

## horoscopes <- [per linked dput in OP]

corp <- corpus(horoscopes, text_field = "horoscope")
toks <- tokens(corp)

# grouped by sign and newspaper
tokens_group(toks, groups = interaction(signs, newspaper)) %>%
  kwic(pattern = "love") %>%
  textplot_xray()
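The question also asks for the texts inside each grouped document to run chronologically. A minimal sketch of one way to do that, on made-up data: sort the data frame by date before building the corpus, then group. The toy data frame, its texts, and the dd-mm-yyyy date format are assumptions for illustration (the sample in the question shows the dates truncated), so adjust the format string to the real data.

```r
library("quanteda")

# Hypothetical stand-in for `horoscopes`; dates ASSUMED to be dd-mm-yyyy
toy <- data.frame(
  date      = c("07-06-2020", "06-06-2020", "06-06-2020", "07-06-2020"),
  signs     = c("ARIES", "ARIES", "TAURUS", "TAURUS"),
  horoscope = c("Second day brings love.", "First day is calm.",
                "Taurus day one.", "Taurus day two, more love."),
  newspaper = c("Times of India", "Times of India",
                "Indian Express", "Indian Express"),
  stringsAsFactors = FALSE
)

# Sort rows chronologically, then reset the row names
toy <- toy[order(as.Date(toy$date, format = "%d-%m-%Y")), ]
rownames(toy) <- NULL

toks_toy <- tokens(corpus(toy, text_field = "horoscope"))

# One document per sign-newspaper combination present in the data
grouped <- tokens_group(toks_toy, groups = interaction(signs, newspaper))

ndoc(grouped)
docnames(grouped)
```

Because the rows are sorted before grouping, the token positions in each grouped document should follow the date order, which is what the x-axis of textplot_xray() displays.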

To get the output requested in the question (only the last image is shown here), you can loop through the newspapers and group only by signs. Note that the number of signs here is limited because the sample data provided does not cover the whole zodiac.

# separate kwic for each newspaper
for (i in unique(toks$newspaper)) {
  thiskwic <- toks %>%
    tokens_subset(newspaper == i) %>%
    tokens_group(signs) %>%
    kwic(pattern = "love")
  # ggplot objects are not auto-printed inside a for loop, so print explicitly
  print(
    textplot_xray(thiskwic) +
      ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
  )
}
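As for what went wrong in the question's attempt: assigning the sign as the docname only relabels the documents, it does not merge them, which is why names such as "ARIES.1" appear and why every row is plotted separately. A small sketch on made-up texts (the three sentences and object names below are illustrative, not from the original data) that contrasts renaming with grouping:

```r
library("quanteda")

# Made-up texts: three documents, two of which belong to the same sign
corp_toy <- corpus(c("Love is near.", "Take it slow.", "More love ahead."))
sign_toy <- c("ARIES", "TAURUS", "ARIES")

# Renaming keeps one document per original row
docnames(corp_toy) <- sign_toy
ndoc(corp_toy)
docnames(corp_toy)

# Grouping is what actually combines the texts that share a sign
toks_by_sign <- tokens_group(tokens(corp_toy), groups = sign_toy)
ndoc(toks_by_sign)
docnames(toks_by_sign)
```

The document count stays the same after renaming and only drops once the tokens are grouped.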

