Calculating the number of tokens per year

Question

I wrote a small R script. The input is a set of text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.

library(readtext)
library(quanteda)

# Metadata from filenames
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Keep only the first four characters of the Year field
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)

Does anyone here know a solution?

I tried the tokens_by function of the quanteda package, but it seems to be outdated.

Answer 1

Thanks. I could not get your script to work, but it inspired me to develop an alternative solution:

# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)

# Set the directory containing the text files (adjust to your setup)
data_dir <- "SPARA_paragraphs"

# Read in the text files; docvars are parsed from the "_"-separated file names
rawdata_SPARA <- readtext(file.path(data_dir, "*.txt"), docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))

# Extract the year from the file name
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 0, 4)

# Group the data by year and summarize by tokens
rawdata_SPARA_grouped <- rawdata_SPARA %>% 
    group_by(Year) %>% 
    summarize(tokens = sum(ntoken(text)))

# Print the absolute number of tokens per year
print(rawdata_SPARA_grouped)

Answer 2

You do not need the substr(rawdata_SPARA$Year, 0, 4) step. When readtext is called with docvarsfrom = "filenames", it extracts the year from the file name. In the example below, the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object. Because that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
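To make the file name splitting concrete, here is a rough base R sketch of the idea. It is only an illustration and not readtext's actual implementation; the variable names fname and fields are made up, and the field names are the ones used for the sample files in this answer.

```r
# Illustration only: split a file name into "_"-separated fields,
# similar in spirit to docvarsfrom = "filenames" (not readtext's internals)
fname <- "EU_euro_2004_de_PSE.txt"

# Drop the extension, then split on the separator
fields <- strsplit(tools::file_path_sans_ext(fname), "_", fixed = TRUE)[[1]]

# Attach the names that would be passed as docvarnames
names(fields) <- c("unit", "context", "year", "language", "party")

fields[["year"]]
```

If the year field of your own file names already contains exactly the four-digit year, as it does here, no further trimming with substr is needed.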

Then just group_by year and summarize the tokens. The number of tokens is calculated with quanteda's ntoken function.

See the code below:

library(readtext)
library(quanteda)
library(dplyr)  # provides %>%, group_by() and summarize() used below

# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1")
# Replace the extracted year with randomly sampled years (demo only)
rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)

# Calculate tokens
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)

# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))

Output:

# A tibble: 3 × 2
   year total_tokens
  <int>        <int>
1  2005         5681
2  2006        26564
3  2007        24119
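If you would rather stay within quanteda and work from a corpus object (such as SPARA_corp in the question), the same totals can probably be obtained from the docvars directly. The following is a minimal sketch under that assumption; the toy texts and the names corp, toks, per_doc, totals and by_year are made up for illustration.

```r
library(quanteda)

# Toy corpus with a Year docvar; stands in for a corpus built from readtext output
txt  <- c(a = "One two three.", b = "Four five.", c = "Six seven eight nine.")
corp <- corpus(txt, docvars = data.frame(Year = c("2004", "2004", "2005")))

# Tokenize once, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Variant 1: count tokens per document, then sum within each year
per_doc <- ntoken(toks)
totals  <- tapply(per_doc, docvars(toks, "Year"), sum)

# Variant 2: merge the documents of each year first, then count
by_year <- ntoken(tokens_group(toks, groups = docvars(toks, "Year")))

totals
by_year
```

Both variants should give the same per-year totals. Tokenizing explicitly with tokens() also keeps options such as remove_punct in one visible place.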

Answer 1
1 2022-12-13 14:50:39

Answer 2
0 accepted 2022-12-12 10:38:25
