Count number of tokens per year
I wrote a small R script. The input is a set of text files (thousands of journal articles). I generated the metadata (including the publication year) from the file names. Now I want to calculate the total number of tokens per year, but I am not getting anywhere.
library(readtext)
library(quanteda)
# Metadata from file names
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt", docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Keep only the first four characters of the Year field
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 1, 4)
# Corpus
SPARA_corp <- corpus(rawdata_SPARA)
Does anyone here know a solution?
I tried the tokens_by function of the quanteda package, which seems to be outdated.
Thanks. I could not get your script to work, but it inspired me to develop an alternative solution:
# Load the necessary libraries
library(readtext)
library(dplyr)
library(quanteda)
# Directory containing the text files (not used below; readtext() is given a relative path)
dir <- "/Textfiles/SPARA_paragraphs"
# Read in the text files and derive the metadata from the file names
rawdata_SPARA <- readtext("SPARA_paragraphs/*.txt",
                          docvarsfrom = "filenames", dvsep = "_",
                          docvarnames = c("Unit", "Year", "Volume", "Issue"))
# Keep only the first four characters of the Year field
rawdata_SPARA$Year <- substr(rawdata_SPARA$Year, 1, 4)
# Group the data by year and sum the token counts
rawdata_SPARA_grouped <- rawdata_SPARA %>%
  group_by(Year) %>%
  summarize(tokens = sum(ntoken(text)))
# Print the absolute number of tokens per year
print(rawdata_SPARA_grouped)
You do not need the substr() call on rawdata_SPARA$Year. When you call the readtext function, it extracts the year from the file name. In the example below, the file names have a structure like EU_euro_2004_de_PSE.txt, and 2004 is automatically inserted into the readtext object. Because that object inherits from data.frame, you can use standard data manipulation functions, e.g. from the dplyr package.
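To make the file-name convention concrete, here is a minimal base R sketch of the kind of splitting that docvarsfrom = "filenames" relies on. It is only an illustration, assuming the parts are separated by underscores; it is not readtext's actual implementation, and the variable names are made up.

```r
# Illustration only: split a file name into its underscore-separated parts.
# This mimics the idea behind docvarsfrom = "filenames", not readtext's code.
fname <- "EU_euro_2004_de_PSE.txt"

# Drop the extension, then split on the separator
parts <- strsplit(tools::file_path_sans_ext(fname), "_", fixed = TRUE)[[1]]

# Hypothetical field names, matching the docvarnames used in the example
names(parts) <- c("unit", "context", "year", "language", "party")

# The third field is the year
year <- as.integer(parts[["year"]])
print(year)
```

If your own file names carry extra characters in the year field, a trimming step such as substr() may still be needed after all.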
Then just group_by year and summarize the tokens. The number of tokens is calculated by quanteda's ntoken function.
See the code below:
library(readtext)
library(quanteda)
library(dplyr)  # provides %>%, group_by() and summarize()
# Prepare sample corpus
set.seed(123)
DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
               docvarsfrom = "filenames",
               docvarnames = c("unit", "context", "year", "language", "party"),
               encoding = "LATIN1")
# Overwrite the year with random values so the demo has several years
rt$year <- sample(2005:2007, nrow(rt), replace = TRUE)
# Calculate tokens per document
rt$tokens <- ntoken(corpus(rt), remove_punct = TRUE)
# Find distribution by year
rt %>% group_by(year) %>% summarize(total_tokens = sum(tokens))
Output:
# A tibble: 3 × 2
   year total_tokens
  <int>        <int>
1  2005         5681
2  2006        26564
3  2007        24119
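If you already have a corpus, as in the question, the same totals can be computed without dplyr. This is a minimal sketch, assuming a quanteda version in which docvars() and ntoken() accept a tokens object (version 2 or later, as far as I know). The three toy documents and their Year values are invented for illustration.

```r
library(quanteda)

# Toy corpus; the texts and the Year values are made up for illustration
corp <- corpus(
  c(d1 = "one two three", d2 = "four five", d3 = "six"),
  docvars = data.frame(Year = c("2004", "2004", "2005"))
)

# Tokenize once, then sum the per-document token counts within each year
toks <- tokens(corp, remove_punct = TRUE)
tokens_per_year <- tapply(ntoken(toks), docvars(toks, "Year"), sum)
print(tokens_per_year)
```

With the real data, replacing corp with SPARA_corp should give one total per value of the Year docvar.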