Computing frequency of words for each document in a corpus/DFM for R

I want to replicate, in R, a measure of common words from a paper.

They describe their procedure as follows: "To construct Common words,..., we first determine the relative frequency of all words occurring in all documents. We then calculate Common words as the average of this proportion for every word occurring in a given document. The higher the value of common words, the more ordinary is the documents's language and thus the more readable it should be." (Loughran & McDonald 2014)

Can anybody help me with this? I work with corpus objects to analyse the text documents in R.

I have already computed the relative frequency of all words occurring in all documents as follows:

dfm_Notes_Summary <- dfm(tokens_Notes_Summary)
Summary_FreqStats_Notes <- textstat_frequency(dfm_Notes_Summary)

Summary_FreqStats_Notes$RelativeFreq <- Summary_FreqStats_Notes$frequency/sum(Summary_FreqStats_Notes$frequency)

-> I basically transformed the tokens object (tokens_Notes_Summary) into a dfm object (dfm_Notes_Summary) and computed the relative frequency of each word across all documents.

Now I struggle to calculate the average of this proportion for every word occurring in a given document.

I reread Loughran and McDonald (2014) to work out what they meant, since I could not find code for it, and I think the measure is based on the average of the document frequencies of a document's terms. The code will probably make this clearer:

library("quanteda")
#> Package version: 3.2.3
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 10 of 10 threads used.
#> See https://quanteda.io for tutorials and examples.

dfmat <- data_corpus_inaugural |>
    head(5) |>
    tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
    dfm()

readability_commonwords <- function(x) {
    # document frequency of each feature, scaled by the number of features
    relative_docfreq <- docfreq(x) / nfeat(x)
    # weight each document's feature counts by that value and sum over features
    result <- x %*% relative_docfreq
    # return as a named vector
    structure(result[, 1], names = rownames(result))
}

readability_commonwords(dfmat)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
#>       2.6090768       0.2738525       4.2026818       3.0928314       3.8256833
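
For comparison, here is a minimal base R sketch of a more literal reading of the quoted description: take the corpus-wide relative frequency of each word (the RelativeFreq column in the question) and average it over the words in each document. The toy matrix and the function name common_words_sketch are made up for illustration, and whether the paper averages over every token or only over distinct words is an assumption on my part, so the sketch offers both.

```r
# Toy document-feature count matrix (hypothetical data): 2 documents, 3 words
counts <- matrix(
    c(2, 1, 0,
      0, 1, 3),
    nrow = 2, byrow = TRUE,
    dimnames = list(c("doc1", "doc2"), c("the", "loss", "hedge"))
)

# Hypothetical helper, not part of quanteda
common_words_sketch <- function(m, by_type = FALSE) {
    # corpus-wide relative frequency of each word (sums to 1)
    rel_freq <- colSums(m) / sum(m)
    if (by_type) {
        # assumption: count each distinct word in a document only once
        m <- (m > 0) * 1
    }
    # per document: mean of rel_freq over the words it contains
    drop(m %*% rel_freq) / rowSums(m)
}

common_words_sketch(counts)                  # token-weighted average
common_words_sketch(counts, by_type = TRUE)  # average over distinct words
```

Unlike the function above, this divides by each document's length, so long and short documents end up on the same scale. A document with no words would give NaN. To try it on a dfm, converting first with as.matrix() should work, though I have not run it on the inaugural corpus.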

For the full details, though, you should ask the authors.

Created on 2022-11-30 with reprex v2.0.2
