I am brand new to R (and this site) and am learning it for a very specific topic modeling project. I need to concatenate specific bigrams/trigrams within a body of text for topic modeling and have run into a few roadblocks. I need to do this because the unigram 'community' doesn't carry the weight that the bigrams 'community health' or 'community engagement' would have on the topics.
I have successfully found bigrams/trigrams and identified those that I want to consider in the topic modeling process. I do not want to unite every bigram because phrases like "spend time" or "small group" are not relevant to the project.
Is there a way to either remove the space between, or add a symbol between, the individual words of a bigram directly in the corpus so that the term counts/term frequencies are adjusted? Although stri_replace_all_regex() has successfully eliminated the space between bigrams, when I unnest_tokens() afterwards the terms in the corpus remain mostly unchanged: there were 121 instances of 'community' before using stri_replace_all_regex(), and afterwards there are still 121 instances of 'community' and ALSO 5 instances of 'communityengagement'. This appears to be affecting the LDA output, since the topic models are almost identical before and after uniting the bigrams. Similarly, I have tried txt_recode_ngram() from udpipe with no success.
My code is below. Any help is greatly appreciated!
# libraries used throughout
library(tm)        # VCorpus(), DocumentTermMatrix()
library(tidytext)  # tidy(), unnest_tokens()
library(dplyr)     # %>%, count()
library(stringi)   # stri_replace_all_regex()
library(rio)       # export()

# started with a pre-processed and cleaned corpus
typeof(corpusG)
[1] "character"
# create corpus & matrix objects
VCorpusG <- VCorpus(VectorSource(corpusG))
DTMG <- DocumentTermMatrix(VCorpusG)
tidyG <- tidy(DTMG)
# observe term counts before combining bigrams
beforeGcounts <- tidyG %>%
  count(term, sort = TRUE)
beforeGcounts

beforeGcounts contains 6,962 rows and 2 columns. The top 10 terms are:
| term | n |
|---|---|
| community | 121 |
| include | 121 |
| build | 120 |
| work | 120 |
| park | 119 |
| space | 119 |
| policy | 118 |
| change | 117 |
| city | 117 |
| need | 116 |
# create bigrams (tokenize the raw text: tidy(DTMG) has no text column)
G_bigrams <- tibble(text = corpusG) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
count_G_bigrams <- G_bigrams %>%
count(bigram, sort = TRUE)
export(count_G_bigrams, "Bigrams.G.xlsx")
# combine selected bigram word1+word2 within the corpus;
# each replacement is simply the bigram with its space removed
bigrams_to_join <- c("green space", "community member", "open space", "low income",
  "environmental justice", "park equity", "policy change", "decision make",
  "community engagement", "system change", "community base", "power build",
  "policy advocacy", "park recreation", "community color", "public health",
  "land use", "people color", "access park", "climate change", "community lead",
  "root cause", "community leader", "decision maker", "community need",
  "leadership development", "affordable house", "african american",
  "community development", "quality life", "build power", "community drive",
  "community garden", "civic engagement", "community health", "elect official",
  "non profit", "city council", "green infrastructure", "build community",
  "community resident", "economic development", "air quality", "mental health",
  "engage community", "urban community", "park access", "underserved community",
  "equitable access", "marginalize community", "community build", "heat island",
  "tree canopy")
corpusG <- stri_replace_all_regex(corpusG,
                                  pattern = bigrams_to_join,
                                  replacement = stri_replace_all_fixed(bigrams_to_join, " ", ""),
                                  vectorize_all = FALSE)
# recreate corpus and matrix objects
VCorpusG <- VCorpus(VectorSource(corpusG))
DTMG <- DocumentTermMatrix(VCorpusG)
tidyG <- tidy(DTMG)
# find term counts after combining bigrams
afterGcounts <- tidyG %>%
  count(term, sort = TRUE)
afterGcounts

afterGcounts contains 7,045 rows and 2 columns; the top 10 terms are identical to beforeGcounts.
Although I understand that the joined bigrams would be added as new rows to these counts, shouldn't the number of occurrences of unigram terms like 'community' go down, since some of those individual instances are now part of a different word?
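A minimal, self-contained sketch (using made-up sentences, not the real corpus) shows what should happen to the standalone unigram count when the replacement actually takes effect:

```r
library(stringi)

txt <- c("community engagement builds community",
         "community health matters")

# whole-word occurrences of 'community' before joining bigrams
before <- sum(stri_count_regex(txt, "\\bcommunity\\b"))  # 3

# join two selected bigrams; vectorize_all = FALSE applies every
# pattern/replacement pair to every string
txt2 <- stri_replace_all_regex(
  txt,
  pattern       = c("community engagement", "community health"),
  replacement   = c("communityengagement", "communityhealth"),
  vectorize_all = FALSE)

# the standalone count drops, because two instances are now part of joined tokens
after <- sum(stri_count_regex(txt2, "\\bcommunity\\b"))  # 1
```

If the unigram count does not drop like this on the real data, the rebuilt DTM was most likely created from an object other than the replaced corpusG.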
It appears that Gensim in Python offers what I am trying to achieve. Are there any R packages that can do something similar?
This is how to recode selected n-grams with udpipe and then extract a document-term matrix:
library(udpipe)
library(data.table)
x <- data.frame(doc_id = c("doc1", "doc2"),
text = c("My space is green, do you mean green space or other space, like open space.",
"I'm into park equity for low income persons"))
mwe <- c("green space", "community member", "open space", "low income", "environmental justice", "park equity", "policy change", "decision make", "community engagement", "system change", "community base", "power build", "policy advocacy", "park recreation", "community color", "public health", "land use", "people color", "access park", "climate change", "community lead", "root cause", "community leader", "decision maker", "community need", "leadership development", "affordable house", "african american", "community development", "quality life", "build power", "community drive", "community garden", "civic engagement", "community health", "elect official", "non profit", "city council", "green infrastructure", "build community", "community resident", "economic development", "air quality", "mental health", "engage community", "urban community", "park access", "underserved community", "equitable access", "marginalize community", "community build", "heat island", "tree canopy")
# number of words in each multi-word expression
mwe <- data.frame(text = mwe, ngram = sapply(strsplit(mwe, split = " "), FUN = length))
# annotate (downloads the english-ewt model on first use)
anno <- udpipe(x, "english-ewt")
anno <- setDT(anno)
# recode lemma sequences matching a multi-word expression into a single token
anno[, lemma_ngram := txt_recode_ngram(x = lemma, compound = mwe$text, ngram = mwe$ngram, sep = " "),
     by = list(doc_id, paragraph_id, sentence_id)]
anno[, c("doc_id", "sentence_id", "token", "lemma", "lemma_ngram", "upos")]
#> doc_id sentence_id token lemma lemma_ngram upos
#> 1: doc1 1 My my my PRON
#> 2: doc1 1 space space space NOUN
#> 3: doc1 1 is be be AUX
#> 4: doc1 1 green green green ADJ
#> 5: doc1 1 , , , PUNCT
#> 6: doc1 1 do do do AUX
#> 7: doc1 1 you you you PRON
#> 8: doc1 1 mean mean mean VERB
#> 9: doc1 1 green green green space ADJ
#> 10: doc1 1 space space <NA> NOUN
#> 11: doc1 1 or or or CCONJ
#> 12: doc1 1 other other other ADJ
#> 13: doc1 1 space space space NOUN
#> 14: doc1 1 , , , PUNCT
#> 15: doc1 1 like like like ADP
#> 16: doc1 1 open open open space ADJ
#> 17: doc1 1 space space <NA> NOUN
#> 18: doc1 1 . . . PUNCT
#> 19: doc2 1 I I I PRON
#> 20: doc2 1 'm be be AUX
#> 21: doc2 1 into into into ADP
#> 22: doc2 1 park park park equity NOUN
#> 23: doc2 1 equity equity <NA> NOUN
#> 24: doc2 1 for for for ADP
#> 25: doc2 1 low low low income ADJ
#> 26: doc2 1 income income <NA> NOUN
#> 27: doc2 1 persons person person NOUN
#> doc_id sentence_id token lemma lemma_ngram upos
dtm <- document_term_frequencies(anno[, c("doc_id", "lemma_ngram")])
dtm
#> doc_id term freq
#> 1: doc1 my 1
#> 2: doc1 space 2
#> 3: doc1 be 1
#> 4: doc1 green 1
#> 5: doc1 , 2
#> 6: doc1 do 1
#> 7: doc1 you 1
#> 8: doc1 mean 1
#> 9: doc1 green space 1
#> 10: doc1 or 1
#> 11: doc1 other 1
#> 12: doc1 like 1
#> 13: doc1 open space 1
#> 14: doc1 . 1
#> 15: doc2 I 1
#> 16: doc2 be 1
#> 17: doc2 into 1
#> 18: doc2 park equity 1
#> 19: doc2 for 1
#> 20: doc2 low income 1
#> 21: doc2 person 1
#> doc_id term freq
dtm <- document_term_matrix(dtm)
library(topicmodels)
LDA(dtm, ...)
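For completeness, quanteda also offers this kind of compounding, and is probably the closest R analogue to Gensim's phrase handling. A minimal sketch (with invented example sentences) that joins only the selected bigrams at the token level and converts the result for topicmodels:

```r
library(quanteda)

toks <- tokens(c("community engagement improves community health",
                 "every community needs green space"))

# join only the selected bigrams into single tokens
toks <- tokens_compound(toks,
                        pattern = phrase(c("community engagement",
                                           "community health",
                                           "green space")),
                        concatenator = "")

dfmat <- dfm(toks)
featnames(dfmat)  # includes "communityengagement", "communityhealth", "greenspace"

# convert to a DocumentTermMatrix usable with topicmodels::LDA()
dtm <- convert(dfmat, to = "topicmodels")
```

Because the compounding happens on tokens rather than on the raw text, the standalone unigram counts in the resulting matrix drop automatically.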