
Is there a way to concatenate specific bigrams within a corpus for topic modeling in R?

I am brand new to R (and this site) and am learning it for a very specific topic modeling project. I need to concatenate specific bigrams/trigrams within a body of text for topic modeling and have run into a few roadblocks. I need to do this because the unigram 'community' doesn't carry the weight that the bigrams 'community health' or 'community engagement' would have on the topics.

I have successfully found bigrams/trigrams and identified those that I want to consider in the topic modeling process. I do not want to unite every bigram because phrases like "spend time" or "small group" are not relevant to the project.

Is there a way to either remove the space between, or add a symbol between, the individual words of a bigram directly in the corpus so that the term counts/term frequencies are adjusted? Although stri_replace_all_regex() successfully eliminates the space inside the bigrams, when I run unnest_tokens() the terms within the corpus remain mostly unchanged. There were 121 instances of 'community' before using stri_replace_all_regex(), and after using it there are still 121 instances of 'community' and ALSO 5 instances of 'communityengagement'. This appears to carry through to the LDA output, as the topic models remain almost unchanged before and after uniting the bigrams. Similarly, I have tried txt_recode_ngram() from udpipe with no success.

My code is below. Any help is greatly appreciated!

# started with a pre-processed and cleaned corpus
typeof(corpusG)
[1] "character"

# create corpus & matrix objects
VCorpusG <- VCorpus(VectorSource(corpusG))
DTMG <- DocumentTermMatrix(VCorpusG)
tidyG <- tidy(DTMG)
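
For orientation (an added note, not in the original post), tidy() on a DocumentTermMatrix gives one row per non-zero document-term pair:

head(tidyG)   # expect columns: document, term, count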

# observe term counts before combining bigrams
beforeGcounts <- tidyG %>%
  count(term, sort = TRUE)

beforeGcounts contains 6,962 entries and 2 columns. The top 10 terms are as follows:

term           n
community    121
include      121
build        120
work         120
park         119
space        119
policy       118
change       117
city         117
need         116
# create bigrams
G_bigrams <- tidyG %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
count_G_bigrams <- G_bigrams %>%
  count(bigram, sort = TRUE)
export(count_G_bigrams, "Bigrams.G.xlsx")   # export() presumably from the rio package

# combine selected bigram word1+word2 within corpus
# (each replacement is simply the phrase with its internal space removed)
phrases <- c("green space", "community member", "open space", "low income",
             "environmental justice", "park equity", "policy change", "decision make",
             "community engagement", "system change", "community base", "power build",
             "policy advocacy", "park recreation", "community color", "public health",
             "land use", "people color", "access park", "climate change",
             "community lead", "root cause", "community leader", "decision maker",
             "community need", "leadership development", "affordable house",
             "african american", "community development", "quality life", "build power",
             "community drive", "community garden", "civic engagement", "community health",
             "elect official", "non profit", "city council", "green infrastructure",
             "build community", "community resident", "economic development",
             "air quality", "mental health", "engage community", "urban community",
             "park access", "underserved community", "equitable access",
             "marginalize community", "community build", "heat island", "tree canopy")
corpusG <- stri_replace_all_regex(corpusG,
                                  pattern = phrases,
                                  replacement = gsub(" ", "", phrases),
                                  vectorize_all = FALSE)
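
As an added sanity check (not part of the original post), the raw string replacement can be verified directly on the character vector before rebuilding the DTM; 'community engagement' is just one example phrase:

sum(stri_count_fixed(corpusG, "community engagement"))   # expected 0 after the replacement
sum(stri_count_fixed(corpusG, "communityengagement"))    # number of merged occurrences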

# recreate corpus and matrix objects
VCorpusG <- VCorpus(VectorSource(corpusG))
DTMG <- DocumentTermMatrix(VCorpusG)
tidyG <- tidy(DTMG)

# find term counts after combining bigrams
afterGcounts <- tidyG %>%
  count(term, sort = TRUE)

afterGcounts contains 7,045 entries and 2 columns. The top 10 terms are identical to beforeGcounts.

Although I understand that the bigrams would be added as individual rows to these counts, shouldn't the number of occurrences for unigram terms like 'community' go down, since some of those individual instances have now been merged into new terms?

It appears that Gensim in Python offers what I am trying to achieve. Are there any R packages that can do something similar?

This is how to recode n-grams using udpipe and extract a document-term matrix:

library(udpipe)
library(data.table)
x <- data.frame(doc_id = c("doc1", "doc2"),
                text = c("My space is green, do you mean green space or other space, like open space.", 
                         "I'm into park equity for low income persons"))


mwe <- c("green space", "community member", "open space", "low income", "environmental justice", "park equity", "policy change", "decision make", "community engagement", "system change", "community base", "power build", "policy advocacy", "park recreation", "community color", "public health", "land use", "people color", "access park", "climate change", "community lead", "root cause", "community leader", "decision maker", "community need", "leadership development", "affordable house", "african american", "community development", "quality life", "build power", "community drive", "community garden", "civic engagement", "community health", "elect official", "non profit", "city council", "green infrastructure", "build community", "community resident", "economic development", "air quality", "mental health", "engage community", "urban community", "park access", "underserved community", "equitable access", "marginalize community", "community build", "heat island", "tree canopy")
mwe <- data.frame(text = mwe, ngram = sapply(strsplit(mwe, split = " "), FUN = length))
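# mwe now holds each phrase plus its length in words, as required by txt_recode_ngram()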

# annotate: tokenise, lemmatise and POS-tag the documents with the english-ewt model
anno <- udpipe(x, "english-ewt")
anno <- setDT(anno)
# within each sentence, recode consecutive lemmas that form one of the listed phrases into a single term
anno <- anno[, lemma_ngram := txt_recode_ngram(x = lemma, compound = mwe$text, ngram = mwe$ngram, sep = " "), by = list(doc_id, paragraph_id, sentence_id)]
anno[, c("doc_id", "sentence_id", "token", "lemma", "lemma_ngram", "upos")]
#>     doc_id sentence_id   token  lemma lemma_ngram  upos
#>  1:   doc1           1      My     my          my  PRON
#>  2:   doc1           1   space  space       space  NOUN
#>  3:   doc1           1      is     be          be   AUX
#>  4:   doc1           1   green  green       green   ADJ
#>  5:   doc1           1       ,      ,           , PUNCT
#>  6:   doc1           1      do     do          do   AUX
#>  7:   doc1           1     you    you         you  PRON
#>  8:   doc1           1    mean   mean        mean  VERB
#>  9:   doc1           1   green  green green space   ADJ
#> 10:   doc1           1   space  space        <NA>  NOUN
#> 11:   doc1           1      or     or          or CCONJ
#> 12:   doc1           1   other  other       other   ADJ
#> 13:   doc1           1   space  space       space  NOUN
#> 14:   doc1           1       ,      ,           , PUNCT
#> 15:   doc1           1    like   like        like   ADP
#> 16:   doc1           1    open   open  open space   ADJ
#> 17:   doc1           1   space  space        <NA>  NOUN
#> 18:   doc1           1       .      .           . PUNCT
#> 19:   doc2           1       I      I           I  PRON
#> 20:   doc2           1      'm     be          be   AUX
#> 21:   doc2           1    into   into        into   ADP
#> 22:   doc2           1    park   park park equity  NOUN
#> 23:   doc2           1  equity equity        <NA>  NOUN
#> 24:   doc2           1     for    for         for   ADP
#> 25:   doc2           1     low    low  low income   ADJ
#> 26:   doc2           1  income income        <NA>  NOUN
#> 27:   doc2           1 persons person      person  NOUN
#>     doc_id sentence_id   token  lemma lemma_ngram  upos
# aggregate term frequencies per document, using the recoded lemmas as terms
dtm <- document_term_frequencies(anno[, c("doc_id", "lemma_ngram")])
dtm
#>     doc_id        term freq
#>  1:   doc1          my    1
#>  2:   doc1       space    2
#>  3:   doc1          be    1
#>  4:   doc1       green    1
#>  5:   doc1           ,    2
#>  6:   doc1          do    1
#>  7:   doc1         you    1
#>  8:   doc1        mean    1
#>  9:   doc1 green space    1
#> 10:   doc1          or    1
#> 11:   doc1       other    1
#> 12:   doc1        like    1
#> 13:   doc1  open space    1
#> 14:   doc1           .    1
#> 15:   doc2           I    1
#> 16:   doc2          be    1
#> 17:   doc2        into    1
#> 18:   doc2 park equity    1
#> 19:   doc2         for    1
#> 20:   doc2  low income    1
#> 21:   doc2      person    1
#>     doc_id        term freq
# convert to a sparse document-term matrix and fit a topic model
dtm <- document_term_matrix(dtm)
library(topicmodels)
LDA(dtm, ...)
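
For completeness, a minimal sketch of the final call (an addition to the answer; k = 5 and the seed are illustrative values only):

lda_model <- LDA(dtm, k = 5, control = list(seed = 1234))
terms(lda_model, 10)   # top terms per topic; merged phrases such as "green space" now appear as single terms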
