
How do I keep intra-word periods in unigrams? R quanteda

I would like to preserve two-letter acronyms that are separated by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build the unigram frequency table with quanteda, the terminating period is truncated. Here is a small test corpus to illustrate. I have removed periods as sentence separators:

SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS

which I load into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")

Here is the code I use to build my unigram frequency table:

library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ",  toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable

This produces the following:

       ngram frequency
1        SOS         6
2        EOS         6
3        the         4
4         is         3
5          .         3
6        u.s         2
7      crazy         2
8         US         2
9      watch         2
10        of         2
11       t.v         2
12        TV         2
13        in         2
14  probably         2
15      This         1
16     where         1
17       our         1
18  politics         1
19        In         1
20        we         1
21         a         1
22       lot         1
23       aka         1

etc...

I would like to keep the terminal periods on t.v. and u.s., and also eliminate the table entry for . with a frequency of 3.

I also don't understand why the period (.) has a count of 3 in this table while the u.s and t.v unigrams are counted correctly (2 each).

The reason for this behaviour is that quanteda's default word tokeniser uses the ICU-based definition of word boundaries (from the stringi package). Under that definition, u.s. is split into the word u.s followed by a separate period (.) token. This is great if your name is will.i.am but maybe not so great for your purposes. You can easily switch to the white-space tokeniser by passing the argument what = "fasterword" to tokens(), an option that is also available in dfm() through the ... part of the function call.

tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS"      "This"     "is"       "the"      "u.s."     "where"    "our"      "politics" "is"       "crazy"    "EOS" 

You can see that here, u.s. is preserved. In response to your last question, the terminal . had a document frequency of 3 because it appeared as a separate token in three documents, which is the default word tokeniser behaviour when remove_punct = FALSE.
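To make the idea of white-space tokenisation concrete, here is a minimal base R sketch that does not use quanteda at all. It only illustrates the principle of splitting on white space; it is not how "fasterword" is implemented internally, and the name ws.tokens is made up for this example.

```r
# First two documents of the test corpus from the question
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS")

# Split on runs of white space only, so punctuation stays attached to words
ws.tokens <- strsplit(acro.test, "\\s+")

# Inspect the tokens of the first document
ws.tokens[[1]]
```

Because the split pattern matches nothing but white space, "u.s." and "t.v." survive as single tokens with their trailing periods.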

To pass this through to dfm() and then construct your data.frame of the document frequency of the words, the following code works (I've tidied it up a bit for efficiency). Note the comment about the difference between document frequency and term frequency: I've noticed that some users are a bit confused about docfreq().

# I removed the options that were the same as the default 
# note also that stopwords = TRUE is not a valid argument - see remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")

# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
#       not the same as docfreq
# dat.dfm <- sort(dat.dfm)

# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
                        row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
##    ngram frequency
## 1    SOS         6
## 2    EOS         6
## 3    the         4
## 4     is         3
## 5   u.s.         2
## 6  crazy         2
## 7     US         2
## 8  watch         2
## 9     of         2
## 10  t.v.         2

In my view the named vector produced by docfreq() on the dfm is a more efficient way to store the results than your data.frame approach, but you may wish to add other variables.
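As a rough sketch of the document-frequency versus term-frequency distinction, the snippet below compares docfreq() with plain column sums on the same dfm. It assumes a newer quanteda release (version 3 or later), where the usual pattern is to call tokens() first and pass the result to dfm() rather than passing the character vector directly; the names toks, doc.freq and term.freq are invented for illustration.

```r
library(quanteda)

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS",
               "SOS In the US we watch a lot of t.v. aka TV EOS",
               "SOS TV is an important part of life in the US EOS",
               "SOS folks outside the u.s. probably don't watch so much t.v. EOS",
               "SOS living in other countries is probably not any less crazy EOS",
               "SOS i enjoy my sanity when it comes to visit EOS")

# Tokenise on white space, then build the dfm from the tokens object
# (assumption: quanteda >= 3, where dfm() expects tokens as input)
toks <- tokens(acro.test, what = "fasterword")
dat.dfm <- dfm(toks, tolower = FALSE)

doc.freq  <- docfreq(dat.dfm)   # number of documents containing each feature
term.freq <- colSums(dat.dfm)   # total occurrences across all documents

doc.freq[c("u.s.", "t.v.", "is")]
term.freq[c("u.s.", "t.v.", "is")]
```

The word "is" shows the difference: it occurs twice in the first document, so its term frequency should be 4 while its document frequency stays at 3. For u.s. and t.v. the two measures agree at 2.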
