
Count POS Tags by column

I am trying to count all Part-Of-Speech tags per row and sum them up.

So far I have reached two outputs:

1) The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/.

2) c("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".") 2)c(“ DT”,“ NN”,“ VBD”,“,”,“ WP”,“ VBP”,“ PRP”,“ VBG”,“ TO”,“ VB”,“。”)

In this particular example the desirable output is:

        DT  NN  VBD  WP  VBP  PRP   VBG   TO   VB
1 doc   1   1    1   1    1    1     1     1    1

But since I want to create this for the whole column in the dataframe, I also want to see 0 values in the columns that correspond to POS tags which were not used in a given sentence.

Example:

1 doc = "The/DT question/NN was/VBD ,/, what/WP are/VBP you/PRP going/VBG to/TO cut/VB ?/" 

2 doc = "Response/NN ?/."

Output:

        DT  NN  VBD  WP  VBP  PRP   VBG   TO   VB
1 doc   1   1    1   1    1    1     1     1    1
2 doc   0   1    0   0    0    0     0     0    0

What I have done so far:

library(stringr)
# Splitting into sentences based on carriage return

s <- unlist(lapply(df$sentence, function(x) { str_split(x, "\n") }))

library(NLP)
library(openNLP)

tagPOS <- function(x, ...) {
  s <- as.String(x)
  # annotate the whole string as one sentence, then tokenize into words
  word_token_annotator <- Maxent_Word_Token_Annotator()
  a2 <- Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- annotate(s, word_token_annotator, a2)
  # add POS tags to the word annotations
  a3 <- annotate(s, Maxent_POS_Tag_Annotator(), a2)
  a3w <- a3[a3$type == "word"]
  # extract the tags and build a "word/TAG" string for the whole sentence
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
  POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  list(POStagged = POStagged, POStags = POStags)
}

result <- lapply(s, tagPOS)
result <- as.data.frame(do.call(rbind, result))

That is how I reached the output described at the beginning.

I have tried to count occurrences like this: occurrences <- as.data.frame(table(unlist(result$POStags)))

But it counts occurrences across the whole dataframe. I need to create new columns in the existing dataframe and count the occurrences in the first column.
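The closest I have come to the shape I want is a rough base-R sketch like the one below (assuming result$POStags really is a list with one character vector of tags per sentence), but I am not sure this is the right approach:

# all tags that occur anywhere in the column, used as fixed factor levels
all_tags <- sort(unique(unlist(result$POStags)))

# one row of counts per sentence; tags not used in a sentence come out as 0
counts <- as.data.frame(t(sapply(result$POStags,
                                 function(tags) table(factor(tags, levels = all_tags)))))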

Can anyone help me please? :(

using tm is relatively pain-free:

dummy data

require(tm)
df <- data.frame(ID = c("doc1", "doc2"),
                 tags = c(paste("NN"),
                          paste("DT", "NN", "VBD", ",", "WP", "VBP", "PRP", "VBG", "TO", "VB", ".")))

make corpus and DocumentTermMatrix:

corpus <- Corpus(VectorSource(df$tags))
#default minimum wordlength is 3, so make sure you change this
dtm <- DocumentTermMatrix(corpus, control= list(wordLengths=c(1,Inf)))

#see what you've done
inspect(dtm)

<<DocumentTermMatrix (documents: 2, terms: 9)>>
Non-/sparse entries: 10/8
Sparsity           : 44%
Maximal term length: 3
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs dt nn prp to vb vbd vbg vbp wp
   1  0  1   0  0  0   0   0   0  0
   2  1  1   1  1  1   1   1   1  1

eta: if you dislike working with a dtm, you can coerce it to a dataframe:

as.data.frame(as.matrix(dtm))

  nn dt prp to vb vbd vbg vbp wp
1  1  0   0  0  0   0   0   0  0
2  1  1   1  1  1   1   1   1  1

eta2: Corpus creates a corpus of the column df$tags only, and VectorSource assumes that each row in the data is one document, so the order of rows in the dataframe df and the order of documents in the DocumentTermMatrix are the same: I can cbind df$ID onto the output dataframe. I do this using dplyr because I think it results in the most readable code (read %>% as "and then"):

require(dplyr)
# add the document IDs from df as a column alongside the tag counts
result <- as.data.frame(as.matrix(dtm)) %>%
          bind_cols(data.frame(ID = df$ID))
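For completeness, here is a rough sketch of how this could plug into the question's own data. It assumes the list returned by tagPOS is still around (stored here under the hypothetical name tagged_list so it does not clash with the result object above); the tag strings are then built from that output instead of being typed by hand:

tagged_list <- lapply(s, tagPOS)                      # as in the question
tag_strings <- vapply(lapply(tagged_list, `[[`, "POStags"),
                      paste, character(1), collapse = " ")

corpus2 <- Corpus(VectorSource(tag_strings))
dtm2    <- DocumentTermMatrix(corpus2, control = list(wordLengths = c(1, Inf)))

# as in the dtm above, punctuation tags drop out and terms are lowercased
counts  <- as.data.frame(as.matrix(dtm2))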
