How to separate POS tags from words

Question

I need to create a sparse document-term matrix (DTM) from text for classification. To prepare the text, I first need to strip (separate) the POS tags from the words. My guess was to do it as shown below. I'm new to R and don't know how to negate a regex (see the NOT! below).

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

My guess at how it could work:

(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS"  "/VVFIN" "./$."  

[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."  

[[3]]
[1] "-/TRUNC" "/APPR"   "/KON"   

[[4]]
[1] "/PIS"  "/ADJD" "./$." 

[[5]]
[1] "/NN"    "!!!/NE"

But I don't know how to negate the expression, like this:

#                          VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn"  "ausläuft"  

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "Durch"   "und"   

[[4]]
[1] "man"  "zügig"

[[5]]
[1] "empfehlung"

Answer 1

One possibility is to eliminate the tags by searching for the POS tags and replacing them with '' (i.e. an empty string):

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

(textlist <- strsplit(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*', '', text), " "))
[[1]]
[1] "wenn"     "ausläuft"

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "-RRB"  "Durch" "und"  

[[4]]
[1] "man"   "zügig"

[[5]]
[1] "empfehlung"

With the friendly help of rawr.
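
As for the literal question of how to "negate" the match: regmatches() has an invert argument that returns the non-matched parts of each string. The following is a minimal sketch that reuses the pattern from the question. The names pos_pattern, pieces, words and words_only are illustrative, and the last filtering step is a heuristic assumption, not part of the original answer.

```r
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.",
          "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.",
          "empfehlung/NN !!!/NE")

# Same pattern as in the question: optional punctuation, a slash, then the tag.
pos_pattern <- '[[:punct:]]*/[[:alpha:][:punct:]]*'

# invert = TRUE returns the pieces of each string that did NOT match.
pieces <- regmatches(text, gregexpr(pos_pattern, text), invert = TRUE)

# The pieces keep their surrounding spaces and include empty strings,
# so trim them and drop the empties.
words <- lapply(pieces, function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

# Optional, heuristic step (an assumption, not part of the original answer):
# drop leftovers that start with punctuation, such as "-RRB".
words_only <- lapply(words, function(w) w[!grepl('^[[:punct:]]', w)])
```

Here words should hold the same tokens as textlist above, including the leftover "-RRB" in the third element. words_only drops that leftover as well, which is closer to the output asked for in the question, but the filter would also remove any real word that begins with a punctuation character.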
