How to separate POS from words

I need to create a sparse text matrix (DTM) for classification. To prepare the text, I first need to remove (separate) the POS tags from the text. My guess was to do it as shown below. I'm new to R and don't know how to negate a regex (see the NOT! below).

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

My guess at how it could work:

(POSs <- regmatches(text, gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text)))
[[1]]
[1] "/KOUS"  "/VVFIN" "./$."  

[[2]]
[1] "/VVFIN" "/PTKVZ" ";/$."  

[[3]]
[1] "-/TRUNC" "/APPR"   "/KON"   

[[4]]
[1] "/PIS"  "/ADJD" "./$." 

[[5]]
[1] "/NN"    "!!!/NE"

But I don't know how to negate the expression, something like:

#                          VVV
(texts <- regmatches(text, NOT!(gregexpr('[[:punct:]]*/[[:alpha:][:punct:]]*', text))))
[[1]]
[1] "wenn"  "ausläuft"  

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "Durch"   "und"   

[[4]]
[1] "man"  "zügig"

[[5]]
[1] "empfehlung"

One possibility is to eliminate the tags by searching for the POS tags and replacing them with '' (i.e. an empty string):

text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.", "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.", "empfehlung/NN !!!/NE")

# strip the tags with gsub(), then split on spaces (paste() on a single vector is a no-op, so it is dropped)
(textlist <- strsplit(gsub('[[:punct:]]*/[[:alpha:][:punct:]]*', '', text), " "))
[[1]]
[1] "wenn"     "ausläuft"

[[2]]
[1] "Kommt" "vor"  

[[3]]
[1] "-RRB"  "Durch" "und"  

[[4]]
[1] "man"   "zügig"

[[5]]
[1] "empfehlung"
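
The "negation" asked about in the question can probably also be done directly with the `invert = TRUE` argument of `regmatches()`, which returns the pieces between the matches instead of the matches themselves. The following is a minimal sketch under that assumption; the names `pattern`, `pieces` and `words` are only illustrative and not part of the original post. The pieces keep their surrounding spaces and include empty strings, so they are trimmed and filtered afterwards.

```r
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.",
          "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.",
          "empfehlung/NN !!!/NE")

# same pattern as in the question
pattern <- '[[:punct:]]*/[[:alpha:][:punct:]]*'
m <- gregexpr(pattern, text)

# invert = TRUE returns the text between the matches (the non-tag parts)
pieces <- regmatches(text, m, invert = TRUE)

# trim whitespace and drop the empty leftovers
words <- lapply(pieces, function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})
words
```

Like the `gsub()` version, this should still leave `"-RRB"` in the third element, because the pattern only removes the trailing `-` together with the tag.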

With the friendly help of rawr.
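
If the goal is exactly the output sketched in the question (without `-RRB-` and without punctuation tokens), one alternative is to split each `word/TAG` token at its last slash and drop every token whose word part contains punctuation. This is only a sketch under that assumption; `split_token`, `parsed`, `texts` and `tags` are hypothetical names introduced here for illustration.

```r
text <- c("wenn/KOUS ausläuft/VVFIN ./$.", "Kommt/VVFIN vor/PTKVZ ;/$.",
          "-RRB-/TRUNC Durch/APPR und/KON", "man/PIS zügig/ADJD ./$.",
          "empfehlung/NN !!!/NE")

# one character vector of "word/TAG" tokens per input string
tokens <- strsplit(text, " ", fixed = TRUE)

# hypothetical helper: split tokens at the last slash and
# keep only tokens whose word part contains no punctuation
split_token <- function(tok) {
  words <- sub("/[^/]*$", "", tok)  # everything before the last slash
  tags  <- sub("^.*/", "", tok)     # everything after the last slash
  keep  <- !grepl("[[:punct:]]", words)
  list(words = words[keep], tags = tags[keep])
}

parsed <- lapply(tokens, split_token)
texts  <- lapply(parsed, `[[`, "words")
tags   <- lapply(parsed, `[[`, "tags")
texts
```

Keeping the tags next to the words makes it possible to filter by POS later, for example when building the DTM. Note that this filter would also drop hyphenated words, which may or may not be wanted.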

