R: How to split a specific column based on a symbol?

Question

I am trying to do POS tagging as part of a text mining process.

This is the format of my POS tagging result.

  Word & POS Tag
1 cmp/NN conditioner/NN
2 contains/VBZ the/DT grinding/VBG
3 diamond/NN

But the words are mixed together with the POS tags. I would prefer a format like this:

  Word                     POS Tag
1 cmp conditioner          NN-NN
2 contains the grinding    VBZ-DT-VBG
3 diamond                  NN

Is there any way to separate the words and the POS tags in R?

Answer 1

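To get the words, replace each / and the tag after it with an empty string. To get the tags, replace each word and the / after it with an empty string, then turn the remaining spaces into hyphens. No packages are used.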

cbind(gsub("/\\w+", "", L), gsub(" ", "-", gsub("\\w+/", "", L)))

giving:

     [,1]                    [,2]        
[1,] "cmp conditioner"       "NN-NN"     
[2,] "contains the grinding" "VBZ-DT-VBG"
[3,] "diamond"               "NN"

Note: the input, in reproducible form, is assumed to be:

L <- c("cmp/NN conditioner/NN", "contains/VBZ the/DT grinding/VBG", "diamond/NN")
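The one-liner above can be unpacked into named steps. The following is a minimal sketch of the same gsub approach, collecting the result in a data frame; the column names `Word` and `POS_Tag` are chosen here for illustration and are not part of the original answer.

```r
# Input vector, as assumed in the answer
L <- c("cmp/NN conditioner/NN", "contains/VBZ the/DT grinding/VBG", "diamond/NN")

# Remove every "/TAG" suffix so that only the words remain
words <- gsub("/\\w+", "", L)

# Remove every "word/" prefix so that only the tags remain,
# then join the tags of each row with "-"
tags <- gsub(" ", "-", gsub("\\w+/", "", L))

# Illustrative column names (an assumption, not from the answer)
result <- data.frame(Word = words, POS_Tag = tags, stringsAsFactors = FALSE)
print(result)
```

Note that `\\w+` only matches letters, digits and underscores. If the tagger emits tags or tokens containing punctuation (for example `PRP$`), a wider pattern such as `[^ /]+` would probably be needed in place of `\\w+`.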

Answer 2

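After reading the dataset with readLines, we can use str_extract_all from stringr to extract the substrings. The lookahead (?=[/]) keeps the words that precede a slash, and the lookbehind (?<=[/]) keeps the tags that follow one.

library(stringr)
v1 <- sapply(str_extract_all(lines[-1], "\\w+(?=[/])"), paste, collapse=" ")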
v2 <- sapply(str_extract_all(lines[-1], "(?<=[/])\\w+"), paste, collapse="-")
nm1 <- trimws(scan(text=lines[1], what = "", sep="&", quiet =TRUE))
d1 <- setNames(data.frame(v1, v2, stringsAsFactors= FALSE), nm1)
d1
#                   Word    POS Tag
#1       cmp conditioner      NN-NN
#2 contains the grinding VBZ-DT-VBG
#3               diamond         NN

Note: stringr is part of the tidyverse, whose packages are compact and easy to use.
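If a dependency on stringr is not wanted, the same lookaround patterns can be used in base R through regmatches() and gregexpr() with perl = TRUE. The following is a minimal sketch; the `lines` vector is typed in by hand here, on the assumption that file.txt holds the table from the question.

```r
# Assumed contents of file.txt: a header line followed by numbered rows
lines <- c("  Word & POS Tag",
           "1 cmp/NN conditioner/NN",
           "2 contains/VBZ the/DT grinding/VBG",
           "3 diamond/NN")
body <- lines[-1]  # drop the header line

# perl = TRUE is required for lookahead/lookbehind in base R regex
words <- regmatches(body, gregexpr("\\w+(?=/)", body, perl = TRUE))
tags  <- regmatches(body, gregexpr("(?<=/)\\w+", body, perl = TRUE))

v1 <- vapply(words, paste, character(1), collapse = " ")
v2 <- vapply(tags,  paste, character(1), collapse = "-")

# Recover the column names from the header by splitting on "&"
nm1 <- trimws(strsplit(lines[1], "&", fixed = TRUE)[[1]])
d1 <- setNames(data.frame(v1, v2, stringsAsFactors = FALSE), nm1)
print(d1)
```

The row number at the start of each line is never followed by a slash, so the lookahead skips it without any extra handling.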

Another option is strsplit:

 do.call(rbind, lapply(strsplit(lines[-1], "[/ ]"), function(x) {
       x1 <- x[-1]                                  # drop the leading row number
       c(paste(x1[c(TRUE, FALSE)], collapse=" "),   # odd positions: words
         paste(x1[c(FALSE, TRUE)], collapse="-"))   # even positions: tags
 }))
 #       [,1]                    [,2]        
 #[1,] "cmp conditioner"       "NN-NN"     
 #[2,] "contains the grinding" "VBZ-DT-VBG"
 #[3,] "diamond"               "NN"

Note: this uses no packages at all.
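The strsplit version relies on R recycling a short logical index across the whole vector. A small sketch on a single row, typed in by hand as an assumed example of one line of the file, shows what each step produces:

```r
# One data row, as it is assumed to appear in the file
row <- "2 contains/VBZ the/DT grinding/VBG"

# Splitting on "/" or " " yields the row number, then alternating word/tag
x <- strsplit(row, "[/ ]")[[1]]

# Drop the row number so that the vector starts with a word
x1 <- x[-1]

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, ... and keeps positions 1, 3, 5
words <- x1[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6
tags <- x1[c(FALSE, TRUE)]

print(paste(words, collapse = " "))
print(paste(tags, collapse = "-"))
```

This depends on every token having exactly one slash and on the line starting with a row number; a token without a tag would shift the alternation.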

Data

lines <- readLines("file.txt")
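The contents of file.txt are not shown in the answer. Assuming it holds the table from the question verbatim (a header line followed by numbered rows), a throwaway copy can be created as sketched below so that `lines` can be reproduced without the original file. The temporary file path is generated by tempfile() and stands in for "file.txt".

```r
# Write the assumed file contents to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("  Word & POS Tag",
             "1 cmp/NN conditioner/NN",
             "2 contains/VBZ the/DT grinding/VBG",
             "3 diamond/NN"), tmp)

# Read it back the same way the answer does
lines <- readLines(tmp)
print(lines)
```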

2 answers

Answer 1
2 Accepted 2017-08-11 04:13:02

Answer 2
1 2017-08-11 03:54:32
