Extracting a word from a data.frame column

Question

My data contains a column like this:

df <- data.frame(status = c("GET/sfuksd1567","GET/sjsh787","POST/hsfhuks","GET/sfukfiezd17","POST/fshks"), stringsAsFactors = FALSE)

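I want to automatically create another column that acts as an indicator for the status variable and contains only "GET" or "POST", for example df$ind = c("GET","GET","POST","GET","POST").

I have tried the substr function, but without success.

Original data: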

> df
           status
1  GET/sfuksd1567
2     GET/sjsh787
3    POST/hsfhuks
4 GET/sfukfiezd17
5      POST/fshks

Expected result:

> df
           status  ind
1  GET/sfuksd1567  GET
2     GET/sjsh787  GET
3    POST/hsfhuks POST
4 GET/sfukfiezd17  GET
5      POST/fshks POST

Answer 1

You can use a regular expression to simply remove the forward slash and everything after it:

df$ind <- sub("/.*", "", df$status)
df
#            status  ind
# 1  GET/sfuksd1567  GET
# 2     GET/sjsh787  GET
# 3    POST/hsfhuks POST
# 4 GET/sfukfiezd17  GET
# 5      POST/fshks POST
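Since the question mentions substr, here is a minimal sketch of how it can be made to work. The prefix length varies (3 characters for GET, 4 for POST), so a fixed stop position cannot cover both; the sketch locates the first slash with regexpr and cuts just before it. The variable names status, slash_pos and ind are illustrative and not part of the original answer.

```r
# Stand-alone copy of the example data from the question
status <- c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
            "GET/sfukfiezd17", "POST/fshks")

# Position of the first "/" in each string
# (fixed = TRUE matches the literal character, no regex involved)
slash_pos <- as.integer(regexpr("/", status, fixed = TRUE))

# Keep everything before the slash; the stop position differs per element.
# Note: for a string without any slash, regexpr() returns -1 and
# substr() then yields an empty string.
ind <- substr(status, 1, slash_pos - 1)
ind
```

Compared with sub, this variant returns "" rather than the unchanged string when no slash is present, which may or may not be what you want.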

Or, if you would rather not use regular expressions, you can try

library(tidyr)
separate(df, "status", c("ind", "status"))

Or

library(data.table) ## V1.9.6+
setDT(df)[, tstrsplit(status, "/")]

Or

read.table(text = df$status, sep = "/")

The last three options simply split the status column into two separate columns.
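If you want to keep status intact and only add ind, one of the splitting approaches can be adapted by taking just the first resulting column. The following is a sketch using the base R read.table variant; the intermediate name parts is illustrative.

```r
# Example data from the question
df <- data.frame(status = c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
                            "GET/sfukfiezd17", "POST/fshks"),
                 stringsAsFactors = FALSE)

# Split every string on "/" into a temporary two-column data.frame
parts <- read.table(text = df$status, sep = "/", stringsAsFactors = FALSE)

# Keep the original column and add only the first piece as the indicator
df$ind <- parts[[1]]
df
```

Be aware that read.table treats quote characters and "#" specially by default, so for real request paths it may be safer to pass quote = "" and comment.char = "".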

Answer 2

We have:

df <- data.frame(status = c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks", "GET/sfukfiezd17", "POST/fshks"), stringsAsFactors = FALSE)

You can do:

df$ind<-sapply(1:nrow(df),function(x){strsplit(df$status,'/')[[x]][1]})

Or

df$ind<-sapply(strsplit(df$status,'/'),`[[`,1)

Both return:

df
           status  ind
1  GET/sfuksd1567  GET
2     GET/sjsh787  GET
3    POST/hsfhuks POST
4 GET/sfukfiezd17  GET
5      POST/fshks POST
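As a variation on the strsplit approach, here is a sketch that uses vapply instead of sapply, so the result is guaranteed to be a character vector with one element per row. It also passes fixed = TRUE so that the slash is treated as a literal string rather than a regular expression. The name pieces is illustrative.

```r
# Example data from the question
df <- data.frame(status = c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
                            "GET/sfukfiezd17", "POST/fshks"),
                 stringsAsFactors = FALSE)

# Split each string on the literal "/" (no regex interpretation)
pieces <- strsplit(df$status, "/", fixed = TRUE)

# Take the first piece of every split; vapply() checks that each result
# is a single character value. p[1] gives NA for an empty split instead
# of raising an error.
df$ind <- vapply(pieces, function(p) p[1], character(1))
df
```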

Benchmark:

library(microbenchmark)
microbenchmark(david = sub("/.*", "", df$status), etienne = sapply(strsplit(df$status, '/'), `[[`, 1))

Unit: microseconds
    expr    min      lq     mean  median     uq     max neval cld
   david 25.198 25.8985 27.64456 26.5980 27.298 116.189   100  a 
 etienne 62.294 63.3440 65.13979 63.8695 65.094 128.088   100   b

Answer 3

We can use stri_extract_first_words from library(stringi)

library(stringi)
stri_extract_first_words(df$status)
#[1] "GET"  "GET"  "POST" "GET"  "POST"

Another option is extract from tidyr

library(tidyr)
extract(df, status, into='ind', '([^/]+)/.*', remove=FALSE)

Benchmarks

With stri_extract_first_words included, the benchmarks are:

david <- function() sub('/.*', '', df$status)
etienne <- function() sapply(strsplit(df$status,'/'),`[[`,1)
akrun <- function()stri_extract_first_words(df$status)
df <-  df[sample(1:nrow(df), 1e6, replace=TRUE),, drop=FALSE]
library(microbenchmark)
microbenchmark(david(), etienne(), akrun(), unit='relative', times=20L)
#Unit: relative
#      expr      min       lq     mean   median       uq      max neval
#   david() 1.826192 1.824263 1.781562 1.814156 1.788085 1.699008    20
# etienne() 4.935629 5.159218 5.136180 5.198875 5.137107 5.930806    20
#   akrun() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20
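Before comparing timings, it can be worth confirming that the candidates actually return the same values. The sketch below does this on the example data with base R only; the stringi comparison is guarded because that package may not be installed. The names via_sub, via_strsplit and same_stringi are illustrative.

```r
# Example data from the question
status <- c("GET/sfuksd1567", "GET/sjsh787", "POST/hsfhuks",
            "GET/sfukfiezd17", "POST/fshks")

# The two base R approaches from the answers above
via_sub      <- sub("/.*", "", status)
via_strsplit <- sapply(strsplit(status, "/"), `[[`, 1)

# Stop early if the two approaches disagree
stopifnot(identical(via_sub, via_strsplit))

# Compare against stringi only when the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  same_stringi <- identical(via_sub,
                            stringi::stri_extract_first_words(status))
  print(same_stringi)
}
```

The stringi variant relies on word-boundary detection rather than on the slash itself, so the check is best repeated on your own data.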

Note: there are other options in @David Arenburg's post. I am guessing the sub version would be faster, but I could be wrong.

3 solutions

Solution 1
10 Accepted 2015-11-25 14:19:59

Solution 2
3 2015-11-25 14:15:28

Solution 3
2 2015-11-25 15:04:21
