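Creating an R data.frame column based on the difference between two character columns
I have a data.frame, df, with two columns: one holds the song title, and the other holds the title and artist combined. I want to create a separate artist field. The first three rows are shown here: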
title titleArtist
I'll Never Smile Again I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS
Imagination Imagination GLENN MILLER & HIS ORCHESTRA / RAY EBERLE
The Breeze And I The Breeze And I JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY
This code works without any problem on this data:
library(stringr)
library(dplyr)
df %>%
head(3) %>%
mutate(artist=str_to_title(str_trim(str_replace(titleArtist,title,"")))) %>%
select(artist,title)
artist title
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again
2 Jimmy Dorsey & His Orchestra / Bob Eberly The Breeze And I
3 Glenn Miller & His Orchestra / Ray Eberle Imagination
However, when I apply it to several thousand rows, I get this error:
Error: Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
#or for part of the mutation
df$artist <-str_replace(df$titleArtist,df$title,"")
Error in stri_replace_first_regex(string, pattern, replacement, opts_regex = attr(pattern, :
Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
I removed all parentheses from the columns, and the code appeared to run for a while before I got this error:
Error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)
Is another special character causing the problem, or is it something else?
TIA
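Your general problem is that str_replace treats the values passed as the pattern (here, your title values) as regular expressions, so many special characters besides parentheses can trigger errors. The stringi library, which stringr wraps and simplifies, allows finer-grained control, including treating arguments as fixed strings rather than regular expressions. I don't have your original data, but this works when I throw in some characters that cause the errors: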
library(dplyr)
library(stringi)
# Titles deliberately contain regex metacharacters to reproduce the errors
df <- tibble(
  title = c("I'll Never Smile Again (", "Imagination.*", "The Breeze And I(?>="),
  titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
                  "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
                  "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"))
df %>%
  # Remove the title as a fixed string, then trim and convert to title case
  mutate(artist = stri_trans_totitle(stri_trim(stri_replace_first_fixed(titleArtist, title, "")))) %>%
  select(artist, title)
Result:
Source: local data frame [3 x 2]
artist title
(chr) (chr)
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again (
2 Glenn Miller & His Orchestra / Ray Eberle Imagination.*
3 Jimmy Dorsey & His Orchestra / Bob Eberly The Breeze And I(?>=
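If you would rather stay within stringr, a similar effect can probably be had by wrapping the pattern in stringr's fixed() modifier, which asks str_replace to match the text literally. The following is only a minimal sketch: the data frame name songs and its two rows are made up for illustration and are not the asker's data.

```r
library(stringr)

# Hypothetical sample data; the titles contain regex metacharacters on purpose
songs <- data.frame(
  title = c("I'll Never Smile Again (", "Imagination.*"),
  titleArtist = c(
    "I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA",
    "Imagination.* GLENN MILLER & HIS ORCHESTRA"
  ),
  stringsAsFactors = FALSE
)

# fixed() makes str_replace() compare each title literally, not as a regex
without_title <- str_replace(songs$titleArtist, fixed(songs$title), "")

# Trim the leftover whitespace and convert to title case, as in the question
songs$artist <- str_to_title(str_trim(without_title))
```

Because fixed() does a plain literal comparison, characters such as (, . and * in a title no longer need escaping.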
df <- data.frame(ID=11:13, T_A=c('a/b','b/c','x/y')) # T_A Title/Artist
ID T_A
1 11 a/b
2 12 b/c
3 13 x/y
# Title Artist are separated by /
> within(df, T_A<-data.frame(do.call('rbind', strsplit(as.character(T_A), '/', fixed=TRUE))))
ID T_A.X1 T_A.X2
1 11 a b
2 12 b c
3 13 x y
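For the data in the question, where the title appears at the very start of titleArtist instead of being separated by a delimiter, a base R variant without any regular expressions might look like the sketch below. It assumes R 3.3.0 or later (for startsWith); the data frame songs is a made-up sample, and rows where the title is not a prefix are set to NA here as one possible choice.

```r
# Hypothetical sample data; the titles contain regex metacharacters on purpose
songs <- data.frame(
  title = c("Imagination.*", "The Breeze And I(?>="),
  titleArtist = c(
    "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
    "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"
  ),
  stringsAsFactors = FALSE
)

# startsWith() compares literally, so no escaping of special characters is needed
is_prefix <- startsWith(songs$titleArtist, songs$title)

# Drop the first nchar(title) characters, then trim the remaining whitespace
remainder <- trimws(substring(songs$titleArtist, nchar(songs$title) + 1))

# Keep the remainder only where the title really is a prefix (assumption: NA otherwise)
songs$artist <- ifelse(is_prefix, remainder, NA_character_)
```

The artist names stay in upper case in this sketch; a case conversion step can be added afterwards if needed.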