
R: find position of specific character in data frame column

I have been trying to replicate a move that I've used a lot in SQL but can't seem to find an equivalent in R. I've searched high and low on the list and in other sources but can't find a way to do what I want.

I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:

# creates a list with one character vector of name parts per record
namesplit <- strsplit(crm$DEF_NAME, ',')

# takes the first/left part of each split, before the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))

# takes the last/right part of each split, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))

But some of the names have "." instead of "," separating the parts. For example, "Doe. John". In other cases there are two ".", i.e. "Doe. John T.". Here's an example:

> test$LAST_NAME
 [1] "DEWITT. B"             "TAOY. PETER"           "ZULLO. JASON"         
 [4] "LAWLOR. JOSEPH"        "CRAWFORD. ADAM"        "HILL. ROBERT W."      
 [7] "TAGERT. CHRISTOPHER"   "ROSEBERY. SCOTT W."    "PAYNE. ALBERT"        
[10] "BUNTZ. BRIAN JOHN"     "COLON. PERFECTO GAUD"  "DIAZ. JOSE CANO"      
[13] "COLON. ERIK D."        "COLON. ERIK D."        "MARTINEZ. DAVID C."   
[16] "DRISKELL. JASON"       "JOHNSON. ALEXANDER"    "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J."        "FRANCO. BRANDT"        "FRANCO. BRANDT"  

For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions:

left = function (string,char){
  substr(string,1,char)}

right = function (string, char){
  substr(string,nchar(string)-(char-1),nchar(string))}

I've had some success with the following, but it uses the position from the first record only, so for example it applies position 6 to all the records rather than computing it for each row.

test$LAST_NAME2 <- left(test$LAST_NAME, 
   which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)

I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.

My plan was to use an ifelse function to apply the "." parsing only to the records that have this issue.

I fear the answer is simple. But I'm stuck. Thanks so much for your help.

I would just modify your original namesplit assignment to this:

 namesplit <- strsplit(crm$DEF_NAME, ',|\\.')

which will split on either "," or ".".
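As a quick check of what that pattern returns, here is a minimal sketch on three made-up names (the vector def_name is hypothetical, standing in for crm$DEF_NAME). It assumes base R only. Note that strsplit drops a delimiter at the very end of a string, so the trailing period of a middle initial is lost.

```r
# Hypothetical sample names, standing in for crm$DEF_NAME
def_name <- c("Doe, John", "Doe. John", "Doe. John T.")

# Split on "," or on a literal "." (escaped, because a bare "." matches any character)
parts <- strsplit(def_name, ",|\\.")

# First piece is the last name, second piece is the rest of the name
last_name  <- trimws(sapply(parts, function(x) x[1]))
first_name <- trimws(sapply(parts, function(x) x[2]))

print(data.frame(last_name, first_name))
```

If the trailing period of an initial matters, splitting only at the first delimiter is probably the safer route.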

Also, maybe change your first name function to

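# paste everything after the first piece back together, so that names which
# split into more than two pieces still yield exactly one string per row
crm$FIRST_NAME <- sapply(namesplit, function(x) paste(trimws(x[-1]), collapse = " "))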

to catch any cases where a comma or period appears somewhere other than the last position.
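If the position of the first delimiter is what is actually needed, as in the left() and right() helpers from the question, base R's regexpr is already vectorised and returns one position per element. The attempt in the question indexes only the first element with [[1]], which is why every row got the same position. The following is a rough sketch on a hypothetical vector full_name. It treats both "," and "." as delimiters and uses ifelse to leave names without a delimiter untouched.

```r
# Hypothetical sample names; the last one has no delimiter at all
full_name <- c("HILL. ROBERT W.", "DEWITT. B", "Doe, John", "SMITH")

# Position of the first "," or "." in each element; -1 where there is none
pos <- as.integer(regexpr("[.,]", full_name))
has_sep <- pos > 0

# Everything before the delimiter, or the whole string if there is no delimiter
last_name <- ifelse(has_sep, substr(full_name, 1, pos - 1), full_name)

# Everything after the delimiter, or NA if there is no delimiter
first_name <- ifelse(has_sep,
                     trimws(substr(full_name, pos + 1, nchar(full_name))),
                     NA_character_)

print(data.frame(pos, last_name, first_name))
```

Because only the first delimiter is used, a later period such as the one in "ROBERT W." stays part of the first name.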

With tidyr,

library(tidyr)

test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
##    LAST_NAME    FIRST_NAME
## 1     DEWITT             B
## 2     LAWLOR        JOSEPH
## 3     TAGERT   CHRISTOPHER
## 4      BUNTZ    BRIAN JOHN
## 5      COLON       ERIK D.
## 6   DRISKELL         JASON
## 7       SIPE      DAVID J.
## 8       TAOY         PETER
## 9   CRAWFORD          ADAM
## 10  ROSEBERY      SCOTT W.
## 11     COLON PERFECTO GAUD
## 12     COLON       ERIK D.
## 13   JOHNSON     ALEXANDER
## 14    FRANCO        BRANDT
## 15     ZULLO         JASON
## 16      HILL     ROBERT W.
## 17     PAYNE        ALBERT
## 18      DIAZ     JOSE CANO
## 19  MARTINEZ      DAVID C.
## 20   JACKSON  RONNIE WAYNE
## 21    FRANCO        BRANDT

Data

test <-  structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER", 
    "BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.", 
    "TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD", 
    "COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON", 
    "HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.", 
    "JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA, 
    -21L), class = "data.frame", .Names = "LAST_NAME")
