Extract text between brackets in character column (create new column) of R dataframe

Apologies if the title is a bit wordy; hopefully this example makes it clearer. I have the following dataset:

my_df
                                     Description thisYVal thisPts
1                     (12:00)   Start Period        0       0
2        (12:00)   Jump Ball Thomas vs Grant        0       0
3      (11:48) [MIA 3-] Wade Layup Shot: Missed     0       2
4  (11:46) [PHL] Thomas Rebound (Off: Def:1)        0       0
6     (11:02) [MIA] Haslem Jump Shot: Missed      -19       2
7  (11:00) [MIA] Haslem Rebound (Off:1 Def:)        0       0
8    (10:57) [MIA] Haslem Layup Shot: Missed        0       2
9 (10:56) [PHL] Coleman Rebound (Off: Def:1)        0       0

dput(my_df)
structure(list(Description = c("(12:00)   Start Period", "(12:00)   Jump Ball Thomas vs Grant", 
"(11:48) [MIA 3-] Wade Layup Shot: Missed", "(11:46) [PHL] Thomas Rebound (Off: Def:1)", 
"(11:02) [MIA] Haslem Jump Shot: Missed", "(11:00) [MIA] Haslem Rebound (Off:1 Def:)", 
"(10:57) [MIA] Haslem Layup Shot: Missed", "(10:56) [PHL] Coleman Rebound (Off: Def:1)"
), thisYVal = c(0L, 0L, 0L, 0L, -19L, 0L, 0L, 0L), thisPts = c(0L, 
0L, 2L, 0L, 2L, 0L, 2L, 0L)), row.names = c(1L, 2L, 3L, 4L, 6L, 
7L, 8L, 9L), class = "data.frame")

... and I would like to extract the 3-letter team abbreviation that appears in the Description column of the dataframe.

The 3-letter abbreviation always follows the first opening square bracket [ , although it is not always immediately followed by the closing bracket ] (as you can see in row 3 of the dataframe).

I have been trying to do this using the substr() function but have had no luck so far. Any help is appreciated!

EDIT: some additional context - some rows (1 and 2, in this case) do not have a [] or a team-abbreviation. In these instances, the dataframe could return a blank string, an NA, or something else.

EDIT-2: Just in case, since it wasn't explicitly mentioned: a 4th column with c("", "", "MIA", "PHL", "MIA", "MIA", "MIA", "PHL") is what I am trying to get.

EDIT-3: The following gets me close, but not quite there:

my_df %>% 
  dplyr::mutate(teamAbb = unlist(stringr::str_extract(Description, "\\[(.*)\\]")))

Since version 3.4.0, R has included strcapture in its standard utils package:

strcapture("(?<=\\[)(.{3})", my_df$Description, proto=list(out=character()), perl=TRUE)
#   out
#1 <NA>
#2 <NA>
#3  MIA
#4  PHL
#5  MIA
#6  MIA
#7  MIA
#8  PHL
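
To attach the captured value as a new column, and to return a blank string instead of NA as the question allows, something along these lines may work. This is only a sketch on three of the question's rows; the uppercase-only pattern and the column name teamAbb (borrowed from the question's own attempt) are assumptions, not part of the answer above.

```r
# Three Description values copied from the question's data
my_df <- data.frame(
  Description = c("(12:00)   Start Period",
                  "(11:48) [MIA 3-] Wade Layup Shot: Missed",
                  "(11:46) [PHL] Thomas Rebound (Off: Def:1)"),
  stringsAsFactors = FALSE
)

# The pattern has one capture group, so proto needs exactly one column
caps <- utils::strcapture("\\[([A-Z]{3})", my_df$Description,
                          proto = data.frame(teamAbb = character(),
                                             stringsAsFactors = FALSE))

# Rows without a bracketed team come back as NA; swap in "" if preferred
my_df$teamAbb <- ifelse(is.na(caps$teamAbb), "", caps$teamAbb)
```

Dropping the last line keeps the NA values, which are often easier to filter on than empty strings.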

You can use str_match from the stringr package. Specifically, you will want to look for three uppercase letters (assuming all team abbreviations are three letters) immediately after a left square bracket. Here df holds the data from the question.

> library(stringr)
> str_match(df$Description, '\\[([A-Z]{3})')
     [,1]   [,2] 
[1,] NA     NA   
[2,] NA     NA   
[3,] "[MIA" "MIA"
[4,] "[PHL" "PHL"
[5,] "[MIA" "MIA"
[6,] "[MIA" "MIA"
[7,] "[MIA" "MIA"
[8,] "[PHL" "PHL"

You'll note that the team abbreviation pattern is wrapped in parentheses; that makes it a capture group within the overall pattern. str_match returns a matrix whose first column holds the full match and whose remaining columns hold the capture group(s). In this case we therefore want the second column, which contains the matches for the first capture group.

df$Team <- str_match(df$Description, '\\[([A-Z]{3})')[,2]

This gives us the desired result:

                                 Description Team
1                     (12:00)   Start Period <NA>
2        (12:00)   Jump Ball Thomas vs Grant <NA>
3  (11:48) [MIA 3-] Wade Layup Shot: Missed   MIA
4  (11:46) [PHL] Thomas Rebound (Off: Def:1)  PHL
5     (11:02) [MIA] Haslem Jump Shot: Missed  MIA
6  (11:00) [MIA] Haslem Rebound (Off:1 Def:)  MIA
7    (10:57) [MIA] Haslem Layup Shot: Missed  MIA
8 (10:56) [PHL] Coleman Rebound (Off: Def:1)  PHL
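
For readers without stringr, the same split between full match and capture group is available in base R through regexec() and regmatches(). The following is a minimal sketch on two of the question's strings; the helper name first_group is made up for illustration.

```r
# Two Description values copied from the question's data
x <- c("(12:00)   Start Period",
       "(11:48) [MIA 3-] Wade Layup Shot: Missed")

# regmatches() on a regexec() result gives, per string, the full match
# followed by each capture group, or character(0) when nothing matched
hits <- regmatches(x, regexec("\\[([A-Z]{3})", x))

# Hypothetical helper: take element 2 (the first capture group), or NA
first_group <- function(hit) if (length(hit) >= 2L) hit[2L] else NA_character_
team <- vapply(hits, first_group, character(1))
```

Here hits[[2]] holds both "[MIA" and "MIA", mirroring the two columns of the str_match result, while team keeps only the capture group.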

Here is another option that uses a lookbehind to match the three non-digit characters right after an opening square bracket and places them in a new column called Team (df holds the data from the question):

library(tidyverse)

df %>% mutate(Team = str_extract(Description, "(?<=\\[)\\D{3}"))
#>                                  Description thisYVal thisPts Team
#> 1                     (12:00)   Start Period        0       0 <NA>
#> 2        (12:00)   Jump Ball Thomas vs Grant        0       0 <NA>
#> 3   (11:48) [MIA 3-] Wade Layup Shot: Missed        0       2  MIA
#> 4  (11:46) [PHL] Thomas Rebound (Off: Def:1)        0       0  PHL
#> 5     (11:02) [MIA] Haslem Jump Shot: Missed      -19       2  MIA
#> 6  (11:00) [MIA] Haslem Rebound (Off:1 Def:)        0       0  MIA
#> 7    (10:57) [MIA] Haslem Layup Shot: Missed        0       2  MIA
#> 8 (10:56) [PHL] Coleman Rebound (Off: Def:1)        0       0  PHL

Created on 2018-09-09 by the reprex package (v0.2.0).
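
One caveat, offered as a sketch and not as a correction: \\D matches any non-digit, including spaces and punctuation, so it can pick up text that is not a team code if such rows exist. The comparison below uses base R so it runs without extra packages. The second input string is invented for illustration and does not come from the question's data, and extract_first is a hypothetical helper name.

```r
# First string is from the question; the second is a made-up edge case
x <- c("(11:48) [MIA 3-] Wade Layup Shot: Missed",
       "(9:30) [ - ] Timeout")

# Hypothetical helper: first match of a Perl-style pattern, NA if none
extract_first <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  found <- m != -1L
  out <- rep(NA_character_, length(x))
  out[found] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

extract_first("(?<=\\[)\\D{3}", x)     # "MIA" " - "
extract_first("(?<=\\[)[A-Z]{3}", x)   # "MIA" NA
```

If every bracket in the real data is followed directly by a team code, both patterns give the same answer, and the stricter class only matters as a safeguard.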
