
Separate a column with numeric and character into two columns using separate()

I have the following dataframe Teams :

Team               Year
012 Hortney        2017
012 Hortney        2018
013 James          2017
013 James          2018
014 Ilωus hero     2017
014 Ilωus hero     2018
015 Hortna         2017
015 Hortna         2018
016 Exclus race    2017
#with 25 000 more rows

I would like to transform it into the data frame below:

code    name         Year
012   Hortney        2017
012   Hortney        2018
013   James          2017
013   James          2018
014   Ilωus hero     2017
014   Ilωus hero     2018
015   Hortna         2017
015   Hortna         2018
016   Exclus race    2017
#with 25 000 more rows

I've tried separate(Team, c("code", "name")), but it mangles the names. In particular, everything from the Greek letter ω onward disappears, and I need ω intact for later coding. The last part of the name also disappears in Exclus race. The result looks like this (the value I'm looking for is in brackets):

code   name          Year
012   Hortney        2017
012   Hortney        2018
013   James          2018
014   Il             2017   (Ilωus hero)
014   Il             2018   (Ilωus hero)
015   Hortna         2017
015   Hortna         2018
016   Exclus         2017   (Exclus race)
#with 25 000 more rows

Anyone have any ideas?

Try this

library(dplyr)

Teams |>
  mutate(code = gsub("\\D", "", Team),             # keep only the digits
         name = trimws(gsub("\\d", "", Team))) |>  # drop the digits, then trim spaces
  select(code, name, Year)

Output:
  code        name Year
1  012     Hortney 2017
2  012     Hortney 2018
3  013       James 2017
4  013       James 2018
5  014  Ilωus hero 2017
6  014  Ilωus hero 2018
7  015      Hortna 2017
8  015      Hortna 2018
9  016 Exclus race 2017
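Note that the two gsub() calls act on every digit in Team, not just the leading ones, so a name that itself contains a digit would lose it to the code. If that can happen in the data, a base R variant anchored at the start of the string might be safer. The following is only a sketch: Teams_demo and the row "017 Team 7" are made up for illustration and are not part of the question's data.

```r
# Hypothetical sample; "017 Team 7" is a made-up row with a digit in the name.
# "\u03c9" is the Greek letter omega, written as an escape for portability.
Teams_demo <- data.frame(
  Team = c("012 Hortney", "014 Il\u03c9us hero", "017 Team 7"),
  Year = c(2017, 2018, 2017)
)

# Leading digits only: capture them and discard the rest of the string.
Teams_demo$code <- sub("^(\\d+)\\s+.*$", "\\1", Teams_demo$Team)
# Everything after the leading digits and the whitespace that follows them.
Teams_demo$name <- sub("^\\d+\\s+", "", Teams_demo$Team)

Teams_demo[, c("code", "name", "Year")]
```

With this variant the made-up row keeps "Team 7" as its name, whereas the gsub() version would give the code "0177" and the name "Team".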

With some regex and the stringr package:

library(stringr)

# Data for a reproducible example
Teams = data.frame(
  Team = c("012 Hortney", " 013  James ", " 018 Alain Philippe have a very long name"),
  Year = c(2017, 2018, 2017)
)

dplyr::mutate(Teams,
  # Remove leading, trailing, and repeated spaces in Team
  Team = str_squish(Team),
  # Extract the leading run of digits in Team
  code = str_extract(Team, "[0-9]*"),
  # Extract the first run of letters in Team, possibly with spaces between the letters
  name = str_extract(Team, "[:alpha:]+[ ?[:alpha:]]*")
) %>%
  dplyr::select(code, name, Year)

Output

 code                                 name Year
1  012                              Hortney 2017
2  013                                James 2018
3  018 Alain Philippe have a very long name 2017

Please note that the variable 'code' consists only of the leading digits in the variable 'Team', so it cannot contain a "." or "-" between the numbers. For example, a number like '01.18' before the team name results in a 'code' value of '01': the match stops at the first non-digit character ('.') in the team number.
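The '01.18' case can be checked with a small base R sketch, which runs without extra packages. The sample strings and the wider pattern "^[0-9][0-9.-]*" are assumptions made for illustration, not part of the answer above; the wider pattern is one possible way to keep "." and "-" inside the code.

```r
# Hypothetical team strings; the second one has a "." inside the number.
team <- c("012 Hortney", "01.18 James")

# Leading digits only: the match stops at the first non-digit character.
code_digits <- regmatches(team, regexpr("^[0-9]+", team))

# Assumed alternative: a leading digit followed by digits, "." or "-".
code_wide <- regmatches(team, regexpr("^[0-9][0-9.-]*", team))

print(code_digits)
print(code_wide)
```

The first pattern returns "01" for the second string, and the wider one returns "01.18".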

You could use the sep argument of separate() with a regex that matches only the space that follows a digit and precedes a word character (that is, only the first space), using lookarounds. See the code below.

The Greek letter acts as a separator in your code because the default value of sep is [^[:alnum:]]+, which matches any run of non-alphanumeric characters (here, the Greek letter!). The space in "Exclus race" matches too, and since only two column names are given, the extra pieces are dropped.

library(tidyr)

df |>
  separate(Team, into = c("code", "name"), sep = "(?<=\\d) (?=\\w)")

Output:

# A tibble: 9 × 3
  code  name         Year
  <chr> <chr>       <dbl>
1 012   Hortney      2017
2 012   Hortney      2018
3 013   James        2017
4 013   James        2018
5 014   Ilωus hero   2017
6 014   Ilωus hero   2018
7 015   Hortna       2017
8 015   Hortna       2018
9 016   Exclus race  2017

Data:

library(readr)

df <- read_csv("Team,               Year
012 Hortney,        2017
012 Hortney,        2018
013 James,          2017
013 James,          2018
014 Ilωus hero,     2017
014 Ilωus hero,     2018
015 Hortna,         2017
015 Hortna,         2018
016 Exclus race,    2017")
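To see the difference between the default separator and the lookaround separator without tidyr, the two patterns can be tried with base R's strsplit(). This is only a rough illustration: strsplit() is not what separate() calls internally, and the ASCII-only example string is chosen so that the result does not depend on the locale.

```r
x <- "016 Exclus race"

# Default-style pattern: split on every run of non-alphanumeric characters.
default_pieces <- strsplit(x, "[^[:alnum:]]+")[[1]]

# Lookaround pattern: split only on a space that follows a digit and
# precedes a word character (perl = TRUE is needed for lookarounds).
first_space_pieces <- strsplit(x, "(?<=\\d) (?=\\w)", perl = TRUE)[[1]]

print(default_pieces)
print(first_space_pieces)
```

The default-style pattern yields three pieces ("016", "Exclus", "race") for two target columns, while the lookaround pattern yields exactly two ("016", "Exclus race").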
