
Separate strings by a character into columns in R

I have a column in a dataframe with scraped prices like this:

prices
$1,50 $1,20
$1,50
$1,75 $1,25 $1,35

In short, each row can hold several prices. I would like to split them into separate columns at each $. For the example above, this is what I need:

prices               price1 price2 price3
$1,50 $1,20          1,50   1,20   NA
$1,50                1,50   NA     NA
$1,75 $1,25 $1,35    1,75   1,25   1,35

I have tried the following, but neither option does what I need. Any help is appreciated.

str_split(prices, pattern = '[$]') # I get a column with values like this c("", "1,50")
separate(prices, sep = '[$]', into = c("price1", "price2"), remove = FALSE) 
#Price1 is created empty and I am trying to use it in a function, 
#so in some dataframes the number of prices can vary.

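One approach using dplyr and tidyr:

library(dplyr)
library(tidyr)

df %>% 
  rowwise() %>% 
  # split on spaces, then drop the literal "$" from each piece
  mutate(price = list(gsub("$", "", strsplit(prices, " ")[[1]], fixed = TRUE))) %>% 
  unnest_wider(price, names_sep = "")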

Output:

  prices            price1 price2 price3
  <chr>             <chr>  <chr>  <chr> 
1 $1,50 $1,20       1,50   1,20   NA    
2 $1,50             1,50   NA     NA    
3 $1,75 $1,25 $1,35 1,75   1,25   1,35 

Input:

df = structure(list(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"
)), class = "data.frame", row.names = c(NA, -3L))
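Since the question mentions wrapping this in a function for data frames where the number of prices varies, here is a minimal base R sketch of the same idea: strip the $, split on spaces, and pad short rows with NA. The helper name split_prices and its col argument are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: split "$a $b ..." strings into price1..priceN columns
split_prices <- function(df, col = "prices") {
  # Remove the literal "$", then split each string on runs of spaces
  parts <- strsplit(gsub("$", "", df[[col]], fixed = TRUE), " +")
  # Pad every vector with NA up to the length of the widest row
  n <- max(lengths(parts))
  padded <- lapply(parts, function(p) c(p, rep(NA_character_, n - length(p))))
  out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
  names(out) <- paste0("price", seq_len(n))
  # Keep the original column and append the new ones
  cbind(df, out)
}

df <- data.frame(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"),
                 stringsAsFactors = FALSE)
res <- split_prices(df)
```

The number of new columns follows the widest row, so the same call should work unchanged on data frames with more or fewer prices per row.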

In base R you could do:

read.table(text=df$prices, fill=TRUE, header = FALSE, sep='$', dec = ',')[-1]
    V2   V3   V4
1 1.50 1.20   NA
2 1.50   NA   NA
3 1.75 1.25 1.35

And if you don't want them as numeric but as character, keeping the comma, you can do:

read.table(text=df$prices, fill=TRUE, header=FALSE, sep='$', na.strings='')[-1]
     V2    V3   V4
1 1,50   1,20 <NA>
2  1,50  <NA> <NA>
3 1,75  1,25  1,35

You can then change the names: i.e. save the result as df1 and set its names to paste0('prices', seq(ncol(df1)))
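As a minimal sketch of that renaming step, assuming the read.table result is stored in a variable called df1: the strip.white = TRUE argument is an addition not in the answer above, used here to drop the trailing spaces left in the character values, and the prefix "price" is chosen to match the column names the question asks for.

```r
# Input from the question
df <- data.frame(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"),
                 stringsAsFactors = FALSE)

# Split on "$"; the first field is always empty, so drop it with [-1]
df1 <- read.table(text = df$prices, fill = TRUE, header = FALSE,
                  sep = "$", na.strings = "", strip.white = TRUE)[-1]

# Rename V2, V3, ... to price1, price2, ...
names(df1) <- paste0("price", seq_len(ncol(df1)))

# Attach the new columns to the original data
result <- cbind(df, df1)
```

Note that read.table decides how many columns to create from the first five lines of input, so a later row with more prices than any of the first five would not fit this layout.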

If your default locale has comma as the decimal separator, then:

library(tidyverse)
options("readr.default_locale" = readr::locale(decimal_mark = ","))

df <- tibble(prices =
               c("$1,50 $1,20",
                 "$1,50",
                 "$1,75 $1,25 $1,35"))
df |>
  mutate(prices = prices |>
           str_split(" ") |>
           map( ~ str_remove(., "\\$"))) |>
  unnest_wider(prices) |>
  mutate(across(everything(), readr::parse_number))
#> New names:
#> New names:
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> # A tibble: 3 × 3
#>    ...1  ...2  ...3
#>   <dbl> <dbl> <dbl>
#> 1  1.5   1.2  NA   
#> 2  1.5  NA    NA   
#> 3  1.75  1.25  1.35

Otherwise:

df |>
  mutate(prices = prices |>
           str_split(" ") |>
           map( ~ str_remove(., "\\$"))) |>
  unnest_wider(prices) |>
  mutate(across(everything(), ~ readr::parse_number(., locale = readr::locale(decimal_mark = ","))))
#> New names:
#> New names:
#> New names:
#> • `` -> `...1`
#> • `` -> `...2`
#> # A tibble: 3 × 3
#>    ...1  ...2  ...3
#>   <dbl> <dbl> <dbl>
#> 1  1.5   1.2  NA   
#> 2  1.5  NA    NA   
#> 3  1.75  1.25  1.35

With cSplit :

library(splitstackshape)
s <- cSplit(df, "prices", "$", type.convert = TRUE)[, -1]
df[, paste0("price", 1:ncol(s))] <- s

#             prices  price1  price2  price3
#1       $1,50 $1,20    1,50    1,20    <NA>
#2             $1,50    1,50    <NA>    <NA>
#3 $1,75 $1,25 $1,35    1,75    1,25    1,35

In this approach we convert the data to long form using separate_rows, transform it using transform, and convert it back to wide form using reshape. We use a mix of dplyr, tidyr and base functions, choosing whichever gives the shorter code.

1) Add a column P that is a copy of prices, strip the $ signs from prices, and add a column row that numbers the rows. Then separate the prices column into rows, add a column n that numbers the values within each original row, and convert to wide form. reshape is a bit less code than pivot_wider in this case, but the latter could have been used. We also use transform, which is like mutate except that it outputs a data frame, which reshape needs. At the end, select the columns we need.

library(dplyr)
library(tidyr)

DF %>% 
  mutate(P = prices, prices = gsub("\\$", "", prices), row = 1:n()) %>% 
  separate_rows(prices, sep = " +") %>%
  transform(n = ave(1:nrow(.), row, FUN = seq_along))  %>%
  reshape(dir = "wide", idvar = c("row", "P"), timevar = "n", sep = "") %>%
  select(prices = P, everything(), -row)

giving:

             prices prices1 prices2 prices3
1       $1,50 $1,20    1,50    1,20    <NA>
3             $1,50    1,50    <NA>    <NA>
4 $1,75 $1,25 $1,35    1,75    1,25    1,35
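For comparison, the pivot_wider variant mentioned in 1) might look roughly like the sketch below. It assumes the same DF input as the rest of this answer (repeated here so the block runs on its own) and uses the prefix "price" rather than "prices" for the new columns, which is a naming choice made here and not part of the answer above.

```r
library(dplyr)
library(tidyr)

DF <- data.frame(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"),
                 stringsAsFactors = FALSE)

wide <- DF %>%
  # keep the original string in P and number the rows
  mutate(P = prices, prices = gsub("\\$", "", prices), row = row_number()) %>%
  # one row per individual price
  separate_rows(prices, sep = " +") %>%
  # n counts the prices within each original row
  group_by(row) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  # spread the prices back out into price1, price2, ...
  pivot_wider(names_from = n, values_from = prices, names_prefix = "price") %>%
  select(prices = P, starts_with("price"))
```

It is a few lines longer than the reshape version, which fits the remark that reshape needs a bit less code here.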

2) If you want the price columns converted to numeric and the decimal point is a dot in the current locale, use the following, which replaces the commas with dots and adds convert = TRUE to separate_rows. If comma is the decimal point in the current locale, omit the second gsub line inside mutate below (the one that replaces commas with dots).

DF %>% 
  mutate(P = prices, prices = gsub("\\$", "", prices), 
         prices = gsub(",", ".", prices),
         row = 1:n()) %>% 
  separate_rows(prices, sep = " +", convert = TRUE) %>%
  transform(n = ave(1:nrow(.), row, FUN = seq_along))  %>%
  reshape(dir = "wide", idvar = c("row", "P"), timevar = "n", sep = "") %>%
  select(prices = P, everything(), -row)
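The comma-to-dot step can be checked in isolation with base R. This is only a small sketch on a made-up vector x, not part of the pipeline above:

```r
# Made-up example values in the same format as the split prices
x <- c("1,50", "1,20", NA)

# Replace the decimal comma with a dot, then convert to numeric
num <- as.numeric(gsub(",", ".", x, fixed = TRUE))
```

Here num holds 1.5, 1.2 and NA, so missing prices stay missing after the conversion.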

Note

The input in reproducible form:

DF <-
structure(list(prices = c("$1,50 $1,20", "$1,50", "$1,75 $1,25 $1,35"
)), class = "data.frame", row.names = c(NA, -3L))
