
Does the separate function work on Arrow tables in R?

Is there a way to use the separate function on Arrow tables? Arrow's columnar data layout should make this kind of data manipulation faster than on a data.frame.

separate itself is not supported on Arrow objects, but a combination of sub and other supported functions can sometimes produce the same result. For example:

library(dplyr)
library(arrow) # 10.0.0
# from ?tidyr::separate
df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
write_parquet(df, "quux.parquet")
ds <- open_dataset("quux.parquet")
ds %>%
  tidyr::separate(x, c("A", "B"))
# Error in UseMethod("separate") : 
#   no applicable method for 'separate' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
df %>%
  tidyr::separate(x, c("A", "B"))
#      A    B
# 1 <NA> <NA>
# 2    x    y
# 3    x    z
# 4    y    z
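
If the rows of interest fit in memory, one fallback is to pull them into R with collect() and call tidyr::separate on the resulting data frame. The following is a minimal sketch, assuming the arrow, dplyr, and tidyr packages are installed; the temporary file path is only illustrative. The filter runs in Arrow, but the split itself runs in R, so it does not benefit from Arrow's engine.

```r
library(dplyr)
library(arrow)

df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
path <- tempfile(fileext = ".parquet")  # illustrative path, not from the original post
write_parquet(df, path)
ds <- open_dataset(path)

res <- ds %>%
  filter(!is.na(x)) %>%            # evaluated by Arrow before any data reaches R
  collect() %>%                    # materialize the remaining rows as a data frame
  tidyr::separate(x, c("A", "B"))  # ordinary tidyr call on in-memory data
```

Reducing the data with filter or select before collect() keeps the in-memory part as small as possible.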

Similarly, using sub and family:

df %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x))
#      x    A    B
# 1 <NA> <NA> <NA>
# 2  x.y    x    y
# 3  x.z    x    z
# 4  y.z    y    z
ds %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x))
# FileSystemDataset (query)
# x: string
# A: string (replace_substring_regex(x, {pattern="\..*", replacement="", max_replacements=1}))
# B: string (replace_substring_regex(x, {pattern=".*\.", replacement="", max_replacements=1}))
# See $.data for the source Arrow object
ds %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x)) %>%
  collect()
#      x    A    B
# 1 <NA> <NA> <NA>
# 2  x.y    x    y
# 3  x.z    x    z
# 4  y.z    y    z
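
The two sub calls match separate only when each value contains exactly one delimiter. The sketch below illustrates this on a plain character vector (base R only, no Arrow involved): a value with no "." is returned unchanged in both columns, and a value with several "." yields the first and last pieces. The guard with grepl is one possible fix for the missing-delimiter case; whether that guard translates inside an Arrow query is an assumption that would need checking against the installed arrow version.

```r
x <- c(NA, "x.y", "xyz", "a.b.c")

# Same patterns as in the answer above
A <- sub("\\..*", "", x)   # everything before the first "."
B <- sub(".*\\.", "", x)   # everything after the last "." (greedy match)

# "xyz" has no ".", so sub() leaves it untouched in both A and B;
# "a.b.c" gives A = "a" and B = "c", skipping the middle piece.

# Possible guard: set B to NA when no delimiter is present
has_sep <- grepl(".", x, fixed = TRUE)       # FALSE for NA input
B_fixed <- ifelse(has_sep, B, NA_character_)
```

With the guard, the no-delimiter case mirrors what tidyr::separate does by default, which fills the missing piece with NA.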
