Is there a way to use the separate function on arrow tables? The columnar data layout should make this kind of data manipulation faster than on a data.frame.
separate itself is not supported, but sometimes we can use sub and other supported functions to get what we need. For example:
library(dplyr)
library(arrow) # 10.0.0
# from ?tidyr::separate
df <- data.frame(x = c(NA, "x.y", "x.z", "y.z"))
write_parquet(df, "quux.parquet")
ds <- open_dataset("quux.parquet")
ds %>%
  tidyr::separate(x, c("A", "B"))
# Error in UseMethod("separate") :
#   no applicable method for 'separate' applied to an object of class "c('FileSystemDataset', 'Dataset', 'ArrowObject', 'R6')"
df %>%
  tidyr::separate(x, c("A", "B"))
#      A    B
# 1 <NA> <NA>
# 2    x    y
# 3    x    z
# 4    y    z
Similarly, using sub and family:
df %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x))
#      x    A    B
# 1 <NA> <NA> <NA>
# 2  x.y    x    y
# 3  x.z    x    z
# 4  y.z    y    z
ds %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x))
# FileSystemDataset (query)
# x: string
# A: string (replace_substring_regex(x, {pattern="\..*", replacement="", max_replacements=1}))
# B: string (replace_substring_regex(x, {pattern=".*\.", replacement="", max_replacements=1}))
# See $.data for the source Arrow object
ds %>%
  mutate(A = sub("\\..*", "", x), B = sub(".*\\.", "", x)) %>%
  collect()
#      x    A    B
# 1 <NA> <NA> <NA>
# 2  x.y    x    y
# 3  x.z    x    z
# 4  y.z    y    z
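One caveat: the two sub patterns above match separate only when every non-missing value contains exactly one ".". The sketch below uses base R only, so it can be run without arrow, to show what happens with extra or missing delimiters. The names B_first, has_sep and B_masked are made up for illustration, and I have not verified that the grepl/ifelse masking step translates to an Arrow expression, so treat that part as an assumption to test on your own dataset.

```r
# Base R only, so the behaviour can be checked without arrow installed.
x <- c(NA, "x.y", "a.b.c", "abc")

# Same patterns as in the answer above.
A <- sub("\\..*", "", x)  # drop everything from the first "." onwards
B <- sub(".*\\.", "", x)  # greedy: drop everything up to the LAST "."

# Hypothetical variant: keep everything after the FIRST "." instead.
B_first <- sub("^[^.]*\\.", "", x)

# sub() returns values without a "." unchanged, so A and B would both
# repeat the input. Mask those rows with NA if that is not wanted.
has_sep <- grepl(".", x, fixed = TRUE)  # FALSE for NA and for "abc"
B_masked <- ifelse(has_sep, B, NA_character_)

result <- data.frame(x, A, B, B_first, B_masked)
print(result)
```

For "a.b.c" the original B pattern keeps only the last piece ("c"), while B_first keeps "b.c". For "abc" both A and B come back as "abc" unless the value is masked.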