Is there a more memory efficient way to concatenate two columns in SparkR?

I'm trying to concatenate two columns from two Spark data frames using SparkR, e.g.


Frame1
ID Data1 Data2...
1 0 0
2 1 0
3 1 1
4 0 1
5 1 1


Frame2
OtherID Data1 Data2...
2 0 0
3 1 0
4 1 1
5 0 1
6 1 1

After concatenation:
Frame1
ID Data1 Data2...
12 0 0
23 1 0
34 1 1
45 0 1
56 1 1

As these are Spark frames, each column is treated as an S4 class rather than a vector, so a simple paste(Frame1$ID, Frame2$OtherID) doesn't work: R can't coerce the S4 type into a vector. My current solution, then, is to collect the two frames into regular R data frames and then run paste():

# Pull both Spark DataFrames into local R data frames
r_frame1 <- collect(Frame1)
r_frame2 <- collect(Frame2)

# Paste the two ID columns together row-wise
r_frame1$id <- paste(r_frame1$ID, r_frame2$OtherID, sep = "")

This works for some of my data, but I need to do it for several tables, all of which are very large (several million rows), and I'm getting java.lang.OutOfMemoryError failures in my Databricks environment. Is there a more memory-efficient way of doing this, perhaps one that doesn't involve collecting into an R data frame? I have also tried running it as a SQL command using CONCAT, but ran into other issues (see my previous related question: Databricks SQL CONCAT function adding characters where not requested?).

You can use the function concat:

Frame1$id <- concat(Frame1$ID, Frame2$OtherID)

If you want to concatenate with a separator, you can use concat_ws:

Frame1$id <- concat_ws('_', Frame1$ID, Frame2$OtherID)

Note that if your columns are integers, you will probably need to cast them to string before concatenating, with something like this:

Frame1$id <- concat(cast(Frame1$ID, "string"), cast(Frame2$OtherID, "string"))
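
One caveat: concat can only combine columns of a single SparkDataFrame, and Frame1$ID and Frame2$OtherID live in two different frames, so you would need to join them first. Below is a minimal sketch, not a definitive recipe: it assumes a hypothetical shared column "key" to join on (the example frames have no common column, and Spark rows carry no guaranteed order, so there is no direct distributed equivalent of a positional paste()).

# Sketch only: "key" is an assumed join column, not part of the original data.
library(SparkR)

# Bring both ID columns into one SparkDataFrame via a join
joined <- join(Frame1, Frame2, Frame1$key == Frame2$key, "inner")

# Concatenate on the cluster -- nothing is collect()ed into local R memory
joined$id <- concat(cast(joined$ID, "string"), cast(joined$OtherID, "string"))

head(select(joined, "id"))

Because the concatenation runs as a Spark column expression, it scales to the multi-million-row tables without the driver-side memory pressure that collect() causes.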
