I have two very large CSV files and I'm using Spark with R. I loaded the first file this way:
data <- spark_read_csv(sc, "D:/my_file.csv")
After working with the first file, I have these variables:
Name | Number
The second CSV file has these variables:
Name | Number | Surname
As you can see, the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading it with Spark. How can I combine the two files so that the second is a continuation of the first?
From what I gather, you want to get rid of the Surname column in your second dataframe and make a union with the first. spark_read_csv seems to come from sparklyr, which I have never used, but in plain SparkR we could read the data like below. I am pretty sure that the rest of the code would work the same way, regardless of how the data is read.
> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
Name Number
1 x 7
2 y 8
> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
Name Number Surname
1 z 5 zz
2 w 6 ww
Then, it is pretty straightforward:
> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
Name Number
1 x 7
2 y 8
3 z 5
4 w 6
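Since the question uses sparklyr rather than SparkR, here is a minimal sketch of the same idea with dplyr verbs, which sparklyr tables also accept. The helper name append_matching_columns is hypothetical (it is not part of any package), and it assumes every column of the first table also exists in the second.

```r
library(dplyr)

# Hypothetical helper: keep only the columns of `first` in `second`,
# then append the rows of `second` below those of `first`.
append_matching_columns <- function(first, second) {
  shared <- colnames(first)                          # e.g. "Name", "Number"
  second_trimmed <- select(second, all_of(shared))   # drops Surname
  union_all(first, second_trimmed)                   # keeps duplicate rows
}

# Usage sketch with sparklyr (the second path is an assumption):
# library(sparklyr)
# sc    <- spark_connect(master = "local")
# data  <- spark_read_csv(sc, name = "first",  path = "D:/my_file.csv")
# data2 <- spark_read_csv(sc, name = "second", path = "D:/my_second_file.csv")
# all_the_data <- append_matching_columns(data, data2)
```

With sparklyr tables the select and union are evaluated lazily by Spark, so the extra column is never pulled into R. union_all is used rather than union so that rows appearing in both files are not dropped.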