R: combine two csv files with spark

Question

I have two very large CSV files and I'm using Spark with R. I loaded the first file like this:

data <- spark_read_csv(sc, "D:/my_file.csv")

After working with the first file, I have these variables:

Name | Number

The second CSV file has these variables:

Name | Number | Surname

As you can see, the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading it with Spark. How can I combine the two files so that the second is a continuation of the first, i.e. its rows are appended after the rows of the first?

Answer 1

From what I gather, you want to drop the Surname column from your second dataframe and take its union with the first.

spark_read_csv comes from sparklyr, which I have never used, but in plain SparkR the data can be read as shown below. I am fairly sure the rest of the code works the same way regardless of how the data is read.

> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
  Name Number
1    x      7
2    y      8

> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
  Name Number Surname
1    z      5      zz
2    w      6      ww

Then, it is pretty straightforward:

> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
  Name Number
1    x      7
2    y      8
3    z      5
4    w      6
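
Since the question reads its data with sparklyr rather than SparkR, here is a rough sketch of the same two steps (drop Surname, then append) using sparklyr together with dplyr. It is not tested against the asker's data. The local connection, the table names and the path of the second file are assumptions made for illustration; only D:/my_file.csv comes from the question.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; use your own master on a cluster.
sc <- spark_connect(master = "local")

# First file, with the path from the question. The table name is made up.
first <- spark_read_csv(sc, name = "first_file", path = "D:/my_file.csv")

# Second file. Both the table name and the path are hypothetical.
second <- spark_read_csv(sc, name = "second_file", path = "D:/my_second_file.csv")

# Keep only the columns shared with the first file (this drops Surname).
second_trimmed <- select(second, Name, Number)

# Append the rows of the second file after those of the first.
# sdf_bind_rows() matches columns by name.
combined <- sdf_bind_rows(first, second_trimmed)

# Look at the first few rows of the result.
head(combined)

spark_disconnect(sc)
```

The select() call is lazy, so Surname is dropped inside Spark rather than after the data has been collected into R. Note that the file is still read with all three columns; the column is only removed from the resulting table.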
