
R: combine two csv files with spark

I have two very large csv files and I'm using Spark with R. I loaded the first file this way:

data <- spark_read_csv(sc, "D:/my_file.csv")

After working with the first file, I have these variables:

Name | Number

The second csv file has these variables:

Name | Number | Surname

As you can see, the second file has one more variable than the first. I would like to ignore the Surname column of the second file when loading it with Spark. How can I combine the two files so that the rows of the second are appended after the rows of the first?

From what I gather, you want to get rid of the Surname column in your second dataframe and make a union with the first.

spark_read_csv appears to come from sparklyr, which I have never used. In plain SparkR, the data can be read as shown below. I am fairly sure the rest of the code works the same way regardless of how the data is read.

> d1 = read.df(".../f1.csv", "csv", header="true")
> head(d1)
  Name Number
1    x      7
2    y      8

> d2 = read.df(".../f2.csv", "csv", header="true")
> head(d2)
  Name Number Surname
1    z      5      zz
2    w      6      ww

Then, it is pretty straightforward:

> trimmed_d2 = select(d2, "Name", "Number")
> all_the_data = union(d1, trimmed_d2)
> head(all_the_data)
  Name Number
1    x      7
2    y      8
3    z      5
4    w      6
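
Since the question uses sparklyr rather than SparkR, here is a minimal sketch of the same idea with dplyr verbs, which is how sparklyr tables (tbl_spark objects) are normally manipulated. The helper names append_matching_columns and combine_csv_files, the table names "first_file" and "second_file", and the path arguments are all hypothetical, and I have not run this against the asker's files. A Spark union matches columns by position, so the second table is reduced to the first table's columns, in the first table's order, before the rows are appended.

```r
library(dplyr)

# Hypothetical helper: reduce `second` to the columns of `first`
# (same order), then append its rows below those of `first`.
# Written with dplyr verbs only, so it accepts local data frames
# as well as sparklyr tables.
append_matching_columns <- function(first, second) {
  shared <- colnames(first)
  union_all(first, select(second, all_of(shared)))
}

# Hypothetical wrapper for the sparklyr case. `sc` is assumed to be an
# open connection from sparklyr::spark_connect(); both paths are placeholders.
combine_csv_files <- function(sc, first_path, second_path) {
  first <- sparklyr::spark_read_csv(sc, name = "first_file", path = first_path)
  # memory = FALSE avoids caching the whole second table, Surname included
  second <- sparklyr::spark_read_csv(sc, name = "second_file",
                                     path = second_path, memory = FALSE)
  append_matching_columns(first, second)
}
```

With the data from the question, this would be called as combine_csv_files(sc, "D:/my_file.csv", <path to the second file>). The result is a lazy Spark table with the Name and Number columns only.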
