
CSV to RDD to Cassandra store in Apache Spark

I have a bunch of data in a CSV file which I need to store in Cassandra through Spark. I'm using the Spark Cassandra Connector for this. Normally, to store into Cassandra, I create a POJO, parallelize it into an RDD, and then store it:

Employee emp = new Employee(1, "Mr", "X");
JavaRDD<Employee> empRdd = sc.parallelize(Arrays.asList(emp)); // sc is a JavaSparkContext; parallelize takes a List

Finally, I write this to Cassandra as:

CassandraJavaUtil.javaFunctions(empRdd, Employee.class).saveToCassandra("dev", "emp");

This works fine, but my data is stored in a CSV file, where every line represents a row in the Cassandra table.

I know I can read each line, split it into columns, create an object from the column values, add it to a list, and finally parallelize the entire list into an RDD. I was wondering if there is an easier, more direct way to do this?

Well, you could just use sstableloader for bulk loading and avoid Spark altogether. If you rely on Spark, then I think you're out of luck... although I'm not sure it can get much easier than reading the file line by line and splitting the lines.
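For reference, here is a minimal sketch of that line-by-line approach, in the spirit of the code in the question. It assumes a dev.emp table with id, title and name columns, an Employee bean matching them, a Cassandra node reachable at 127.0.0.1, and a hypothetical people.csv path; the save call is the same connector-1.0-style CassandraJavaUtil.javaFunctions(rdd, Class).saveToCassandra(keyspace, table) used above:

import java.io.Serializable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraJavaUtil;

public class CsvToCassandra {

    // Minimal JavaBean matching the assumed dev.emp columns (id, title, name).
    public static class Employee implements Serializable {
        private Integer id;
        private String title;
        private String name;

        public Employee() { }

        public Employee(Integer id, String title, String name) {
            this.id = id;
            this.title = title;
            this.name = name;
        }

        public Integer getId()   { return id; }
        public String getTitle() { return title; }
        public String getName()  { return name; }

        public void setId(Integer id)      { this.id = id; }
        public void setTitle(String title) { this.title = title; }
        public void setName(String name)   { this.name = name; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("csv-to-cassandra")
                .set("spark.cassandra.connection.host", "127.0.0.1"); // assumed node address
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Each element of this RDD is one raw line of the CSV file.
        JavaRDD<String> lines = sc.textFile("people.csv"); // hypothetical path

        // Split every line on commas and build one Employee per row.
        JavaRDD<Employee> empRdd = lines.map(line -> {
            String[] cols = line.split(",");
            return new Employee(Integer.parseInt(cols[0].trim()),
                                cols[1].trim(),
                                cols[2].trim());
        });

        // Same save call as in the question: write the whole RDD to dev.emp.
        CassandraJavaUtil.javaFunctions(empRdd, Employee.class)
                .saveToCassandra("dev", "emp");

        sc.stop();
    }
}

Because the splitting happens inside map, no intermediate list is ever built on the driver; the file is read and converted in parallel across the cluster. Note that newer connector versions (1.1 and later) replace this save call with javaFunctions(empRdd).writerBuilder("dev", "emp", CassandraJavaUtil.mapToRow(Employee.class)).saveToCassandra().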
