
Java: processing and storing a huge CSV file to Cassandra using Apache Spark / Kafka / Storm

I am working on a requirement where I need to read sensor data from a CSV/TSV file and insert it into a Cassandra database.

CSV Format:

sensor1 timestamp1 value
sensor1 timestamp2 value
sensor2 timestamp1 value
sensor2 timestamp3 value

Details:

A user can upload a file to our web application. Once the file is uploaded, I need to display the unique values from the first column to the user on the next page. For example:

  1. sensor1 node1
  2. sensor2 node2
  3. sensorn create

The user can either map sensor1 to an existing primary key called node1, in which case the timestamps and values for sensor1 are added to the table under the primary key node1, or create a new primary key, in which case the timestamps and values are stored under that new key.
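For illustration, here is a minimal sketch of what that write could look like with the DataStax Java driver, assuming a hypothetical table sensors.readings(node_id, ts, value) with node_id as the partition key (the keyspace, table, and column names are made up for this example, not taken from the question):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import java.time.Instant;

    public class ReadingWriter {
        public static void main(String[] args) {
            // Assumed schema (illustrative only):
            // CREATE TABLE sensors.readings (node_id text, ts timestamp, value double,
            //                                PRIMARY KEY (node_id, ts));
            try (CqlSession session = CqlSession.builder().withKeyspace("sensors").build()) {
                PreparedStatement insert = session.prepare(
                        "INSERT INTO readings (node_id, ts, value) VALUES (?, ?, ?)");

                // The user mapped sensor1 onto the existing key node1, so its
                // timestamp/value rows are written under that partition key.
                session.execute(insert.bind("node1", Instant.parse("2020-01-01T00:00:00Z"), 42.5));
            }
        }
    }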

I was able to implement this using Java 8 streams and collections, and it works for small CSV files.
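For context, a rough sketch of the kind of Java 8 streams approach described above, assuming a tab/whitespace-separated file (the file name and column split are illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    public class UniqueSensors {
        public static void main(String[] args) throws IOException {
            // Stream the file line by line and collect the distinct values of the
            // first column. Fine for small files, but the distinct set (and the
            // uploaded file itself) must be manageable on a single machine.
            try (Stream<String> lines = Files.lines(Paths.get("readings.tsv"))) {
                Set<String> sensors = lines
                        .map(line -> line.split("\\s+")[0])
                        .collect(Collectors.toCollection(LinkedHashSet::new));
                sensors.forEach(System.out::println);
            }
        }
    }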

Question:

  1. How can I upload a huge CSV/TSV file (200 GB) to my web application? Should I upload the file to HDFS and specify the path in the UI? I have also tried splitting the huge file into small chunks (50 MB each).

  2. How can I get the unique values from the first column? Can I use Kafka/Spark here? I also need to insert the timestamps/values into the Cassandra database. Again, can I use Kafka/Spark here?

Any help is highly appreciated.

How can I upload a huge CSV/TSV file (200 GB) to my web application? Should I upload the file to HDFS and specify the path in the UI? I have also tried splitting the huge file into small chunks (50 MB each).

It depends on how your web app is going to be used. Uploading a file of such a huge size in the context of a single HTTP request from client to server is always going to be tricky; you have to do it asynchronously. Whether you put the file in HDFS, S3, or even a simple SFTP server is a design choice, and that choice will affect the kinds of tools you can build around the file. I would suggest starting with something simple like FTP/NAS and, as your need to scale grows, moving to something like S3. (Using HDFS as shared file storage is not something I have seen many people do, but that shouldn't prohibit you from trying.)
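To make the "stage the file first, process it asynchronously" idea concrete, here is a rough sketch assuming a Spring MVC controller and a local NAS staging directory (both are assumptions for illustration; for a 200 GB file you would more likely have the client push directly to S3/SFTP and only submit the resulting path to the web app):

    import org.springframework.web.bind.annotation.PostMapping;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.RestController;
    import org.springframework.web.multipart.MultipartFile;

    import java.io.IOException;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    @RestController
    public class UploadController {

        // Staging area: a NAS mount here; switching to S3/HDFS later only changes
        // where the processing job reads from, not the HTTP handling.
        private static final Path STAGING_DIR = Paths.get("/mnt/staging");
        private final ExecutorService worker = Executors.newSingleThreadExecutor();

        @PostMapping("/upload")
        public String upload(@RequestParam("file") MultipartFile file) throws IOException {
            String uploadId = UUID.randomUUID().toString();
            Path target = STAGING_DIR.resolve(uploadId + ".tsv");
            file.transferTo(target);                  // stream to disk, not into memory
            worker.submit(() -> processFile(target)); // process outside the request thread
            return uploadId;                          // client polls for status with this id
        }

        private void processFile(Path path) {
            // parse the file, extract unique sensors, wait for the user's mapping,
            // then write to Cassandra (see the Spark sketch below)
        }
    }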

How can I get the unique values from the first column? Can I use Kafka/Spark here? I also need to insert the timestamps/values into the Cassandra database. Again, can I use Kafka/Spark here?

A Spark batch job, or even a plain MapReduce job, would do the trick for you. Getting the unique values is just a simple groupBy/distinct operation, though you should really consider how much latency you are willing to accept, since groupBy operations are generally costly (they involve shuffles). From my limited experience, using streaming for a use case like this is overkill unless you receive a continuous stream of source data; the way you have described it, your use case looks more like a batch candidate to me.
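As a sketch of that batch approach, here is roughly what it could look like with the Spark DataFrame API and the spark-cassandra-connector, assuming the same illustrative sensors.readings(node_id, ts, value) table and an HDFS path for the staged file (all of these names are assumptions, and the user-chosen sensor-to-node mapping would be applied before the write):

    import static org.apache.spark.sql.functions.col;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class SensorBatchJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sensor-tsv-to-cassandra")
                    .getOrCreate();

            // Read the staged TSV as a DataFrame; column names are illustrative.
            Dataset<Row> readings = spark.read()
                    .option("sep", "\t")
                    .csv("hdfs:///uploads/readings.tsv")
                    .toDF("sensor", "ts", "value");

            // Unique values of the first column (involves a shuffle, but a small one).
            readings.select("sensor").distinct().show();

            // Write all rows via the spark-cassandra-connector. In the real flow the
            // sensor column would first be replaced by the node id the user chose.
            readings.withColumnRenamed("sensor", "node_id")
                    .withColumn("ts", col("ts").cast("timestamp"))
                    .withColumn("value", col("value").cast("double"))
                    .write()
                    .format("org.apache.spark.sql.cassandra")
                    .option("keyspace", "sensors")
                    .option("table", "readings")
                    .mode(SaveMode.Append)
                    .save();

            spark.stop();
        }
    }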

Some things I would focus on: how the file is transferred from the client app, what the end-to-end SLAs are for availability of the data in Cassandra, what happens when there are failures (do we retry, etc.), and how often the jobs will run (triggered every time a user uploads a file, or as a cron job).
