
Using Spark's MapReduce to call a different function and aggregate

I am woefully unfamiliar with Spark, but I'm pretty sure there is a much faster way to do what I want than how I'm currently doing it.

Essentially, I have an S3 bucket containing lots of JSON files of Twitter data. I want to go through all of these files, grab the text from the JSON, run sentiment analysis on the text (currently using Stanford NLP), and then upload the tweet + sentiment to a database (right now I'm using DynamoDB, but this is not make-or-break).

The code I currently have is:

        /**
         * Per thread:
         * 1. Download a file
         * 2. Do sentiment on the file -> output Map<String, List<Float>>
         * 3. Upload to Dynamo: (a) sentiment (b) number of tweets (c) timestamp
         *
         */

        List<String> keys = s3Connection.getKeys();

        ThreadPoolExecutor threads = new ThreadPoolExecutor(40, 40, 10000, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));
        threads.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());

        for (String key : keys) {
            // Submit the lambda directly; wrapping it in a new Thread is
            // unnecessary -- the executor runs it as a plain Runnable anyway.
            threads.submit(() -> {
                try {
                    S3Object s3Object = s3Connection.getObject(key);
                    Map<String, List<Float>> listOfTweetsWithSentiment = tweetSentimentService.getTweetsFromJsonFile(s3Object.getObjectContent());
                    List<AggregatedTweets> aggregatedTweets = tweetSentimentService.createAggregatedTweetsFromMap(listOfTweetsWithSentiment, key);

                    for (AggregatedTweets aggregatedTweet : aggregatedTweets) {
                        System.out.println(aggregatedTweet);
                        tweetDao.putItem(aggregatedTweet);
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
        }

        // Stop accepting new tasks and let the queued ones drain.
        threads.shutdown();

This works and is fine. I was able to cut the process down to only about 2 hours by running this code on certain date ranges (i.e., getKeys only gets keys for certain date ranges) and spinning up many instances of this process across different EC2 instances, each one acting on a different date range.

However, there's got to be a faster way to do this with good old MapReduce, but I have no idea how to even begin looking into it. Is it possible to do the sentiment analysis in my map step and then reduce based on timestamp?

Further, I was looking into using AWS Glue, but I don't see a good way to use the Stanford NLP library there.

Any and all help would be greatly appreciated.

Yes, you can do it with Apache Spark. There are a lot of ways to design your application, configure the infrastructure, etc. I propose a simple design:

  1. You are on AWS, so create an EMR cluster with Spark. It would be useful to include Zeppelin for interactive debugging.

  2. Spark uses several data abstractions. Your friends are RDDs and Datasets (read the docs about them). The code for reading the data into a Dataset could look like this:

     SparkSession ss = SparkSession.builder().getOrCreate();
     Dataset<Row> dataset = ss.read().json("s3a://your_bucket/your_path");
  3. Now you have a Dataset<Row>. This is useful for SQL-like operations. For your analysis you need to convert it to a Spark RDD:

     JavaRDD<Tweet> analyticRdd = dataset.toJavaRDD()
         .map(row -> TweetsFactory.tweetFromRow(row));
  4. So, with analyticRdd you can do your analysis stuff. Just don't forget to make all the services that work with the data Serializable.
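To make the "map, then reduce by timestamp" idea from the question concrete: the shape of the job is a map (tweet → (timestamp, sentiment)) followed by a reduce by key that averages per timestamp bucket. Here is a minimal plain-Java sketch of just that aggregation logic; the Tweet record, the timestamp buckets, and the sentiment() stub are all made up for illustration (in the real job the sentiment call would be Stanford NLP), and in Spark the same shape would be expressed with mapToPair and reduceByKey on analyticRdd:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SentimentByTimestamp {
    // Hypothetical tweet: a timestamp bucket (e.g. an hour) plus its text.
    public record Tweet(String timestamp, String text) {}

    // Stand-in for the Stanford NLP call: a toy score, not a real model.
    static float sentiment(String text) {
        return text.contains("love") ? 4.0f : 1.0f;
    }

    // Map each tweet to (timestamp, sentiment), then reduce by timestamp
    // into an average score -- the same map/reduce-by-key shape as
    // analyticRdd.mapToPair(...).reduceByKey(...) in Spark.
    public static Map<String, Double> averageByTimestamp(List<Tweet> tweets) {
        return tweets.stream().collect(Collectors.groupingBy(
                Tweet::timestamp,
                TreeMap::new, // deterministic ordering for printing
                Collectors.averagingDouble(t -> sentiment(t.text()))));
    }

    public static void main(String[] args) {
        List<Tweet> tweets = Arrays.asList(
                new Tweet("2020-01-01T10", "I love spark"),
                new Tweet("2020-01-01T10", "meh"),
                new Tweet("2020-01-01T11", "love it"));
        System.out.println(averageByTimestamp(tweets));
    }
}
```

In the Spark version, the final step would typically be a foreachPartition over the reduced pairs so each partition can batch its writes to DynamoDB instead of opening a connection per record.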
