
Using Spark's MapReduce to call a different function and aggregate

I am woefully unfamiliar with Spark, but I'm pretty sure there is a good way to do what I want much faster than I currently am doing it.

Essentially I have an S3 bucket that holds lots of JSON files of Twitter data. I want to go through all of these files, grab the text from the JSON, do sentiment analysis (currently using Stanford NLP) on the text, and then upload the tweet + sentiment to a database (right now I'm using Dynamo, but this is not make-or-break).

The code I currently have is:

        /**
         * Per thread:
         * 1. Download a file
         * 2. Do sentiment on the file -> output Map<String, List<Float>>
         * 3. Upload to Dynamo: (a) sentiment (b) number of tweets (c) timestamp
         *
         */

        List<String> keys = s3Connection.getKeys();

        ThreadPoolExecutor threads = new ThreadPoolExecutor(40, 40, 10000, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(10));
        threads.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());

        for (String key : keys) {
            // Submit the lambda directly; wrapping it in a Thread is unnecessary,
            // since the pool runs it as a plain Runnable anyway.
            threads.submit(() -> {
                try {
                    S3Object s3Object = s3Connection.getObject(key);
                    Map<String, List<Float>> listOfTweetsWithSentiment = tweetSentimentService.getTweetsFromJsonFile(s3Object.getObjectContent());
                    List<AggregatedTweets> aggregatedTweets = tweetSentimentService.createAggregatedTweetsFromMap(listOfTweetsWithSentiment, key);

                    for (AggregatedTweets aggregatedTweet : aggregatedTweets) {
                        System.out.println(aggregatedTweet);
                        tweetDao.putItem(aggregatedTweet);
                    }
                } catch (Exception e) {
                    System.out.println(e.getMessage());
                }
            });
        }

        // Stop accepting new tasks so the pool's threads can wind down
        // once the queue drains and the JVM can exit.
        threads.shutdown();

This works and is fine. And I was able to expedite the process to only about 2 hours by running this code on certain date ranges (i.e. getKeys only gets keys for certain date ranges) and spinning up many instances of this process across different EC2s, each one acting on a different date range.

However, there's gotta be a faster way to do this with a good ole map-reduce, but I just have no idea how to even begin looking into this. Is it possible to do the sentiment analysis in my map step and then reduce based on timestamp?

Further, I was looking into using AWS Glue, but I don't see a good way to use the Stanford NLP library there.

Any and all help would be greatly appreciated.

Yes, you can do it with Apache Spark. There are a lot of ways to design your application, configure infrastructure, etc., so I propose a simple design:

  1. You are on AWS, so create an EMR cluster with Spark. It would also be useful to include Zeppelin for interactive debugging.

  2. Spark uses several data abstractions. Your friends are RDDs and Datasets (read the docs about them). The code for reading data into a Dataset might look like this:

     SparkSession ss = SparkSession.builder().getOrCreate();
     // ss.read() returns a DataFrameReader; pick the format explicitly (the tweets are JSON)
     Dataset<Row> dataset = ss.read().json("s3a://your_bucket/your_path");
  3. Now you have a Dataset<Row>. This is useful for SQL-like operations. For your analysis you need to convert it to a Spark RDD:

     // Convert each Row into a domain object
     JavaRDD<Tweet> analyticRdd = dataset.toJavaRDD().map(TweetsFactory::tweetFromRow);
  4. So, with analyticRdd you can do your analysis stuff. Just don't forget to make all of the services that work with the data Serializable, since Spark ships them to the executors.
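To make the "reduce based on timestamp" part of the question concrete: once each tweet is mapped to a (timestamp bucket, sentiment) pair, the aggregation is a plain reduce-by-key. Below is a minimal, Spark-free sketch of that map/reduce shape in plain Java, so the logic is visible without a cluster; the `Tweet` record, its fields, and the hourly bucketing are hypothetical stand-ins, not part of the original code. In Spark you would express the same shape with `mapToPair` and `reduceByKey` (or `aggregateByKey`) on a `JavaPairRDD`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SentimentAggregation {
    // Hypothetical tweet shape: an epoch-second timestamp plus a sentiment score.
    record Tweet(long epochSeconds, float sentiment) {}

    // "Map" step: assign each tweet to the hour in which it was posted.
    static long hourBucket(Tweet t) {
        return t.epochSeconds() / 3600;
    }

    // "Reduce" step: average sentiment per hour bucket. This is the same shape as
    // mapToPair(t -> (hourBucket(t), t.sentiment())).reduceByKey(...) in Spark.
    static Map<Long, Double> averageSentimentPerHour(List<Tweet> tweets) {
        return tweets.stream().collect(Collectors.groupingBy(
                SentimentAggregation::hourBucket,
                Collectors.averagingDouble(Tweet::sentiment)));
    }

    public static void main(String[] args) {
        List<Tweet> tweets = Arrays.asList(
                new Tweet(0L, 1.0f),     // hour bucket 0
                new Tweet(1800L, 3.0f),  // hour bucket 0
                new Tweet(3700L, 2.0f)); // hour bucket 1
        System.out.println(averageSentimentPerHour(tweets));
    }
}
```

The per-hour average here would be whatever sentiment summary you already compute in `createAggregatedTweetsFromMap`; the point is only that the expensive sentiment call happens once per tweet in the map step, and the reduce step is cheap.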
