
Spark EMR S3 Processing Large Number of Files

I have around 15000 files (ORC) present in S3, where each file contains a few minutes' worth of data and the size of each file varies between 300-700MB.

Since recursively looping through a directory laid out in YYYY/MM/DD/HH24/MIN format is expensive, I am creating a file that contains the list of all S3 files for a given day (objects_list.txt) and passing this file as input to the Spark read API:

import scala.collection.mutable

// Read the pre-built list of S3 object paths bundled on the classpath
val file_list = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/objects_list.txt"))
val paths: mutable.Set[String] = mutable.Set[String]()
for (line <- file_list.getLines()) {
  if (line.length > 0 && line.contains("part"))
    paths.add(line.trim)
}

// Load only the listed ORC part files and expose them as a temp view
val eventsDF = spark.read.format("orc").option("spark.sql.orc.filterPushdown", "true").load(paths.toSeq: _*)
eventsDF.createOrReplaceTempView("events")

The size of the cluster is 10 r3.4xlarge worker machines (each node: 120GB RAM and 16 cores), and the master is of m3.2xlarge config.

The problem I am facing is that the Spark read runs endlessly: I see only the driver working while the rest of the nodes aren't doing anything, and I am not sure why the driver is opening each S3 file for reading, because AFAIK Spark works lazily, so until an action is called no reading should happen. I think it is listing each file and collecting some metadata associated with it.
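
One thing that can keep the driver busy at load time is schema inference: without an explicit schema, the ORC data source may have to open file footers on the driver to work out the schema before any action runs. Below is a minimal, hedged sketch of supplying the schema up front; the field names are hypothetical placeholders, and whether this removes all of the driver-side work in your Spark/EMR version is an assumption, not a guarantee.

// Hedged sketch: supply an explicit schema so Spark does not have to infer
// it from the ORC file footers (field names here are hypothetical)
import org.apache.spark.sql.types._

val eventsSchema = StructType(Seq(
  StructField("event_time", TimestampType),
  StructField("event_type", StringType),
  StructField("user_id", StringType)
  // ... remaining columns of the ~130-column layout
))

val eventsDFWithSchema = spark.read
  .format("orc")
  .schema(eventsSchema)                               // skip schema inference
  .option("spark.sql.orc.filterPushdown", "true")
  .load(paths.toSeq: _*)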

But why is only the driver working while the rest of the nodes aren't doing anything, and how can I make this operation run in parallel on all worker nodes?

I have come across these articles https://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 and https://gist.github.com/snowindy/d438cb5256f9331f5eec, but there the entire file contents are read as an RDD, whereas my use case depends on the columns being referred to: only those blocks/columns of data should be fetched from S3 (columnar access, given ORC is my storage format). The files in S3 have around 130 columns, but only 20 fields are referred to and processed using the DataFrame APIs.
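
For illustration, a hedged sketch of projecting only the referenced fields through the DataFrame API, so ORC column pruning and filter pushdown limit what is actually fetched from S3; the column names and the view name are hypothetical placeholders.

// Hedged sketch: select only the fields that are actually used so the ORC
// reader fetches just those column stripes from S3 (column names below are
// hypothetical placeholders for the ~20 referenced fields)
import org.apache.spark.sql.functions.col

val projected = eventsDF.select(
  col("event_time"), col("event_type"), col("user_id") /* ... remaining used columns */
)
projected.createOrReplaceTempView("events_slim")   // hypothetical view name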

Sample Log Messages:
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=09/min=00/part-r-00199-e4ba7eee-fb98-4d4f-aecc-3f5685ff64a8.zlib.orc' for reading
17/10/08 18:31:15 INFO S3NativeFileSystem: Opening 's3://xxxx/flattenedDataOrc/data=eventsTable/y=2017/m=09/d=20/h=19/min=00/part-r-00023-5e53e661-82ec-4ff1-8f4c-8e9419b2aadc.zlib.orc' for reading

You can see below that only one executor is running, and that is the driver program on one of the task nodes (cluster mode), while CPU is at 0% on the rest of the nodes (i.e. the workers). Even after 3-4 hours of processing the situation is the same, given the huge number of files that have to be processed. (Screenshot: only one executor is active, and it is the driver.)

Any pointers on how I can avoid this issue, i.e. speed up the load and processing?

There is a solution based on AWS Glue that can help you.

You have a lot of files partitioned in your S3, and the partitions are based on timestamp. So using Glue you can use your objects in S3 like "hive tables" in your EMR.

First you need to create an EMR cluster with version 5.8+, and you will be able to see this:

[Screenshot: EMR cluster creation options for using the AWS Glue Data Catalog]

You can set this up by checking both options. This will allow access to the AWS Glue Data Catalog.
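
Once the cluster is created with those Glue Data Catalog options checked, the catalog is exposed to Spark as the Hive metastore, so a Hive-enabled session should be enough to see the Glue databases. A minimal sketch (the app name is arbitrary, and the exact behaviour depends on the EMR release being 5.8+ with the options enabled):

// Hedged sketch: on an EMR 5.8+ cluster created with the Glue Data Catalog
// options checked, enabling Hive support makes the Glue catalog visible
// as the metastore for Spark SQL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("events-with-glue")   // arbitrary app name
  .enableHiveSupport()
  .getOrCreate()

spark.sql("SHOW DATABASES").show()   // should list the Glue databases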

After this you need to add your root folder to the AWS Glue Catalog. The fastest way to do that is by using a Glue Crawler. This tool will crawl your data and create the catalog as you need.

I would suggest you take a look here.

After the crawler runs, the metadata of your table will be in the catalog, which you can see in AWS Athena.

In Athena you can check whether your data was properly identified by the crawler.

This solution will make your Spark job work close to a real HDFS setup, because the metadata will be properly registered in the Data Catalog, and the time your app spends on this "indexing" (listing the files) is reduced, which allows the jobs to run faster.
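
For illustration, a hedged sketch of what reading through the Glue catalog could look like, assuming the crawler registered the table as events_orc in a database called events_db with partition columns y/m/d (all of these names are assumptions based on the path layout in the question):

// Hedged sketch: query the Glue-cataloged table with partition predicates so
// only one day's partitions are listed and read, instead of all 15000 files.
// Database, table and partition column names below are assumptions.
val dayDF = spark.sql(
  """
    |SELECT *
    |FROM events_db.events_orc
    |WHERE y = '2017' AND m = '09' AND d = '20'
  """.stripMargin)

dayDF.createOrReplaceTempView("events")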

Working with this setup I was able to improve my queries, and working with partitions was much better with Glue. So give it a try; it can probably help with performance.
