
Reading CSV file in Spark in a distributed manner

I am developing a Spark processing framework which reads large CSV files, loads them into RDDs, performs some transformations, and at the end saves some statistics.

The CSV files in question are around 50GB on average. I'm using Spark 2.0.

My question is:

When I load the files using the sparkContext.textFile() function, does the file need to be stored in the memory of the driver first and then distributed to the workers (thus requiring a rather large amount of memory on the driver)? Or is the file read "in parallel" by every worker, so that none of them needs to store the whole file and the driver acts only as a "manager"?

Thanks in advance

When you define the reading, the file will be divided into partitions based on your parallelism scheme, and the instructions will be sent to the workers. The file is then read directly by the workers from the filesystem (hence the need for a distributed filesystem available to all the nodes, such as HDFS).
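A minimal Scala sketch of this (the path and partition count below are placeholders, and a SparkSession named spark is assumed to be in scope, as in spark-shell):

// The driver only records a partitioned read plan here; no data moves yet.
// Each executor later reads its own byte ranges directly from the shared
// filesystem (HDFS in this hypothetical path), not via the driver.
val lines = spark.sparkContext.textFile("hdfs:///data/input.csv", 16)

// Inspecting the RDD shows how many splits the read was divided into.
println(s"partitions: ${lines.getNumPartitions}")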

As a side note, it would be much better to read it into a DataFrame using spark.read.csv rather than into an RDD. This would take less memory and would allow Spark to optimize your queries.
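For example, a minimal sketch of the DataFrame route (the path and options are placeholders, again assuming a SparkSession named spark):

// Reading into a DataFrame lets Spark's optimizer prune columns and push
// down filters, so far less data is materialized than with a raw RDD.
val df = spark.read
  .option("header", "true")        // first line holds column names
  .option("inferSchema", "true")   // sample the file to guess column types
  .csv("hdfs:///data/input.csv")

df.printSchema()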

UPDATE

In the comments it was asked what would happen if the file system were not distributed and the file were located on only one machine. The answer is that if you have more than one machine, it will most likely fail.

When you call sparkContext.textFile, nothing is actually read; it just tells Spark WHAT you want to read. Then you apply some transformations to it and still nothing is read, because you are defining a plan. Once you perform an action (e.g. collect), the actual processing begins. Spark divides the job into tasks and sends them to the executors. The executors (which might be on the master node or on worker nodes) then attempt to read portions of the file. The problem is that any executor NOT on the master node will look for the file, fail to find it, and cause its tasks to fail. Spark will retry several times (I believe the default is 4) and then fail completely.
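To make the lazy-evaluation point concrete, here is a small sketch (the path is a placeholder; a SparkSession named spark is assumed):

// Nothing is read here: textFile and filter only build the execution plan.
val lines  = spark.sparkContext.textFile("file:///data/only-on-master.csv")
val errors = lines.filter(_.contains("ERROR"))

// The action triggers real work: tasks are shipped to executors, and each
// executor tries to open its split of the file. Executors on machines that
// cannot see that local path will fail their tasks.
val n = errors.count()
println(s"matching lines: $n")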

Of course, if you have just one node then all executors will see the file and everything will be fine. In theory, the tasks might also fail on a worker, be rerun on the master, and succeed there, but in any case the workers will not do any work unless they can see a copy of the file.

You can solve this by copying the file to exactly the same path on all nodes, or by using any kind of distributed file system (even NFS shares are fine).

Of course you can always work on a single node, but then you would not be taking advantage of Spark's scalability.

CSV (with no gz compression) is splittable, so Spark splits it using the default block size of 128 MB.

This means 50 GB => 50 GB / 128 MB = 50*1024/128 = 400 blocks.

block 0 = range 0 - 134217728

block 1 = range 134217728 - 268435456

etc.
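As a quick sanity check, that split arithmetic can be sketched in plain Scala (the sizes are illustrative, not taken from Spark's internals):

val blockSize = 128L * 1024 * 1024          // 134217728 bytes (128 MB)
val fileSize  = 50L * 1024 * 1024 * 1024    // 50 GB

val numBlocks = math.ceil(fileSize.toDouble / blockSize).toInt   // 400
val ranges = (0 until numBlocks).map { i =>
  val start = i.toLong * blockSize
  val end   = math.min(start + blockSize, fileSize)
  (i, start, end)
}

ranges.take(2).foreach { case (i, s, e) => println(s"block $i = range $s - $e") }
// block 0 = range 0 - 134217728
// block 1 = range 134217728 - 268435456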

Of course, the last line in each text block (class PartitionedFile) will cross the 128 MB boundary, but Spark will read a few more characters until the end of the next line. Spark also ignores the first line of a block when it does not start at pos == 0.

Internally, Spark uses the Hadoop class LineRecordReader (shaded in org.apache.hadoop.shaded.org.apache.hadoop.mapreduce.lib.input.LineRecordReader).

Cf. the source code for positioning at the first line of a block:

// If this is not the first split, we always throw away first record
// because we always (except the last split) read one extra line in
// next() method.
if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
}
