
Partition RDD in Apache Spark such that one partition consists of one file

I am creating a single RDD from two .csv files, like this:

val combineRDD = sc.textFile("D://release//CSVFilesParellel//*.csv")

Then I want to define custom partitioning on this RDD such that one partition must contain exactly one file, so that each partition (i.e. one csv file) is processed on one node for faster data processing.

Is it possible to write a custom partitioner based on the file size, the number of lines in a file, or the end-of-file character of a file?

How do I achieve this?

The structure of one file looks something like this:

00-00

Time(in secs)  Measure1  Measure2  Measure3  .....  Measuren

0
0.25
0.50
0.75
1
...
3600


1. The first row of data contains the hours:mins. Each file contains data for 1 hour, i.e. 3600 seconds.

2. The first column is one second divided into 4 parts of 250 ms each, with data recorded every 250 ms.

  1. For every file I want to add the hours:mins to the seconds, so that my time looks something like hours-mins-secs. But the catch is that I don't want this process to happen sequentially.

  2. I am using a for-each over the file names -> then creating an RDD of the data in each file and adding the time as specified above (see the sketch after this list).

  3. But what I want is for every file to go to one node for processing and calculating the time, as opposed to the data in one file getting distributed across nodes to calculate the time.
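
For reference, the per-file approach described in points 2 and 3 might look roughly like the sketch below. The file list, the HH-MM header parsing, and the output format are assumptions based on the sample layout above, not the asker's actual code.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("per-file-time"))

// Hypothetical list of input files; in practice these would be listed from the directory.
val files = Seq(
  "D://release//CSVFilesParellel//first.csv",
  "D://release//CSVFilesParellel//second.csv")

files.foreach { path =>
  val lines = sc.textFile(path)

  // The first row of each file is the "HH-MM" header, e.g. "00-00".
  val Array(hours, mins) = lines.first().split("-")

  // Skip the two header rows and prefix every seconds value with hours-mins.
  val stamped = lines.zipWithIndex()
    .filter { case (_, idx) => idx >= 2 }
    .map { case (row, _) =>
      val cols = row.split("\\s+")
      (s"$hours-$mins-${cols.head}" +: cols.tail).mkString(" ")
    }

  stamped.saveAsTextFile(path + ".stamped")
}

Because the foreach runs on the driver, the per-file RDDs are still created one after another, which is exactly the sequential behaviour the question wants to avoid.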

Thank you.

Regards,

Vinay Joglekar

Let's go back to basics.

  1. The philosophy of Big Data is to move the processing to the data, not the data to the processing. This increases parallelism and hence I/O throughput.
  2. A partitioner that puts one whole file into one partition will decrease parallelism, not increase it.
  3. The simplest way to achieve this is to use TextInputFormat and have your input files compressed with gzip or LZO (with no LZO indexing done).
  4. Gzip, being non-splittable, will force one file into one partition, but this will in no way increase throughput.

  5. To write a custom input format, extend FileInputFormat and provide your own split logic and RecordReader logic.

To use a custom input format in Spark, please follow:

http://bytepadding.com/big-data/spark/combineparquetfileinputformat/
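
As a rough illustration of point 5, a minimal non-splittable input format could look like the sketch below. The names WholeFileTextInputFormat and perFileRDD are assumptions; it simply reuses the standard LineRecordReader and disables splitting, so that each file becomes exactly one split and therefore one Spark partition.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// One file -> one split -> one Spark partition, because splitting is disabled.
class WholeFileTextInputFormat extends FileInputFormat[LongWritable, Text] {

  // Never split a file, regardless of its size.
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  // Reuse the standard line-oriented record reader; only the split logic changes.
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[LongWritable, Text] =
    new LineRecordReader()
}

// It could then be used from Spark roughly like this:
val perFileRDD = sc.newAPIHadoopFile(
  "D://release//CSVFilesParellel//*.csv",
  classOf[WholeFileTextInputFormat],
  classOf[LongWritable],
  classOf[Text])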

Simple answer, without questioning why you're doing this: load the files separately so you know the filename being loaded.

// create firstRDD, tagging each record with the filename `first.csv`
val firstRDD = sc.textFile("D://release//CSVFilesParellel//first.csv")
    .map(line => new CsvRecord(line, "first.csv"))

// create secondRDD, tagging each record with the filename `second.csv`
val secondRDD = sc.textFile("D://release//CSVFilesParellel//second.csv")
    .map(line => new CsvRecord(line, "second.csv"))

// now create a pair RDD keyed by filename and re-partition on the filename
val partitionRDD = firstRDD.union(secondRDD)
    .map(csvRecord => (csvRecord.filename, csvRecord))
    .partitionBy(customFilenamePartitioner)

The following quote is from here:

To implement your customFilenamePartitioner, you need to subclass the org.apache.spark.Partitioner class and implement three methods:

numPartitions: Int, which returns the number of partitions you will create, as per your requirements.

getPartition(key: Any): Int, which returns the partition ID, ranging from 0 to numPartitions - 1, for a given key.

equals(): the standard Java equality method. This is important to implement because Spark will need to test your Partitioner object against other instances, on its own terms, when it decides whether two of your RDDs are partitioned the same way.
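
A minimal sketch of such a partitioner, assuming the set of filenames is known up front (the class name and constructor are illustrative, not part of the quoted answer):

import org.apache.spark.Partitioner

// Maps each filename key to its own partition ID.
class CustomFilenamePartitioner(filenames: Seq[String]) extends Partitioner {

  private val partitionByFilename: Map[String, Int] = filenames.zipWithIndex.toMap

  // One partition per known filename.
  override def numPartitions: Int = filenames.size

  // Unknown filenames fall back to partition 0.
  override def getPartition(key: Any): Int =
    partitionByFilename.getOrElse(key.asInstanceOf[String], 0)

  // Two partitioners are equal if they map the same filenames the same way.
  override def equals(other: Any): Boolean = other match {
    case p: CustomFilenamePartitioner => p.partitionByFilename == partitionByFilename
    case _                            => false
  }

  override def hashCode(): Int = partitionByFilename.hashCode()
}

It could then be plugged into the snippet above as .partitionBy(new CustomFilenamePartitioner(Seq("first.csv", "second.csv"))), which would keep all records from one file in a single partition.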

Keep in mind that a repartition will most likely trigger an expensive shuffle, so unless you are going to repeatedly query this newly partitioned RDD, you may be better off solving your problem another way.
