
Write to multiple outputs by key Spark - one Spark job

How can you write to multiple outputs dependent on the key using Spark in a single job?

Related: Write to multiple outputs by key Scalding Hadoop, one MapReduce Job

E.g.

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
.writeAsMultiple(prefix, compressionCodecOption)

would ensure cat prefix/1 is

a
b

and cat prefix/2 would be

c

EDIT: I've recently added a new answer that includes full imports, pimp and compression codec, see https://stackoverflow.com/a/46118044/1586965, which may be helpful in addition to the earlier answers.

If you use Spark 1.4+, this has become much, much easier thanks to the DataFrame API. (DataFrames were introduced in Spark 1.3, but partitionBy(), which we need, was introduced in 1.4.)

If you're starting out with an RDD, you'll first need to convert it to a DataFrame:

val people_rdd = sc.parallelize(Seq((1, "alice"), (1, "bob"), (2, "charlie")))
val people_df = people_rdd.toDF("number", "name")

In Python, this same code is:

people_rdd = sc.parallelize([(1, "alice"), (1, "bob"), (2, "charlie")])
people_df = people_rdd.toDF(["number", "name"])

Once you have a DataFrame, writing to multiple outputs based on a particular key is simple. What's more -- and this is the beauty of the DataFrame API -- the code is pretty much the same across Python, Scala, Java and R:

people_df.write.partitionBy("number").text("people")

And you can easily use other output formats if you want:

people_df.write.partitionBy("number").json("people-json")
people_df.write.partitionBy("number").parquet("people-parquet")

In each of these examples, Spark will create a subdirectory for each of the keys that we've partitioned the DataFrame on:

people/
  _SUCCESS
  number=1/
    part-abcd
    part-efgh
  number=2/
    part-abcd
    part-efgh
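
As a quick read-back check (my addition, not part of the original answer), Spark's partition discovery recovers the key from those number=... directory names as a regular column. A minimal sketch, assuming Spark 2.x with a SparkSession named spark:

// Read-back sketch: the partition column "number" is rebuilt from the directory names
import spark.implicits._
val people = spark.read.text("people")   // columns: value, number
people.filter($"number" === 1).show()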

I would do it like this, which is scalable:

import org.apache.hadoop.io.NullWritable

import org.apache.spark._
import org.apache.spark.SparkContext._

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = 
    NullWritable.get()

  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String = 
    key.asInstanceOf[String]
}

object Split {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Split" + args(1))
    val sc = new SparkContext(conf)
    sc.textFile("input/path")
    .map(a => (k, v)) // Your own implementation
    .partitionBy(new HashPartitioner(num))
    .saveAsHadoopFile("output/path", classOf[String], classOf[String],
      classOf[RDDMultipleTextOutputFormat])
    sc.stop()
  }
}

Just saw a similar answer above, but actually we don't need custom partitioners. MultipleTextOutputFormat will create a file for each key. It is OK if multiple records with the same key fall into the same partition.

Use new HashPartitioner(num), where num is the number of partitions you want. If you have a large number of distinct keys, you can set num to a large value, so that each partition does not end up opening too many HDFS file handles.
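
As a rough sketch of how num might be chosen (my own addition, not part of the answer above, and reusing the RDDMultipleTextOutputFormat class defined there), you could count the distinct keys first and cap the result:

import org.apache.spark.HashPartitioner

// Sketch only: derive num from the distinct key count, with an assumed cap of 200
// partitions so no single partition has to juggle too many open writers.
// Splitting each line on '\t' to get the key is just a placeholder.
val keyed = sc.textFile("input/path").map(line => (line.split("\t")(0), line))
val num   = math.max(1, math.min(keyed.keys.distinct().count().toInt, 200))
keyed
  .partitionBy(new HashPartitioner(num))
  .saveAsHadoopFile("output/path", classOf[String], classOf[String],
    classOf[RDDMultipleTextOutputFormat])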

If you potentially have many values for a given key, I think the scalable solution is to write out one file per key per partition. Unfortunately there is no built-in support for this in Spark, but we can whip something up.

sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
  .mapPartitionsWithIndex { (p, it) =>
    val outputs = new MultiWriter(p.toString)
    for ((k, v) <- it) {
      outputs.write(k.toString, v)
    }
    outputs.close
    Nil.iterator
  }
  .foreach((x: Nothing) => ()) // To trigger the job.

// This one is Local, but you could write one for HDFS
class MultiWriter(suffix: String) {
  private val writers = collection.mutable.Map[String, java.io.PrintWriter]()
  def write(key: String, value: Any) = {
    if (!writers.contains(key)) {
      val f = new java.io.File("output/" + key + "/" + suffix)
      f.getParentFile.mkdirs
      writers(key) = new java.io.PrintWriter(f)
    }
    writers(key).println(value)
  }
  def close = writers.values.foreach(_.close)
}

(Replace PrintWriter with your choice of distributed filesystem operation.)

This makes a single pass over the RDD and performs no shuffle. It gives you one directory per key, with a number of files inside each.
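
As a rough illustration of that swap (my own sketch, not part of the answer), an HDFS-backed MultiWriter might look like this, assuming the Hadoop configuration is resolvable on the executors:

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: one HDFS file per key per partition, mirroring the local version above
class HdfsMultiWriter(baseDir: String, suffix: String) {
  private val fs = FileSystem.get(new Configuration())
  private val writers = collection.mutable.Map[String, PrintWriter]()
  def write(key: String, value: Any): Unit = {
    val w = writers.getOrElseUpdate(key,
      new PrintWriter(fs.create(new Path(s"$baseDir/$key/$suffix"))))
    w.println(value)
  }
  def close(): Unit = writers.values.foreach(_.close())
}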

This includes the codec as requested, the necessary imports, and the pimp as requested.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// TODO Need a macro to generate for each Tuple length, or perhaps can use shapeless
implicit class PimpedRDD[T1, T2](rdd: RDD[(T1, T2)]) {
  def writeAsMultiple(prefix: String, codec: String,
                      keyName: String = "key")
                     (implicit sqlContext: SQLContext): Unit = {
    import sqlContext.implicits._

    rdd.toDF(keyName, "_2").write.partitionBy(keyName)
    .format("text").option("codec", codec).save(prefix)
  }
}

val myRdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c")))
myRdd.writeAsMultiple("prefix", "org.apache.hadoop.io.compress.GzipCodec")

One subtle difference from the OP's request is that it will prefix <keyName>= to the directory names. E.g.

myRdd.writeAsMultiple("prefix", "org.apache.hadoop.io.compress.GzipCodec")

Would give:

prefix/key=1/part-00000
prefix/key=2/part-00000

where prefix/key=1/part-00000 would contain the lines a and b, and prefix/key=2/part-00000 would contain the line c.

And

myRdd.writeAsMultiple("prefix", "org.apache.hadoop.io.compress.GzipCodec", "foo")

Would give:

prefix/foo=1/part-00000
prefix/foo=2/part-00000

It should be clear how to adapt this for parquet.
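
For instance, the parquet variant of writeAsMultiple would presumably just swap the sink (a sketch, untested; note that parquet compression is configured differently, e.g. via the "compression" option, so the codec string above would not carry over as-is):

// Sketch: the body of a parquet flavour of writeAsMultiple
rdd.toDF(keyName, "_2").write.partitionBy(keyName).parquet(prefix)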

Finally, below is an example for Dataset, which is perhaps nicer than using tuples.

import org.apache.spark.sql.Dataset

implicit class PimpedDataset[T](dataset: Dataset[T]) {
  def writeAsMultiple(prefix: String, codec: String, field: String): Unit = {
    dataset.write.partitionBy(field)
    .format("text").option("codec", codec).save(prefix)
  }
}
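
A hypothetical usage of that Dataset variant might look like the following (the Record case class and the "prefix-ds" path are illustrative, not from the answer; it assumes a SparkSession named spark and the implicit class above in scope):

case class Record(key: Int, value: String)
import spark.implicits._

val ds = Seq(Record(1, "a"), Record(1, "b"), Record(2, "c")).toDS()
ds.writeAsMultiple("prefix-ds", "org.apache.hadoop.io.compress.GzipCodec", "key")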

I had a similar need and found a way. But it has one drawback (which is not a problem in my case): you need to re-partition your data with one partition per output file.

To partition in this way you generally need to know beforehand how many files the job will output, and a function that will map each key to a partition.

First let's create our MultipleTextOutputFormat-based class:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput[T >: Null, V <: AnyRef] extends MultipleTextOutputFormat[T , V] {
  override def generateFileNameForKeyValue(key: T, value: V, leaf: String) = {
    key.toString
  }
  override protected def generateActualKey(key: T, value: V) = {
    null
  }
}

With this class Spark will get a key from a partition (the first/last, I guess) and name the file with this key, so it's not good to mix multiple keys on the same partition.

For your example, you will require a custom partitioner. This will do the job:

import org.apache.spark.Partitioner

class IdentityIntPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions = maxKey

  def getPartition(key: Any): Int = key match {
    case i: Int if i < maxKey => i
  }
}
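
One caveat (my note, not the answerer's): the match above is partial, so an unexpected key would throw a MatchError at runtime. A sketch of a more defensive variant, at the cost of mixing any stray keys into the last partition:

import org.apache.spark.Partitioner

// Sketch only: route out-of-range keys to the last partition instead of failing
class SafeIdentityIntPartitioner(maxKey: Int) extends Partitioner {
  def numPartitions: Int = maxKey
  def getPartition(key: Any): Int = key match {
    case i: Int if i >= 0 && i < maxKey => i
    case _                              => maxKey - 1
  }
}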

Now let's put everything together:

val rdd = sc.makeRDD(Seq((1, "a"), (1, "b"), (2, "c"), (7, "d"), (7, "e")))

// You need to know the max number of partitions (files) beforehand
// In this case we want one partition per key and we have 3 keys,
// with the biggest key being 7, so 10 will be large enough
val partitioner = new IdentityIntPartitioner(10)

val prefix = "hdfs://.../prefix"

val partitionedRDD = rdd.partitionBy(partitioner)

partitionedRDD.saveAsHadoopFile(prefix,
    classOf[Integer], classOf[String], classOf[KeyBasedOutput[Integer, String]])

This will generate 3 files under prefix (named 1, 2 and 7), processing everything in one pass.

As you can see, you need some knowledge about your keys to be able to use this solution.

For me it was easier because I needed one output file for each key hash and the number of files was under my control, so I could use the stock HashPartitioner to do the trick.
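
A sketch of that hash-bucket trick (my own reconstruction, not the answerer's code; the bucket count of 16 is an assumption), reusing rdd, prefix and the KeyBasedOutput class from above:

import org.apache.spark.HashPartitioner

// Sketch only: the bucket id becomes both the partition and the output file name,
// and the original key is kept inside each line.
val numBuckets = 16
val bucketed = rdd.map { case (k, v) =>
  val bucket = ((k.hashCode % numBuckets) + numBuckets) % numBuckets
  (bucket, s"$k\t$v")
}
bucketed
  .partitionBy(new HashPartitioner(numBuckets))
  .saveAsHadoopFile(prefix, classOf[Integer], classOf[String],
    classOf[KeyBasedOutput[Integer, String]])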

I needed the same thing in Java. Posting my translation of Zhang Zhan's Scala answer for Spark Java API users:

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;


class RDDMultipleTextOutputFormat<A, B> extends MultipleTextOutputFormat<A, B> {

    @Override
    protected String generateFileNameForKeyValue(A key, B value, String name) {
        return key.toString();
    }
}

public class Main {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Split Job")
                .setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        String[] strings = {"Abcd", "Azlksd", "whhd", "wasc", "aDxa"};
        sc.parallelize(Arrays.asList(strings))
                // The first character of the string is the key
                .mapToPair(s -> new Tuple2<>(s.substring(0,1).toLowerCase(), s))
                .saveAsHadoopFile("output/", String.class, String.class,
                        RDDMultipleTextOutputFormat.class);
        sc.stop();
    }
}

saveAsTextFile() and saveAsHadoopFile(...) are implemented on top of the RDD data, specifically by the method PairRDDFunctions.saveAsHadoopDataset, which takes the data from the PairRDD where it's executed. I see two possible options: If your data is relatively small in size, you could save some implementation time by grouping the RDD by key, creating a new RDD from each collection and using that RDD to write the data. Something like this:

val byKey = dataRDD.groupByKey().collect()
val rddByKey = byKey.map { case (k, v) => k -> sc.makeRDD(v.toSeq) }
rddByKey.foreach { case (k, rdd) => rdd.saveAsTextFile(prefix + k) }

Note that it will not work for large datasets, because the materialization of the iterable at v.toSeq might not fit in memory.
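
An alternative sketch (mine, not the answerer's), under the assumption that the number of distinct keys is small: collect only the keys and run one filtered save per key. This launches one job per key, but it avoids materialising whole groups in memory:

// Sketch only: reasonable for a handful of keys
val keys = dataRDD.keys.distinct().collect()
keys.foreach { k =>
  dataRDD.filter { case (key, _) => key == k }
         .values
         .saveAsTextFile(prefix + k)
}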

The other option I see, and actually the one I'd recommend in this case, is to roll your own by directly calling the Hadoop/HDFS API.

Here's a discussion I started while researching this question: How to create RDDs from another RDD?

I had a similar use case where I split the input file on Hadoop HDFS into multiple files based on a key (one file per key). Here is my Scala code for Spark:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

val hadoopconf = new Configuration();
val fs = FileSystem.get(hadoopconf);

object processGroup extends Serializable {
    def apply(groupName:String, records:Iterable[String]): Unit = {
        val outFileStream = fs.create(new Path("/output_dir/"+groupName))
        for( line <- records ) {
                outFileStream.writeUTF(line+"\n")
            }
        outFileStream.close()
    }
}
val infile = sc.textFile("input_file")
val dateGrouped = infile.groupBy( _.split(",")(0))
dateGrouped.foreach( (x) => processGroup(x._1, x._2))

I have grouped the records based on the key. The values for each key are written to a separate file.

Good news for Python users: if you have multiple columns and you want to save all the other, non-partition columns in CSV format, this will fail if you use the "text" method as in Nick Chammas' suggestion.

people_df.write.partitionBy("number").text("people") 

The error message is "AnalysisException: u'Text data source supports only a single column, and you have 2 columns.;'"

In Spark 2.0.0 (my test environment is HDP's Spark 2.0.0) the package "com.databricks.spark.csv" is now integrated, and it allows us to save a text file partitioned by just one column; see the example below:

people_rdd = sc.parallelize([(1,"2016-12-26", "alice"),
                             (1,"2016-12-25", "alice"),
                             (1,"2016-12-25", "tom"), 
                             (1, "2016-12-25","bob"), 
                             (2,"2016-12-26" ,"charlie")])
df = people_rdd.toDF(["number", "date","name"])

(df.coalesce(1)
   .write.partitionBy("number")
   .mode("overwrite")
   .format('com.databricks.spark.csv')
   .options(header='false')
   .save("people"))

[root@namenode people]# tree
.
├── number=1
│   └── part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
├── number=2
│   └── part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
└── _SUCCESS

[root@namenode people]# cat number\=1/part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
2016-12-26,alice
2016-12-25,alice
2016-12-25,tom
2016-12-25,bob
[root@namenode people]# cat number\=2/part-r-00000-6bd1b9a8-4092-474a-9ca7-1479a98126c2.csv
2016-12-26,charlie

In my Spark 1.6.1 environment the code didn't throw any error; however, only one file is generated and it's not partitioned into two folders.

Hope this can help.

I had a similar use case. I resolved it in Java by writing two custom classes implementing MultipleTextOutputFormat and RecordWriter.

My input was a JavaPairRDD<String, List<String>> and I wanted to store it in a file named by its key, with all the lines contained in its value.

Here is the code for my MultipleTextOutputFormat implementation:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;
import org.apache.hadoop.util.Progressable;

class RDDMultipleTextOutputFormat<K, V> extends MultipleTextOutputFormat<K, V> {

    @Override
    protected String generateFileNameForKeyValue(K key, V value, String name) {
        return key.toString(); //The return will be used as file name
    }

    /** The following 4 functions are only for visibility purposes                 
    (they are used in the class MyRecordWriter) **/
    protected String generateLeafFileName(String name) {
        return super.generateLeafFileName(name);
    }

    protected V generateActualValue(K key, V value) {
        return super.generateActualValue(key, value);
    }

    protected String getInputFileBasedOutputFileName(JobConf job, String name) {
        return super.getInputFileBasedOutputFileName(job, name);
    }

    protected RecordWriter<K, V> getBaseRecordWriter(FileSystem fs, JobConf job, String name, Progressable arg3) throws IOException {
        return super.getBaseRecordWriter(fs, job, name, arg3);
    }

    /** Use my custom RecordWriter **/
    @Override
    public RecordWriter<K, V> getRecordWriter(final FileSystem fs, final JobConf job, String name, final Progressable arg3) throws IOException {
        final String myName = this.generateLeafFileName(name);
        return new MyRecordWriter<K, V>(this, fs, job, arg3, myName);
    }
} 

Here is the code for my RecordWriter implementation:

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.TreeMap;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

class MyRecordWriter<K, V> implements RecordWriter<K, V> {

    private RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat;
    private final FileSystem fs;
    private final JobConf job;
    private final Progressable arg3;
    private String myName;

    TreeMap<String, RecordWriter<K, V>> recordWriters = new TreeMap<>();

    MyRecordWriter(RDDMultipleTextOutputFormat<K, V> rddMultipleTextOutputFormat, FileSystem fs, JobConf job, Progressable arg3, String myName) {
        this.rddMultipleTextOutputFormat = rddMultipleTextOutputFormat;
        this.fs = fs;
        this.job = job;
        this.arg3 = arg3;
        this.myName = myName;
    }

    @Override
    public void write(K key, V value) throws IOException {
        String keyBasedPath = rddMultipleTextOutputFormat.generateFileNameForKeyValue(key, value, myName);
        String finalPath = rddMultipleTextOutputFormat.getInputFileBasedOutputFileName(job, keyBasedPath);
        Object actualValue = rddMultipleTextOutputFormat.generateActualValue(key, value);
        RecordWriter rw = this.recordWriters.get(finalPath);
        if(rw == null) {
            rw = rddMultipleTextOutputFormat.getBaseRecordWriter(fs, job, finalPath, arg3);
            this.recordWriters.put(finalPath, rw);
        }
        List<String> lines = (List<String>) actualValue;
        for (String line : lines) {
            rw.write(null, line);
        }
    }

    @Override
    public void close(Reporter reporter) throws IOException {
        Iterator keys = this.recordWriters.keySet().iterator();

        while(keys.hasNext()) {
            RecordWriter rw = (RecordWriter)this.recordWriters.get(keys.next());
            rw.close(reporter);
        }

        this.recordWriters.clear();
    }
}

Most of the code is exactly the same as in FileOutputFormat. The only difference is these few lines:

List<String> lines = (List<String>) actualValue;
for (String line : lines) {
    rw.write(null, line);
}

These lines allowed me to write each line of my input List<String> to the file. The first argument of the write function is set to null in order to avoid writing the key on each line.

To finish, I only need to make this call to write my files:

javaPairRDD.saveAsHadoopFile(path, String.class, List.class, RDDMultipleTextOutputFormat.class);
