
Is there a way to add extra metadata for Spark dataframes?

Is it possible to add extra metadata to DataFrames?

Reason

I have Spark DataFrames for which I need to keep extra information. Example: a DataFrame for which I want to "remember" the highest used index in an Integer id column.

Current solution

I use a separate DataFrame to store this information. Of course, keeping this information separately is tedious and error-prone.
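For illustration, the kind of separate bookkeeping described here might look roughly like this (a sketch with made-up column names, assuming a spark-shell session where the toDF implicits are in scope):

// Hypothetical sketch of the current approach: track the highest id by hand.
val data = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val maxId = data.agg(org.apache.spark.sql.functions.max("id")).head.getInt(0)

// A one-row "metadata" DataFrame that has to be kept in sync manually.
val idStats = Seq(("maxId", maxId)).toDF("key", "value")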

Is there a better solution to store such extra information on DataFrames?

To expand on and Scala-fy nealmcb's answer (the question was tagged scala, not python, so I don't think this answer will be off-topic or redundant), suppose you have a DataFrame:

import org.apache.spark.sql
val df = sc.parallelize(Seq.fill(100) { scala.util.Random.nextInt() }).toDF("randInt")

And some way to get the max or whatever you want to memoize on the DataFrame:

val randIntMax = df.rdd.map { case sql.Row(randInt: Int) => randInt }.reduce(math.max)

sql.types.Metadata can only hold strings, booleans, some types of numbers, and other metadata structures. So we have to use a Long:

val metadata = new sql.types.MetadataBuilder().putLong("columnMax", randIntMax).build()

DataFrame.withColumn() actually has an overload that permits supplying a metadata argument at the end, but it's inexplicably marked [private], so we just do what it does and use Column.as(alias, metadata):

val newColumn = df.col("randInt").as("randInt_withMax", metadata)
val dfWithMax = df.withColumn("randInt_withMax", newColumn)

dfWithMax now has (a column with) the metadata you want!

dfWithMax.schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
> randInt: metadata={}
> randInt_withMax: metadata={"columnMax":2094414111}

Or programmatically and type-safely (sort of; Metadata.getLong() and others do not return Option and may throw a "key not found" exception):

dfWithMax.schema("randInt_withMax").metadata.getLong("columnMax")
> res29: Long = 209341992
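If you'd rather not risk the exception, a small helper can wrap the lookup in an Option (a sketch, using Metadata.contains to check the key first; metadataLong is just a made-up name):

def metadataLong(field: sql.types.StructField, key: String): Option[Long] =
  if (field.metadata.contains(key)) Some(field.metadata.getLong(key)) else None

metadataLong(dfWithMax.schema("randInt_withMax"), "columnMax")   // Some(<stored max>)
metadataLong(dfWithMax.schema("randInt"), "columnMax")           // None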

Attaching the max to a column makes sense in your case, but in the general case of attaching metadata to a DataFrame rather than to a particular column, it appears you'd have to take the wrapper route described by the other answers.

As of Spark 1.2, StructType schemas have a metadata attribute which can hold an arbitrary mapping / dictionary of information for each Column in a DataFrame. E.g. (when used with the separate spark-csv library):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

customSchema = StructType([
  StructField("cat_id", IntegerType(), True,
    {'description': "Unique id, primary key"}),
  StructField("cat_title", StringType(), True,
    {'description': "Name of the category, with underscores"}) ])

categoryDumpDF = (sqlContext.read.format('com.databricks.spark.csv')
 .options(header='false')
 .load(csvFilename, schema = customSchema) )

f = categoryDumpDF.schema.fields
["%s (%s): %s" % (t.name, t.dataType, t.metadata) for t in f]

["cat_id (IntegerType): {u'description': u'Unique id, primary key'}",
 "cat_title (StringType): {u'description': u'Name of the category, with underscores.'}"]

This was added in [SPARK-3569] Add metadata field to StructField - ASF JIRA, and designed for use in Machine Learning pipelines to track information about the features stored in columns, like categorical/continuous, the number of categories, and category-to-index maps. See the SPARK-3569: Add metadata field to StructField design document.
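For reference, the same kind of column-level metadata can be declared from Scala when building a schema by hand; this is just a sketch mirroring the Python example above (catSchema is a hypothetical name):

import org.apache.spark.sql.types._

// Attach a human-readable description to each column at schema-definition time.
val catSchema = StructType(Seq(
  StructField("cat_id", IntegerType, nullable = true,
    new MetadataBuilder().putString("description", "Unique id, primary key").build()),
  StructField("cat_title", StringType, nullable = true,
    new MetadataBuilder().putString("description", "Name of the category, with underscores").build())
))

catSchema.fields.foreach(f => println(s"${f.name} (${f.dataType}): ${f.metadata}"))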

I'd like to see this used more widely, e.g. for descriptions and documentation of columns, the unit of measurement used in the column, coordinate axis information, etc.

Issues include how to appropriately preserve or manipulate the metadata information when the column is transformed, how to handle multiple sorts of metadata, how to make it all extensible, etc.

For the benefit of those thinking of expanding this functionality in Spark dataframes, I reference some analogous discussions around Pandas.

For example, see xray - bring the labeled data power of pandas to the physical sciences, which supports metadata for labeled arrays.

And see the discussion of metadata for Pandas at Allow custom metadata to be attached to panel/df/series? · Issue #2485 · pydata/pandas.

See also the discussion related to units: ENH: unit of measurement / physical quantities · Issue #10349 · pydata/pandas.

If you want to have less tedious work, I think you can add an implicit conversion between DataFrame and your custom wrapper (haven't tested it yet though).

implicit class WrappedDataFrame(val df: DataFrame) {
  var metadata = scala.collection.mutable.Map[String, Long]()

  def addToMetaData(key: String, value: Long) {
    metadata += key -> value
  }

  // ...[other methods you consider useful, getters, setters, whatever]...
}

If the implicit wrapper is in the DataFrame's scope, you can just use a normal DataFrame as if it were your wrapper, i.e.:

df.addToMetaData("size", 100)

This way also makes your metadata mutable, so you are not forced to compute it only once and carry it around.

I would store a wrapper around your dataframe. For example:

case class MyDFWrapper(dataFrame: DataFrame, metadata: Map[String, Long])
val maxIndex = df1.agg("index" ->"MAX").head.getLong(0)
MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
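Reading the value back is then just a map lookup on the wrapper (hypothetical usage of the case class above):

val wrapped = MyDFWrapper(df1, Map("maxIndex" -> maxIndex))
wrapped.metadata("maxIndex")   // the remembered max
wrapped.dataFrame.count()      // the underlying DataFrame is still available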

A lot of people saw the word "metadata" and went straight to "column metadata". This does not seem to be what you wanted, and it was not what I wanted when I had a similar problem. Ultimately, the problem here is that a DataFrame is an immutable data structure: whenever an operation is performed on it, the data passes on but the rest of the DataFrame does not. This means that you can't simply put a wrapper on it, because as soon as you perform an operation you've got a whole new DataFrame (potentially of a completely new type, especially with Scala/Spark's tendencies toward implicit conversions). Finally, if the DataFrame ever escapes its wrapper, there's no way to reconstruct the metadata from the DataFrame.

I had this problem in Spark Streaming, which focuses on RDDs (the underlying data structure of the DataFrame as well), and came to one simple conclusion: the only place to store the metadata is in the name of the RDD. An RDD name is never used by the core Spark system except for reporting, so it's safe to repurpose it. Then, you can create your wrapper based on the RDD name, with an explicit conversion between any DataFrame and your wrapper, complete with metadata.

Unfortunately, this still leaves you with the problem of immutability and new RDDs being created with every operation. The RDD name (our metadata field) is lost with each new RDD. That means you need a way to re-add the name to your new RDD. This can be solved by providing a method that takes a function as an argument. It extracts the metadata before the function, calls the function to get the new RDD/DataFrame, and then names the result with the metadata:

def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
  val meta = df.rdd.name        // extract the metadata (stored as the RDD name)
  val result = fn(df)           // apply the caller's transformation to the wrapped DataFrame
  result.rdd.setName(meta)      // carry the name over to the resulting RDD
  MetaDataFrame(result)
}

Your wrapping class (MetaDataFrame) can provide convenience methods for parsing and setting metadata values, as well as implicit conversions back and forth between Spark DataFrame and MetaDataFrame. As long as you run all your mutations through the withMetadata method, your metadata will carry through your entire transformation pipeline. Using this method for every call is a bit of a hassle, yes, but the simple reality is that there is not a first-class metadata concept in Spark.
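A minimal sketch of such a wrapper, assuming (as above) that the name set on df.rdd sticks around and that encoding the metadata as "key=value" pairs in that name is acceptable; all names here are hypothetical:

import scala.language.implicitConversions
import org.apache.spark.sql.DataFrame

// Sketch: metadata lives in the RDD name as "key=value" pairs separated by ';'.
case class MetaDataFrame(df: DataFrame) {
  private def parsed: Map[String, String] =
    Option(df.rdd.name).filter(_.nonEmpty).toSeq
      .flatMap(_.split(";"))
      .map(_.split("=", 2))
      .collect { case Array(k, v) => k -> v }
      .toMap

  def getMeta(key: String): Option[String] = parsed.get(key)

  def setMeta(key: String, value: String): MetaDataFrame = {
    df.rdd.setName((parsed + (key -> value)).map { case (k, v) => s"$k=$v" }.mkString(";"))
    this
  }

  // Run a transformation and carry the name (and thus the metadata) onto the result.
  def withMetadata(fn: DataFrame => DataFrame): MetaDataFrame = {
    val meta = df.rdd.name
    val result = fn(df)
    result.rdd.setName(meta)
    MetaDataFrame(result)
  }
}

object MetaDataFrame {
  // Implicit conversions back and forth, so a MetaDataFrame can be used where a DataFrame is expected.
  implicit def toDataFrame(mdf: MetaDataFrame): DataFrame = mdf.df
  implicit def fromDataFrame(df: DataFrame): MetaDataFrame = MetaDataFrame(df)
}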
