
Spark Scala Dataframe - replace/join column values with values from another dataframe (but is transposed)

I have a table with ~300 columns filled with characters (stored as String):

valuesDF:

| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| U             | C               | ...
| U             | E               | ...
| I             | B               | ...
| C             | U               | ...
| ...           | ...             | ...

I have a Data Summary, which maps the characters onto their actual meaning. It is in this form:

summaryDF:

| Field            | Value | ValueDesc     |
|------------------|-------|---------------|
|  FavouriteBeer   |   U   |  Unknown      |
|  FavouriteBeer   |   C   |  Carlsberg    |
|  FavouriteBeer   |   I   |  InnisAndGunn |
|  FavouriteBeer   |   D   |  DoomBar      |
|  FavouriteCheese |   C   |  Cheddar      |
|  FavouriteCheese |   E   |  Emmental     |
|  FavouriteCheese |   B   |  Brie         |
|  FavouriteCheese |   U   |  Unknown      |
|  ...             |  ...  |    ...        |

I want to programmatically replace the character values of each column in valuesDF with the Value Descriptions from summaryDF. This is the result I'm looking for:

finalDF:

| FavouriteBeer | FavouriteCheese | ...
|---------------|-----------------|--------
| Unknown       | Cheddar         | ...
| Unknown       | Emmental        | ...
| InnisAndGunn  | Brie            | ...
| Carlsberg     | Unknown         | ...
| ...           | ...             | ...

As there are ~300 columns, I'm not keen to type out withColumn methods for each one.

Unfortunately I'm a bit of a novice when it comes to programming for Spark, although I've picked up enough to get by over the last 2 months.


What I'm pretty sure I need to do is something along the lines of:

  1. valuesDF.columns.foreach { col => ...... } to iterate over each column
  2. Filter summaryDF on Field using the col String value
  3. Left join summaryDF onto valuesDF based on the current column
  4. withColumn to replace the original character code column from valuesDF with the new description column
  5. Assign the new DF as a var
  6. Continue loop

However, trying this gave me a Cartesian product error (I made sure to define the join as "left").

I tried and failed to pivot summaryDF (as there are no aggregations to do??) and then join both dataframes together.

This is the sort of thing I've tried, and I always get a NullPointerException. I know this is really not the right way to do this, and can see why I'm getting Null Pointer... but I'm really stuck and reverting back to old, silly & bad Python habits in desperation.

var valuesDF = sourceDF
// I converted summaryDF to a broadcasted RDD
// because it's small and a "constant" lookup table
summaryBroadcast
 .value
 .foreach{ x =>

   // searchValue = Value (e.g. `U`), 
   // replaceValue = ValueDescription (e.g. `Unknown`), 

   val field = x(0).toString
   val searchValue = x(1).toString
   val replaceValue = x(2).toString

   // error catching as summary data does not map exactly onto field names
   // the joys of business people working in Excel...
   try {
     // I'm using regexp_replace because I'm lazy
     valuesDF = valuesDF
       .withColumn( field, regexp_replace(col(field), searchValue, replaceValue ))
   }
   catch {case _: Exception =>
     null
   }
}

Any ideas? Advice? Thanks.

First, we'll need a function that joins valuesDF with summaryDF, matching Value against the code stored in the given Favourite* column and Field against that column's name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named `spark`

private def joinByColumn(colName: String, sourceDf: DataFrame): DataFrame = {
  sourceDf.as("src") // alias it to help selecting appropriate columns in the result
          // left join: keep every source row, match on the code and on the column's name
          .join(summaryDF, $"Value" === col(colName) && $"Field" === colName, "left")
          // we do not need the original `Favourite*` column, so drop it
          .drop(colName)
          // select all remaining source columns, plus the one that contains the match
          .select("src.*", "ValueDesc")
          // rename the resulting column to have the name of the source one
          .withColumnRenamed("ValueDesc", colName)
}

Now, to produce the target result we can iterate over the names of the columns to match:

val result = Seq("FavouriteBeer", 
                 "FavouriteCheese").foldLeft(valuesDF) { 
                    case(df, colName) => joinByColumn(colName, df) 
                 }

result.show()
+-------------+---------------+
|FavouriteBeer|FavouriteCheese|
+-------------+---------------+
|      Unknown|        Cheddar|
|      Unknown|       Emmental|
| InnisAndGunn|           Brie|
|    Carlsberg|        Unknown|
+-------------+---------------+
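
With ~300 columns the list of names does not have to be typed out. The following is a minimal, self-contained sketch of the same fold, assuming the column names in valuesDF match the Field values in summaryDF exactly and that valuesDF has no columns called Field, Value or ValueDesc. The object name DecodeByJoin, the helper decodeAll and the extra summaryDF parameter are introduced here for illustration only; they are not part of the answer above. Columns without any entry in summaryDF are left untouched.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object DecodeByJoin {

  // Same join as joinByColumn above, with summaryDF passed in explicitly (assumption).
  def joinByColumn(colName: String, sourceDf: DataFrame, summaryDF: DataFrame): DataFrame =
    sourceDf.as("src")
      .join(summaryDF, col("Value") === col(colName) && col("Field") === colName, "left")
      .drop(colName)
      .select("src.*", "ValueDesc")
      .withColumnRenamed("ValueDesc", colName)

  // Hypothetical helper: decode every column of valuesDF that is described in summaryDF.
  def decodeAll(valuesDF: DataFrame, summaryDF: DataFrame): DataFrame = {
    // Field names that actually occur in the summary (small, so collecting is cheap).
    val described: Set[String] =
      summaryDF.select("Field").distinct().collect().map(_.getString(0)).toSet

    val decoded = valuesDF.columns
      .filter(described.contains)
      .foldLeft(valuesDF) { case (df, colName) => joinByColumn(colName, df, summaryDF) }

    // Each join moves the decoded column to the end, so restore the original order.
    decoded.select(valuesDF.columns.map(col): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("decode-by-join").getOrCreate()
    import spark.implicits._

    val valuesDF = Seq(("U", "C"), ("U", "E"), ("I", "B"), ("C", "U"))
      .toDF("FavouriteBeer", "FavouriteCheese")

    val summaryDF = Seq(
      ("FavouriteBeer", "U", "Unknown"),
      ("FavouriteBeer", "C", "Carlsberg"),
      ("FavouriteBeer", "I", "InnisAndGunn"),
      ("FavouriteBeer", "D", "DoomBar"),
      ("FavouriteCheese", "C", "Cheddar"),
      ("FavouriteCheese", "E", "Emmental"),
      ("FavouriteCheese", "B", "Brie"),
      ("FavouriteCheese", "U", "Unknown")
    ).toDF("Field", "Value", "ValueDesc")

    decodeAll(valuesDF, summaryDF).show()
    spark.stop()
  }
}
```

Note that this chains one join per decoded column, so with hundreds of columns the query plan may become large; the row order of the result is also not guaranteed to match the input.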

If a value in valuesDF does not match anything in summaryDF, the resulting cell in this solution will contain null. If you would rather replace it with Unknown, use the following instead of the .select and .withColumnRenamed lines above:

.withColumn(colName, when($"ValueDesc".isNotNull, $"ValueDesc").otherwise(lit("Unknown")))
.select("src.*", colName)
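
Since the summary table is small, the same replacement with an Unknown fallback can also be sketched without any joins: collect the summary on the driver and build one nested when/otherwise expression per column. This is a different technique from the join used above, not the answer's method, and it assumes the summary fits comfortably in driver memory. The names DecodeByLookup, toLookup, decode and decodeAll are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object DecodeByLookup {

  // Collect the (small) summary table into Field -> (Value -> ValueDesc).
  def toLookup(summaryDF: DataFrame): Map[String, Map[String, String]] =
    summaryDF.select("Field", "Value", "ValueDesc").collect()
      .filter(row => !row.anyNull) // skip incomplete summary rows
      .groupBy(_.getString(0))
      .map { case (field, rows) =>
        field -> rows.map(r => r.getString(1) -> r.getString(2)).toMap
      }

  // Nested when/otherwise: a cell equal to a known code gets its description,
  // anything else (including null) becomes `default`.
  def decode(colName: String, codes: Map[String, String], default: String): Column =
    codes.foldLeft(lit(default)) { case (acc, (code, desc)) =>
      when(col(colName) === code, desc).otherwise(acc)
    }.as(colName)

  // Decode every column that has entries in the summary; pass the others through.
  def decodeAll(valuesDF: DataFrame, summaryDF: DataFrame, default: String = "Unknown"): DataFrame = {
    val lookup = toLookup(summaryDF)
    val projected = valuesDF.columns.map { c =>
      lookup.get(c).map(codes => decode(c, codes, default)).getOrElse(col(c))
    }
    valuesDF.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("decode-by-lookup").getOrCreate()
    import spark.implicits._

    // "X" has no entry in the summary, so it falls back to the default.
    val valuesDF = Seq(("U", "C"), ("X", "E"), ("I", "B"))
      .toDF("FavouriteBeer", "FavouriteCheese")

    val summaryDF = Seq(
      ("FavouriteBeer", "U", "Unknown"),
      ("FavouriteBeer", "I", "InnisAndGunn"),
      ("FavouriteCheese", "C", "Cheddar"),
      ("FavouriteCheese", "E", "Emmental"),
      ("FavouriteCheese", "B", "Brie")
    ).toDF("Field", "Value", "ValueDesc")

    decodeAll(valuesDF, summaryDF).show()
    spark.stop()
  }
}
```

The whole replacement is a single select, so the original column order is kept and no join is planned; the trade-off is that the lookup is baked into the query as literal expressions rather than read from summaryDF at execution time.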
