About how to add a new column to an existing DataFrame with random values in Scala

I have a DataFrame loaded from a Parquet file, and I have to add a new column with some random data, but the random values must differ from each other. This is my actual code, and the current version of Spark is 1.5.1-cdh-5.5.2:

val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686 
mydf.cache

val r = scala.util.Random
import org.apache.spark.sql.functions.udf
def myNextPositiveNumber :String = { (r.nextInt(Integer.MAX_VALUE) + 1 ).toString.concat("D")}
val myFunction = udf(myNextPositiveNumber _)
val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))

With this code, I get this data:

scala> myNewDF.select("myNewColumn").show(10,false)
+-----------+
|myNewColumn|
+-----------+
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
|889488717D |
+-----------+

It looks like the UDF myNextPositiveNumber is invoked only once, doesn't it?

Update: confirmed, there is only one distinct value:

scala> myNewDF.select("myNewColumn").distinct.show(50,false)
17/02/21 13:23:11 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
...

+-----------+                                                                   
|myNewColumn|
+-----------+
|889488717D |
+-----------+

What am I doing wrong?

Update 2: finally, with the help of @user6910411, I have this code:

val mydf = sqlContext.read.parquet("some.parquet")
// mydf.count()
// 63385686 
mydf.cache

val r = scala.util.Random

import org.apache.spark.sql.functions.udf

val accum = sc.accumulator(1)

def myNextPositiveNumber():String = {
   accum+=1
   accum.value.toString.concat("D")
}

val myFunction = udf(myNextPositiveNumber _)

val myNewDF = mydf.withColumn("myNewColumn",lit(myNextPositiveNumber))

myNewDF.select("myNewColumn").count

// 63385686

Update 3

The actual code generates data like this:

scala> mydf.select("myNewColumn").show(5,false)
17/02/22 11:01:57 WARN ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+
|myNewColumn|
+-----------+
|2D         |
|2D         |
|2D         |
|2D         |
|2D         |
+-----------+
only showing top 5 rows

It looks like the UDF function is invoked only once, doesn't it? I need a new random element in that column.

Update 4 @user6910411

I have this actual code that increases the id, but it is not concatenating the final char, which is weird. This is my code:

import org.apache.spark.sql.functions.udf


val mydf = sqlContext.read.parquet("some.parquet")

mydf.cache

def myNextPositiveNumber():String = monotonically_increasing_id().toString().concat("D")

val myFunction = udf(myNextPositiveNumber _)

val myNewDF = mydf.withColumn("myNewColumn",expr(myNextPositiveNumber))

scala> myNewDF.select("myNewColumn").show(5,false)
17/02/22 12:00:02 WARN Executor: 1 block locks were not released by TID = 1:
[rdd_4_0]
+-----------+
|myNewColumn|
+-----------+
|0          |
|1          |
|2          |
|3          |
|4          |
+-----------+

I need something like:

+-----------+
|myNewColumn|
+-----------+
|1D         |
|2D         |
|3D         |
|4D         |
+-----------+

Spark >= 2.3

It is possible to disable some optimizations using the asNondeterministic method:

import org.apache.spark.sql.expressions.UserDefinedFunction

val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic

Please make sure you understand the guarantees before using this option.
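For instance, a minimal sketch (the UDF name is illustrative, not from the question) of a nondeterministic UDF producing the suffixed random strings asked about:

import org.apache.spark.sql.functions.udf

// Mark the UDF nondeterministic so Spark does not fold it into a constant
// and instead re-evaluates it for every row.
val randomIdUdf = udf(() => (scala.util.Random.nextInt(Integer.MAX_VALUE) + 1).toString + "D").asNondeterministic()

val myNewDF = mydf.withColumn("myNewColumn", randomIdUdf())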

Spark < 2.3

A function passed to udf should be deterministic (with the possible exception of SPARK-20586), and nullary function calls can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:

  • rand - Generate a random column with independent and identically distributed (iid) samples from U[0.0, 1.0].
  • randn - Generate a column with independent and identically distributed (iid) samples from the standard normal distribution.

and transform the output to obtain the required distribution, for example:

(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
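
Put together, a sketch (assuming mydf and the column name from the question) that yields values in the requested format:

import org.apache.spark.sql.functions.{concat, lit, rand}

// rand() gives a fresh U[0.0, 1.0] sample per row; scale it, truncate to a
// positive integer, cast to string, and append the "D" suffix.
val myNewDF = mydf.withColumn(
  "myNewColumn",
  concat((rand() * Integer.MAX_VALUE).cast("bigint").cast("string"), lit("D"))
)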

You can make use of monotonically_increasing_id to generate unique values.

Then you can define a UDF to append any string to it after casting it to String, since monotonically_increasing_id returns a Long by default.

scala> var df = Seq(("Ron"), ("John"), ("Steve"), ("Brawn"), ("Rock"), ("Rick")).toDF("names")
+-----+
|names|
+-----+
|  Ron|
| John|
|Steve|
|Brawn|
| Rock|
| Rick|
+-----+

scala> val appendD = spark.sqlContext.udf.register("appendD", (s: String) => s.concat("D"))

scala> df = df.withColumn("ID",monotonically_increasing_id).selectExpr("names","cast(ID as String) ID").withColumn("ID",appendD($"ID"))
+-----+---+
|names| ID|
+-----+---+
|  Ron| 0D|
| John| 1D|
|Steve| 2D|
|Brawn| 3D|
| Rock| 4D|
| Rick| 5D|
+-----+---+
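
Note that monotonically_increasing_id guarantees unique, monotonically increasing values, but not consecutive ones: the generated ID encodes the partition number in its upper bits, so on a multi-partition DataFrame the values jump between partitions. The consecutive 0-5 sequence above comes from a single-partition example.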
