
spark sql cast function creates column with NULLS

I have the following dataframe and schema in Spark:

val df = spark.read.options(Map("header"-> "true")).csv("path")

scala> df.show()

+-------+-------+-----+
|   user|  topic| hits|
+-------+-------+-----+
|     om|  scala|  120|
| daniel|  spark|   80|
|3754978|  spark|    1|
+-------+-------+-----+

scala> df printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)

I want to change the column hits to integer.

I tried this:

scala> df.createOrReplaceTempView("test")

scala> val dfNew = spark.sql("select *, cast('hist' as integer) as hist2 from test")

scala> dfNew.printSchema

root
 |-- user: string (nullable = true)
 |--  topic: string (nullable = true)
 |--  hits: string (nullable = true)
 |-- hist2: integer (nullable = true)

but when I print the dataframe, the column hist2 is filled with NULLs:

scala> dfNew.show()

+-------+-------+-----+-----+
|   user|  topic| hits|hist2|
+-------+-------+-----+-----+
|     om|  scala|  120| null|
| daniel|  spark|   80| null|
|3754978|  spark|    1| null|
+-------+-------+-----+-----+
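(A note on why every value comes out null, which the question itself doesn't state: in `cast('hist' as integer)` the single quotes make `'hist'` a SQL *string literal*, not a column reference, so every row casts the same non-numeric word "hist", and Spark's default cast maps a failed string-to-int conversion to null rather than raising an error. A plain-Scala model of that behavior, with `castToInt` as a hypothetical helper name:)

```scala
// Models Spark's default (non-ANSI) string-to-int cast: a value that
// cannot be parsed becomes null (None here) instead of failing.
object LiteralCastNote {
  def castToInt(s: String): Option[Int] =
    scala.util.Try(s.trim.toInt).toOption
}
```

So `cast(hits as int)` with no quotes is what the query needed, although the column-resolution errors shown further down would still get in the way.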

I also tried this:

scala> val df2 = df.withColumn("hitsTmp",
df.hits.cast(IntegerType)).drop("hits"
).withColumnRenamed("hitsTmp", "hits")

and got this:

<console>:26: error: value hits is not a member of org.apache.spark.sql.DataFrame

I also tried this:

scala> val df2 = df.selectExpr ("user","topic","cast(hits as int) hits")

and got this:

org.apache.spark.sql.AnalysisException: cannot resolve '`topic`' given input columns: [user,  topic,  hits]; line 1 pos 0;
'Project [user#0, 'topic, cast('hits as int) AS hits#22]
+- Relation[user#0, topic#1, hits#2] csv

with

scala> val df2 = df.selectExpr("cast(hits as int) hits")

I get a similar error.

Any help will be appreciated. I know this question has been addressed before, but I tried three different approaches (shown here) and none of them works.

Thanks.

You can cast a column to Integer type in the following ways:

df.withColumn("hits", df("hits").cast("integer"))

Or

data.withColumn("hitsTmp",
      data("hits").cast(IntegerType)).drop("hits").
      withColumnRenamed("hitsTmp", "hits")

Or

data.selectExpr("user", "topic", "cast(hits as int) hits")
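One additional observation (my reading of the question's error output, not part of this answer): the AnalysisException lists the input columns as `[user,  topic,  hits]`, which suggests the CSV header row contains leading spaces, so `topic` and `hits` don't resolve by their trimmed names. In Spark you could rename every column in one step with `df.toDF(df.columns.map(_.trim): _*)`; the renaming logic itself is plain Scala:

```scala
// Trim whitespace from each column name; the cleaned sequence can be
// passed to df.toDF(cleaned: _*) to rename all columns at once.
object TrimHeaders {
  def trimAll(columns: Seq[String]): Seq[String] = columns.map(_.trim)
}
```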

This response is late, but I faced the same issue and this worked, so I thought I'd put it here; it might help someone. Try declaring the schema explicitly as a StructType. Reading from CSV files and relying on a schema inferred via a case class gives weird errors for data types, even when all the data formats are properly specified.

I know that this answer probably won't be useful for the OP, since it comes about two years late. It might, however, be helpful for someone facing this problem.

Just like you, I had a dataframe with a column of strings which I was trying to cast to integer:

> df.show
+-------+
|     id|
+-------+
|4918088|
|4918111|
|4918154|
   ...

> df.printSchema
root
 |-- id: string (nullable = true)

But after doing the cast to IntegerType, the only thing I obtained, just as you did, was a column of nulls:

> df.withColumn("test", $"id".cast(IntegerType))
    .select("id","test")
    .show
+-------+----+
|     id|test|
+-------+----+
|4918088|null|
|4918111|null|
|4918154|null|
      ...

By default, if you try to cast a string that contains non-numeric characters to integer, the cast won't fail, but those values will be set to null, as you can see in the following example:

> val testDf = sc.parallelize(Seq(("1"), ("2"), ("3A") )).toDF("n_str")
> testDf.withColumn("n_int", $"n_str".cast(IntegerType))
        .select("n_str","n_int")
        .show
+-----+-----+
|n_str|n_int|
+-----+-----+
|    1|    1|
|    2|    2|
|   3A| null|
+-----+-----+
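If you want to find the offending values rather than silently nulling them, in Spark you could filter for rows where the cast is null (e.g. `testDf.filter($"n_str".cast(IntegerType).isNull)`). The idea can be sketched in plain Scala with a hypothetical helper:

```scala
// Collect the raw strings that would become null under an int cast,
// i.e. those that fail to parse as integers.
object BadValues {
  def nonCastable(values: Seq[String]): Seq[String] =
    values.filter(v => scala.util.Try(v.trim.toInt).isFailure)
}
```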

The thing with our initial dataframe is that, at first sight, when we use the show method, we can't see any non-numeric characters. However, if you retrieve a row from your dataframe, you'll see something different:

> df.first
org.apache.spark.sql.Row = [4?9?1?8?0?8?8??]

Why is this happening? You are probably reading a csv file with an unsupported encoding.

You can solve this by changing the encoding of the file you are reading. If that is not an option, you can also clean each column before doing the type cast. An example:

> val df_cast = df.withColumn("test", regexp_replace($"id", "[^0-9]","").cast(IntegerType))
                  .select("id","test")
> df_cast.show
+-------+-------+
|     id|   test|
+-------+-------+
|4918088|4918088|
|4918111|4918111|
|4918154|4918154|
       ...

> df_cast.printSchema
root
 |-- id: string (nullable = true)
 |-- test: integer (nullable = true)
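The `[^0-9]` pattern passed to `regexp_replace` is ordinary Java regex syntax, so the cleaning step can be checked outside Spark with `String.replaceAll`. In the sketch below, `?` stands in for the unknown non-ASCII bytes shown in the garbled row above (a hypothetical stand-in, since the real characters depend on the file's encoding):

```scala
// Same transformation as regexp_replace($"id", "[^0-9]", ""):
// strip everything that is not a digit.
object CleanDigits {
  def digitsOnly(s: String): String = s.replaceAll("[^0-9]", "")
}
```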
