
Using a custom UDF withColumn in a Spark Dataset&lt;Row&gt;; java.lang.String cannot be cast to org.apache.spark.sql.Row

I have a JSON file containing many fields. I read the file using Spark's Dataset in Java.

  • Spark version 2.2.0

  • Java JDK 1.8.0_121

Below is the code.

SparkSession spark = SparkSession
              .builder()
              .appName("Java Spark SQL basic example")
              .config("spark.some.config.option", "some-value")
              .master("local")
              .getOrCreate();

Dataset<Row> df = spark.read().json("jsonfile.json");

I would like to use the withColumn function with a custom UDF to add a new column.

UDF1 someudf = new UDF1<Row,String>(){
        public String call(Row fin) throws Exception{
            String some_str = fin.getAs("String");
            return some_str;
        }
    };
spark.udf().register( "some_udf", someudf, DataTypes.StringType );
df.withColumn( "procs", callUDF( "some_udf", col("columnx") ) ).show();

I get a cast error when I run the above code: java.lang.String cannot be cast to org.apache.spark.sql.Row

Questions:

1 - Is reading into a Dataset of Rows the only option? I can convert the df into a Dataset of Strings, but then I will not be able to select fields.

2 - I tried but failed to define a user-defined data type; I was not able to register the UDF with this custom UDDatatype. Do I need user-defined data types here?

3 - And the main question: how can I cast from String to Row?

Part of the log is copied below:

Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$27: (string) => string)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.spark.sql.Row
    at Risks.readcsv$1.call(readcsv.java:1)
    at org.apache.spark.sql.UDFRegistration$$anonfun$27.apply(UDFRegistration.scala:512)
        ... 16 more

Your help will be greatly appreciated.

You are getting that exception because the UDF executes on the column's data type, which is not Row. Consider we have a Dataset&lt;Row&gt; ds which has two columns, col1 and col2, both of String type. Now suppose we want to convert the value of col2 to uppercase using a UDF.

We can register and call the UDF like below.

spark.udf().register("toUpper", toUpper, DataTypes.StringType);
ds.select(col("*"), callUDF("toUpper", col("col2"))).show();

Or using withColumn:

ds.withColumn("Upper",callUDF("toUpper", col("col2"))).show();

And the UDF should be like below.

private static UDF1<String, String> toUpper = new UDF1<String, String>() {
    public String call(final String str) throws Exception {
        return str.toUpperCase();
    }
};
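For completeness, here is a minimal, self-contained sketch that puts the pieces above together. It assumes the input is the question's jsonfile.json with a string column named col2 (the class name ToUpperExample is hypothetical); adjust the file and column names to your data.

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class ToUpperExample {
    // The UDF receives the column's value (a String here), never a Row.
    private static UDF1<String, String> toUpper = new UDF1<String, String>() {
        public String call(final String str) throws Exception {
            return str == null ? null : str.toUpperCase();
        }
    };

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("UDF example")
                .master("local")
                .getOrCreate();

        // Assumed input file from the question; any JSON with a string column col2 works.
        Dataset<Row> ds = spark.read().json("jsonfile.json");

        spark.udf().register("toUpper", toUpper, DataTypes.StringType);
        ds.withColumn("Upper", callUDF("toUpper", col("col2"))).show();
    }
}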

Improving on what @abaghel wrote: if you use the following import

import org.apache.spark.sql.functions;

then, using withColumn, the code should be as follows:

ds.withColumn("Upper",functions.callUDF("toUpper", ds.col("col2"))).show();
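As for question 3: you do not cast a String to a Row. If the UDF genuinely needs a whole Row (for example, to read several fields at once), a common pattern is to wrap the columns in a struct with functions.struct; Spark then passes that struct to the UDF as a Row. A minimal sketch, assuming the same ds with string columns col1 and col2 (the UDF name row_udf is hypothetical):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.struct;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

// This UDF really does take a Row, because its input column is a struct.
UDF1<Row, String> rowUdf = new UDF1<Row, String>() {
    public String call(Row r) throws Exception {
        // Field names inside the struct match the wrapped column names.
        String a = r.getAs("col1");
        String b = r.getAs("col2");
        return a + "-" + b;
    }
};

spark.udf().register("row_udf", rowUdf, DataTypes.StringType);
ds.withColumn("combined", callUDF("row_udf", struct(col("col1"), col("col2")))).show();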
