
Programmatically infer schema to prepare spark dataframe Dataset&lt;Row&gt; from RDD&lt;Row&gt; when some Row objects may contain a different number of elements

I am fetching neo4j node information into a spark RDD using the neo4j-spark connector. I can obtain an RDD<Row> by calling the loadNodeRdds() method. But when I try to obtain a dataframe by calling the loadDataframe() method, it throws an exception (skip the stack trace if you find it too long, as the main question may turn out to be different in the end):

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.Collections$UnmodifiableMap is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true) AS Condition#4
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true)
      +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType)
         +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition)
            +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
               +- input[0, org.apache.spark.sql.Row, true]

(skipped many rows, as they made the question reach its character limit)

I was not able to get much from the big stack trace above.

So I took the JavaRDD<Row> and tried converting it to a Dataset<Row> by programmatically specifying a StructType schema:

StructType schema = loadSchema(); // builds the StructType programmatically
Dataset<Row> df = ss.createDataFrame(neo4jJavaRdd, schema);

This threw a somewhat similar exception.

So what I did is that I took the individual properties of a single neo4j node, prepared a Row from them, then a JavaRDD<Row>, and then tried to create a dataframe from it by programmatically specifying the schema as follows:

Row row1 = RowFactory.create("val1", " val2", "val3", "val4", "val5", "val6", 152214d, "val7", 152206d, 11160d, "val8");
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("attr1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr3", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr4", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr5", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr6", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attrd1", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attr7", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attrd2", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attrd3", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attr8", DataTypes.StringType, true));
// build the schema and create a single-row dataframe from it
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = ss.createDataFrame(Collections.singletonList(row1), schema);

This worked.

So I checked all the nodes and realized that not all nodes (that is, not all Rows in the JavaRDD<Row>) have the same number of attributes. This must be causing the dataframe preparation to fail (see the illustration below). Can I handle this programmatically in some way, without needing to create and specify a POJO?
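To illustrate the mismatch with hypothetical values: two nodes with different property sets produce Rows of different arity, and the shorter Row no longer matches a fixed schema.

Row full    = RowFactory.create("v1", "v2", "v3"); // node with all 3 properties
Row partial = RowFactory.create("v1", "v2");       // node missing one property
// createDataFrame(rdd, threeFieldSchema) fails as soon as `partial` is
// validated against the 3-field StructType.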

If you would like to get this done using RDDs, as you mentioned, do as follows:

  • Before trying to transform the (RDD + schema) into a dataframe, go over the RDD (using the map function) and make sure each row has all the relevant attributes.
  • If an attribute is not present in a row, add it and set it to null.

After that, your RDD rows will all have the same schema, and the transformation to a dataframe will work, as in the sketch below.
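A minimal sketch of that normalization in Java, assuming each input Row carries its properties as a single Map column (which is what RETURN properties(n) produces) and, for brevity, that every attribute is a string; the attribute names here are hypothetical:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Union of all property names across all nodes (hypothetical names).
List<String> attrs = Arrays.asList("attr1", "attr2", "attr3");

// Pad every Row to the full attribute list, inserting null for
// properties a node does not have.
JavaRDD<Row> normalized = neo4jJavaRdd.map(row -> {
    Map<String, Object> props = row.getJavaMap(0);
    Object[] values = new Object[attrs.size()];
    for (int i = 0; i < attrs.size(); i++) {
        values[i] = props.get(attrs.get(i)); // null when the property is absent
    }
    return RowFactory.create(values);
});

// Every Row now has the same arity, so a fixed schema applies cleanly.
StructType schema = new StructType();
for (String attr : attrs) {
    schema = schema.add(attr, DataTypes.StringType, true);
}
Dataset<Row> df = ss.createDataFrame(normalized, schema);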

There are certain things that I came to realize while working with the neo4j-spark-connector that I want to share here.

  1. In general, if you are going to prepare a dataframe, it is preferable not to return neo4j's object types, specifically nodes and relationships. That is, returning the node itself, as below, is not preferable:

     MATCH(n {id:'xyz'}) RETURN n 

    Instead, return its properties:

     MATCH(n {id:'xyz'}) RETURN properties(n) 
  2. If you are unsure whether all nodes have the same number of properties, it is better to return them explicitly rather than returning properties(n) and obtaining a JavaRDD, since the latter requires processing the JavaRDD again to add NULL for non-existent properties. That is, instead of doing this:

     MATCH(n {id:'xyz'}) RETURN properties(n) 

    return them in this way:

     MATCH(n {id:'xyz'}) RETURN n.prop1 AS prop1, n.prop2 AS prop2, ..., n.propN AS propN 

    Neo4j will itself add NULLs for non-existent properties, as can be seen in the image below, so we don't have to iterate over the rows again. By returning properties this way, I was able to obtain neo4j node information directly using the loadDataframe() method (see the sketch after this list).

    [Image: query result showing NULL returned for properties a node does not have]
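A minimal end-to-end sketch of that call; the neo4j handle stands for a configured connector instance, and its construction and the exact chaining are version-dependent assumptions, with loadDataframe() named as in the question:

Dataset<Row> df = neo4j
        .cypher("MATCH (n {id:'xyz'}) RETURN n.prop1 AS prop1, n.prop2 AS prop2")
        .loadDataframe(); // missing properties arrive as NULL; schema is inferred
df.printSchema();
df.show();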
