
Programmatically infer schema to prepare spark dataframe Dataset&lt;Row&gt; from RDD&lt;Row&gt; when some Row objects may contain a different number of elements

I am fetching neo4j node information into a spark RDD using the neo4j-spark connector. I can obtain an RDD<Row> by calling the loadNodeRdds() method. But when I try to obtain a dataframe by calling the loadDataframe() method, it throws an exception (skip the stack trace if you find it too long, as the main question may turn out to be different in the end):

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.util.Collections$UnmodifiableMap is not a valid external type for schema of string
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true) AS Condition#4
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true)
   :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
   :  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
   :  :  +- input[0, org.apache.spark.sql.Row, true]
   :  +- 0
   :- null
   +- staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType), true)
      +- validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition), StringType)
         +- getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, Condition)
            +- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
               +- input[0, org.apache.spark.sql.Row, true]

(skipped many rows, as they made the question reach its character limit)

I was not able to get much from the big stack trace above.

So I took the JavaRDD<Row> and tried converting it to a Dataset<Row> by programmatically specifying a StructType schema:

StructType schema = loadSchema(); // builds the StructType programmatically
Dataset<Row> df = ss.createDataFrame(neo4jJavaRdd, schema);

This threw a somewhat similar exception.

So what I did is that I took the individual properties of a single neo4j node, prepared a Row from them, then a JavaRDD<Row>, and then tried to create a dataframe from it by programmatically specifying the schema as follows:

Row row1 = RowFactory.create("val1", " val2", "val3", "val4", "val5", "val6", 152214d, "val7", 152206d, 11160d, "val8");
List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("attr1", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr2", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr3", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr4", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr5", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attr6", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attrd1", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attr7", DataTypes.StringType, true));
fields.add(DataTypes.createStructField("attrd2", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attrd3", DataTypes.DoubleType, true));
fields.add(DataTypes.createStructField("attr8", DataTypes.StringType, true));
// build the schema and create a single-row dataframe from it
StructType schema = DataTypes.createStructType(fields);
Dataset<Row> df = ss.createDataFrame(Collections.singletonList(row1), schema);

This worked.

So I checked all the nodes and realized that not all nodes (that is, not all Rows in the JavaRDD<Row>) have the same number of attributes. This must be causing the dataframe preparation to fail (see the illustration below). Can I handle this programmatically in some way, without needing to create and specify a POJO?
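To illustrate the mismatch with hypothetical values: two nodes with different property sets produce Rows of different arity, and the shorter Row no longer matches a fixed schema.

Row full    = RowFactory.create("v1", "v2", "v3"); // node with all 3 properties
Row partial = RowFactory.create("v1", "v2");       // node missing one property
// createDataFrame(rdd, threeFieldSchema) fails as soon as `partial` is
// validated against the 3-field StructType.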

If you would like to get this done using RDDs, as you mentioned, do as follows:

  • Before trying to transform the (RDD + schema) into a dataframe, go over the RDD (using the map function) and make sure each row has all the relevant attributes.
  • If an attribute is not present in a row, add it and set it to null.

After that, your RDD rows will all have the same schema, and the transformation to a dataframe will work, as in the sketch below.
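A minimal sketch of that normalization in Java, assuming each input Row carries its properties as a single Map column (which is what RETURN properties(n) produces) and, for brevity, that every attribute is a string; the attribute names here are hypothetical:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Union of all property names across all nodes (hypothetical names).
List<String> attrs = Arrays.asList("attr1", "attr2", "attr3");

// Pad every Row to the full attribute list, inserting null for
// properties a node does not have.
JavaRDD<Row> normalized = neo4jJavaRdd.map(row -> {
    Map<String, Object> props = row.getJavaMap(0);
    Object[] values = new Object[attrs.size()];
    for (int i = 0; i < attrs.size(); i++) {
        values[i] = props.get(attrs.get(i)); // null when the property is absent
    }
    return RowFactory.create(values);
});

// Every Row now has the same arity, so a fixed schema applies cleanly.
StructType schema = new StructType();
for (String attr : attrs) {
    schema = schema.add(attr, DataTypes.StringType, true);
}
Dataset<Row> df = ss.createDataFrame(normalized, schema);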

There are certain things that I came to realize while working with the neo4j-spark-connector that I want to share here.

  1. In general, if you are going to prepare a dataframe, it is preferable not to return neo4j's object types, specifically nodes and relationships. That is, returning the node itself, as below, is not preferable:

     MATCH(n {id:'xyz'}) RETURN n 

    Instead, return its properties:

     MATCH(n {id:'xyz'}) RETURN properties(n) 
  2. If you are unsure whether all nodes have the same number of properties, it is better to return them explicitly rather than returning properties(n) and obtaining a JavaRDD, since the latter requires processing the JavaRDD again to add NULL for non-existent properties. That is, instead of doing this:

     MATCH(n {id:'xyz'}) RETURN properties(n) 

    return them in this way:

     MATCH(n {id:'xyz'}) RETURN n.prop1 AS prop1, n.prop2 AS prop2, ..., n.propN AS propN 

    Neo4j will itself add NULLs for non-existent properties, as can be seen in the image below, so we don't have to iterate over the rows again. By returning properties this way, I was able to obtain neo4j node information directly using the loadDataframe() method (see the sketch after this list).

    [Image: query result showing NULL returned for properties a node does not have]
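A minimal end-to-end sketch of that call; the neo4j handle stands for a configured connector instance, and its construction and the exact chaining are version-dependent assumptions, with loadDataframe() named as in the question:

Dataset<Row> df = neo4j
        .cypher("MATCH (n {id:'xyz'}) RETURN n.prop1 AS prop1, n.prop2 AS prop2")
        .loadDataframe(); // missing properties arrive as NULL; schema is inferred
df.printSchema();
df.show();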
