
Resolve Array of Structs column stored as a string Spark

I have the below Array of Structs stored as a string column in my input file. The input file cannot be modified because it is sent from an upstream team.

[{"col1":"B078Z655KG","col2":2,"col3":351,"col4":"kindle_edition","col6":1,"transaction_info":[]},{"col1":"0736973540","col2":1,"col3":14,"col4":"paperback","col5":1,"col6":[]}]

My Code:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

val ds = df.select(col("faceout_features"))

val schema = StructType(Seq(
  StructField("col1", CatalystSqlParser.parseDataType("string")),
  StructField("col2", CatalystSqlParser.parseDataType("string"))
))

val dfFromCSVJSON = df.select(col("col1"), from_json(col("faceout_features"), schema).as("jsonData"))
  .select("col1", "jsonData.*")

I am getting the output columns as null when trying to parse the column as JSON, and the input record is processed as a corrupt record instead. Any help in debugging the issue with the file format is appreciated.

Error:

Found at least one malformed records (sample: "[{""col1"":""B078Z655KG"")

The error message shows you one of the records that cannot be parsed. The general approach to debugging this is to take that record and try to parse it outside of Spark. You could paste it into a text editor that supports JSON and see if it highlights any issues. Once you find the problem, be sure to check whether any other rows have the same issue.
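If you want the exact parse error rather than an editor highlight, a minimal sketch is to feed the suspect record to Jackson, the JSON library Spark bundles. The record string below is a hypothetical completion of the truncated sample from the error message, not the real row:

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.core.JsonProcessingException

// Hypothetical completion of the truncated sample shown in the error.
val suspect = "[{\"\"col1\"\":\"\"B078Z655KG\"\"}]"

val mapper = new ObjectMapper()
try {
  mapper.readTree(suspect) // throws on malformed input
  println("valid JSON")
} catch {
  case e: JsonProcessingException => println(s"malformed: ${e.getMessage}")
}

Running this prints a parse error with the exact character position, which is usually enough to spot the pattern across other rows.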

For the specific error message shown in the question, we can already tell that it isn't valid JSON because of the doubled double quotes. In other words, ""B078Z655KG"" should be "B078Z655KG".
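Doubled quotes are the standard way CSV escapes quote characters inside a quoted field, so one likely fix, assuming the input file is CSV (the question doesn't say, but the name dfFromCSVJSON suggests it), is to let the CSV reader unescape them and then parse the column with an ArrayType schema, since the value is an array of structs rather than a single struct. A sketch, with the path hypothetical:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: the input is CSV with standard doubled-quote escaping; the
// escape option below makes Spark turn "" back into " while reading.
val raw = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("path/to/input.csv") // hypothetical path

// The sample value is an array of structs, so wrap the element schema in
// ArrayType and explode each element into its own row.
val elementSchema = StructType(Seq(
  StructField("col1", StringType),
  StructField("col2", StringType)
))

val parsed = raw
  .select(from_json(col("faceout_features"), ArrayType(elementSchema)).as("jsonData"))
  .select(explode(col("jsonData")).as("item"))
  .select("item.*")

Selecting only item.* also sidesteps the name clash the original select("col1", "jsonData.*") would hit, because the struct elements contain a col1 field of their own.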
