Spark 2.0 reading json into a dataframe with quotes in a key - different behaviour than spark 1.6… bug?
Unfortunately we have to deal with messy incoming JSON data, and we've found that Spark 2.0 (pyspark) handles quotes in JSON keys differently.
If we use the following as a sample file (sample.json):
{"event":"abc"}
{"event":"xyz","otherdata[\"this.is.ugly\"]":"value1"}
In Spark 1.6.2 we can run the following and get a result:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('temp_quotes')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
data = sqlContext.read.json("sample.json")
data.printSchema()
The result is:
root
|-- event: string (nullable = true)
|-- otherdata["this.is.ugly"]: string (nullable = true)
And we can see the data with show():
data.show(2)
+-----+-------------------------+
|event|otherdata["this.is.ugly"]|
+-----+-------------------------+
| abc| null|
| xyz| value1|
+-----+-------------------------+
However, running the same code in Spark 2.0 prints the same schema:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = SparkConf()
conf.setAppName('temp_quotes')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
data = sqlContext.read.json("sample.json")
data.printSchema()
root
|-- event: string (nullable = true)
|-- otherdata["this.is.ugly"]: string (nullable = true)
But show() fails:
data.show(2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 287, in show
print(self._jdf.showString(n, truncate))
File "/usr/lib/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'Unable to resolve otherdata["this.is.ugly"] given [event, otherdata["this.is.ugly"]];'
Is this a bug, or is there a parameter in Spark 2.0 that I'm missing?
I believe this is addressed by https://issues.apache.org/jira/browse/SPARK-16698 (dots in JSON keys). The fix is scheduled for release in 2.0.1.
(I don't have enough reputation to comment.)
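Until that fix lands, one possible workaround (a sketch, not verified against 2.0.0) is to rename every column positionally with toDF, which avoids asking the analyzer to resolve the problematic names at all. The sanitize_column_name helper below is a hypothetical illustration, not part of the Spark API:

```python
def sanitize_column_name(name):
    """Strip quotes and replace dots/brackets so the name is safe to reference."""
    name = name.replace('"', '')          # drop embedded quotes
    for ch in '.[]':                      # dots and brackets break resolution
        name = name.replace(ch, '_')
    return name.strip('_')               # tidy leading/trailing underscores

# Hypothetical usage with the DataFrame from the question; toDF renames
# columns by position, so Spark never has to resolve the ugly names:
# data = data.toDF(*[sanitize_column_name(c) for c in data.columns])
# data.show(2)
```

With the sample file above, `otherdata["this.is.ugly"]` would become `otherdata_this_is_ugly`, which can then be referenced normally.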