簡體   English   中英

Json 文件到 pyspark 數據框

[英]Json file to pyspark dataframe

我正在嘗試在 spark (pyspark) 環境中使用 JSON 文件。

問題:無法在 Pyspark Dataframe 中將 JSON 轉換為預期格式

第一個輸入數據集:

https://health.data.ny.gov/api/views/cnih-y5dw/rows.json

在此文件中,元數據在文件的開頭定義,帶有“meta”標簽,然后是帶有“data”標簽的數據。

僅供參考:將數據從網絡下載到本地驅動器所采取的步驟。 1. 我已經將文件下載到我的本地驅動器 2. 然后推送到 hdfs - 從那里我正在閱讀它以激發環境。

df=sqlContext.read.json("/user/train/ny.json",multiLine=True)
df.count()
out[5]: 1
df.show()

在此處輸入圖片說明

df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- attribution: string (nullable = true)
 |    |    |-- attributionLink: string (nullable = true)
 |    |    |-- averageRating: long (nullable = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- columns: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- cachedContents: struct (nullable = true)
 |    |    |    |    |    |-- average: string (nullable = true)
 |    |    |    |    |    |-- largest: string (nullable = true)
 |    |    |    |    |    |-- non_null: long (nullable = true)
 |    |    |    |    |    |-- null: long (nullable = true)
 |    |    |    |    |    |-- smallest: string (nullable = true)
 |    |    |    |    |    |-- sum: string (nullable = true)
 |    |    |    |    |    |-- top: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- count: long (nullable = true)
 |    |    |    |    |    |    |    |-- item: string (nullable = true)
 |    |    |    |    |-- dataTypeName: string (nullable = true)
 |    |    |    |    |-- description: string (nullable = true)
 |    |    |    |    |-- fieldName: string (nullable = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- format: struct (nullable = true)
 |    |    |    |    |    |-- align: string (nullable = true)
 |    |    |    |    |    |-- mask: string (nullable = true)
 |    |    |    |    |    |-- noCommas: string (nullable = true)
 |    |    |    |    |    |-- precisionStyle: string (nullable = true)
 |    |    |    |    |-- id: long (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- position: long (nullable = true)
 |    |    |    |    |-- renderTypeName: string (nullable = true)
 |    |    |    |    |-- tableColumnId: long (nullable = true)
 |    |    |    |    |-- width: long (nullable = true)
 |    |    |-- createdAt: long (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- displayType: string (nullable = true)
 |    |    |-- downloadCount: long (nullable = true)
 |    |    |-- flags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- grants: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |-- inherited: boolean (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- hideFromCatalog: boolean (nullable = true)
 |    |    |-- hideFromDataJson: boolean (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- indexUpdatedAt: long (nullable = true)
 |    |    |-- metadata: struct (nullable = true)
 |    |    |    |-- attachments: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- assetId: string (nullable = true)
 |    |    |    |    |    |-- blobId: string (nullable = true)
 |    |    |    |    |    |-- filename: string (nullable = true)
 |    |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- availableDisplayTypes: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- custom_fields: struct (nullable = true)
 |    |    |    |    |-- Additional Resources: struct (nullable = true)
 |    |    |    |    |    |-- See Also: string (nullable = true)
 |    |    |    |    |-- Dataset Information: struct (nullable = true)
 |    |    |    |    |    |-- Agency: string (nullable = true)
 |    |    |    |    |-- Dataset Summary: struct (nullable = true)
 |    |    |    |    |    |-- Contact Information: string (nullable = true)
 |    |    |    |    |    |-- Coverage: string (nullable = true)
 |    |    |    |    |    |-- Data Frequency: string (nullable = true)
 |    |    |    |    |    |-- Dataset Owner: string (nullable = true)
 |    |    |    |    |    |-- Granularity: string (nullable = true)
 |    |    |    |    |    |-- Organization: string (nullable = true)
 |    |    |    |    |    |-- Posting Frequency: string (nullable = true)
 |    |    |    |    |    |-- Time Period: string (nullable = true)
 |    |    |    |    |    |-- Units: string (nullable = true)
 |    |    |    |    |-- Disclaimers: struct (nullable = true)
 |    |    |    |    |    |-- Limitations: string (nullable = true)
 |    |    |    |    |-- Local Data: struct (nullable = true)
 |    |    |    |    |    |-- County Filter: string (nullable = true)
 |    |    |    |    |    |-- County_Column: string (nullable = true)
 |    |    |    |-- filterCondition: struct (nullable = true)
 |    |    |    |    |-- children: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |    |    |-- includeAuto: long (nullable = true)
 |    |    |    |    |    |    |    |-- multiSelect: boolean (nullable = true)
 |    |    |    |    |    |    |    |-- operator: string (nullable = true)
 |    |    |    |    |    |    |    |-- tableColumnId: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- 583607: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- metadata: struct (nullable = true)
 |    |    |    |    |    |-- advanced: boolean (nullable = true)
 |    |    |    |    |    |-- unifiedVersion: long (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |-- jsonQuery: struct (nullable = true)
 |    |    |    |    |-- order: array (nullable = true)
 |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |    |-- columnFieldName: string (nullable = true)
 |    |    |    |-- rdfSubject: string (nullable = true)
 |    |    |    |-- renderTypeConfig: struct (nullable = true)
 |    |    |    |    |-- visible: struct (nullable = true)
 |    |    |    |    |    |-- table: boolean (nullable = true)
 |    |    |    |-- rowLabel: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- newBackend: boolean (nullable = true)
 |    |    |-- numberOfComments: long (nullable = true)
 |    |    |-- oid: long (nullable = true)
 |    |    |-- owner: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- provenance: string (nullable = true)
 |    |    |-- publicationAppendEnabled: boolean (nullable = true)
 |    |    |-- publicationDate: long (nullable = true)
 |    |    |-- publicationGroup: long (nullable = true)
 |    |    |-- publicationStage: string (nullable = true)
 |    |    |-- query: struct (nullable = true)
 |    |    |    |-- orderBys: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- ascending: boolean (nullable = true)
 |    |    |    |    |    |-- expression: struct (nullable = true)
 |    |    |    |    |    |    |-- columnId: long (nullable = true)
 |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |-- rights: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- rowsUpdatedAt: long (nullable = true)
 |    |    |-- rowsUpdatedBy: string (nullable = true)
 |    |    |-- tableAuthor: struct (nullable = true)
 |    |    |    |-- displayName: string (nullable = true)
 |    |    |    |-- flags: array (nullable = true)
 |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- profileImageUrlLarge: string (nullable = true)
 |    |    |    |-- profileImageUrlMedium: string (nullable = true)
 |    |    |    |-- profileImageUrlSmall: string (nullable = true)
 |    |    |    |-- screenName: string (nullable = true)
 |    |    |    |-- type: string (nullable = true)
 |    |    |-- tableId: long (nullable = true)
 |    |    |-- tags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- totalTimesRated: long (nullable = true)
 |    |    |-- viewCount: long (nullable = true)
 |    |    |-- viewLastModified: long (nullable = true)
 |    |    |-- viewType: string (nullable = true)

問題:所有記錄都包含在單行兩列中,即元數據和數據。 同樣使用 spark 本機 JSON 實用程序 - Spark 自動推斷模式(元數據) - 我的期望是它不應該明確作為數據幀上的單獨列。

預期輸出JSON 數據集具有以下列列表。 它應該在數據框中以表格格式顯示它們,我可以在其中查詢它們”

FACILITY, ADDRESS, LAST INSPECTED, VIOLATIONS,TOTAL  CRITICAL VIOLATIONS, TOTAL CRIT.  NOT CORRECTED, TOTAL  NONCRITICAL VIOLATIONS, DESCRIPTION, LOCAL HEALTH DEPARTMENT, COUNTY, FACILITY ADDRESS, CITY, ZIP CODE, NYSDOH GAZETTEER (1980), MUNICIPALITY, OPERATION NAME, PERMIT EXPIRATION DATE, PERMITTED  (D/B/A), PERMITTED  CORP. NAME,PERM. OPERATOR LAST NAME, PERM. OPERATOR LAST NAME, PERM. OPERATOR FIRST NAME, NYS HEALTH OPERATION ID, INSPECTION TYPE, INSPECTION COMMENTS, FOOD SERVICE FACILITY STATE, Location1

2nd Input DataSet:在現場,世界銀行資助項目的第一個數據集

http://jsonstudio.com/resources/

(現場為世界銀行首個資助項目數據集)

它工作正常。

df=sqlContext.read.json("/user/train/wb.json")
df.count()
500 

2nd Input Data Sets 可以工作,但 1st Input dataset 不行。 我的觀察是兩個 Json 文件定義元數據的方式不同。 在 1 日。 首先定義元數據,然后定義第二個文件中的數據 - 每一行都有數據。

您能否指導我了解第一個輸入 JSON 文件格式以及如何在將其轉換為 pyspark 數據幀時處理情況?

更新結果:經過初步分析,我們發現格式似乎是錯誤的,但社區成員幫助了另一種閱讀格式的方法。 將答案標記為正確並關閉此線程。

如果您需要更多詳細信息,請告訴我,提前致謝。

在數據上查看我的筆記本

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/753971180331031/2636709537891264/846924743172/

第一個數據集已損壞,即它不是有效的json ,因此 spark 無法讀取它。

但這是針對 spark 2.2.1

這尤其令人困惑,因為這個json文件的組織方式將數據存儲為列表列表

df=spark.read.json("rows.json",multiLine=True)
data = df.select("data").collect()[0]['data']

並且列名是分開存儲的

column_names = map(lambda x: x['fieldName'], df.select("meta").collect()[0][0][0][4])

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM