PySpark: How to create JSON and CSV files from a variable in PySpark?

Question

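I am trying to write the result held in a variable to a CSV file and then create a JSON file from it. Each iteration of the for loop writes the following result to the variable res_df. If the JSON can be created directly, without creating a CSV first, I would be happy to do that instead. Please help.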

'var_id', 10000001, 14003088.0, 14228946.912793402, 1874168.857698741, 15017976.0, 18000192, 0

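Now I want to append this result to a CSV file and then create a JSON file from it. I have already implemented this in my Python code. I now need help implementing the same thing in PySpark.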

Python code:

# runs once per column, inside the for loop
res_df = line, x.min(), np.percentile(x, 25), np.mean(x), np.std(x), np.percentile(x, 75), x.max(), df[line].isnull().mean() * 100
with open(data_output_file, 'a', newline='') as csvfile:
    writerows = csv.writer(csvfile, delimiter=',',
                           quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writerows.writerow(res_df)

quality_json_df = pd.read_csv(r'./DQ_RESULT.csv')
# it will dump json to file
quality_json_df.to_json("./Dq_Data.json", orient="records")
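
For reference, the same append-then-convert flow can be sketched with only the standard library. This is a minimal sketch, not the asker's code: the helper names `append_row` and `csv_to_json` and the header names in `HEADER` are assumptions made for illustration. Unlike `pd.read_csv`, `csv.DictReader` does no type inference, so every value ends up in the JSON as a string.

```python
import csv
import json
import os

# Assumed header names for the eight values in each result row.
HEADER = ["variable_name", "min", "Q1", "mean", "std", "Q3", "max", "null_value"]


def append_row(csv_path, row):
    """Append one result row; write the header first if the file is new or empty."""
    has_content = os.path.exists(csv_path) and os.path.getsize(csv_path) > 0
    with open(csv_path, "a", newline="") as f:
        writer = csv.writer(f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        if not has_content:
            writer.writerow(HEADER)
        writer.writerow(row)


def csv_to_json(csv_path, json_path):
    """Read the CSV back as a list of dicts and dump it as JSON records."""
    with open(csv_path, newline="") as f:
        records = list(csv.DictReader(f))  # all values are read as strings
    with open(json_path, "w") as f:
        json.dump(records, f)
    return records
```

Calling `append_row(path, res_df)` once per loop iteration and `csv_to_json(path, "./Dq_Data.json")` after the loop would mirror the pandas version above.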

My PySpark code:

for line in tcp.collect():
    # print value in MyCol1 for each row
    print(line)
    v3 = np.array(data.select(line).collect())
    x = v3[np.logical_not(np.isnan(v3))]
    print(x)
    cnt_null = data.filter((data[line] == "") | data[line].isNull() | isnan(data[line])).count()
    print(cnt_null)
    res_df = line, x.min(), np.percentile(x, 25), np.mean(x), np.std(x), np.percentile(x, 75), x.max(), cnt_null
    print(res_df)

Answer 1

import json

import numpy as np
from pyspark.sql.functions import isnan

json_output = []
column_statistic = ["variable_name", "min", "Q1", "mean", "std", "Q3", "max", "null_value"]
for line in tcp.collect():
    # print value in MyCol1 for each row
    print(line)
    v3 = np.array(data.select(line).collect())
    x = v3[np.logical_not(np.isnan(v3))]
    notnan_cnt = x.size  # number of non-NaN values
    print(x)
    cnt_null = data.filter((data[line] == "") | data[line].isNull() | isnan(data[line])).count()
    print(cnt_null, notnan_cnt)
    # cast NumPy scalars to float so that json.dump can serialize them
    stats = [x.min(), np.percentile(x, 25), np.mean(x), np.std(x), np.percentile(x, 75), x.max()]
    res_df = [str(line)] + [float(v) for v in stats] + [cnt_null]
    # pair each statistic with its name to build one JSON record
    json_row = dict(zip(column_statistic, res_df))
    json_output.append(json_row)
    print(res_df)

# write all records at once; no intermediate CSV is needed
with open("json_result.json", "w") as fp:
    json.dump(json_output, fp)
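
As a possible alternative, the statistics could be computed by Spark itself instead of collecting each column into NumPy. The following is only a sketch under stated assumptions: the helper names `to_record`, `column_stats` and `write_stats_json` are hypothetical, the columns are assumed to be numeric, and the null count covers nulls and NaN only (not empty strings). `approxQuantile` with a relative error of 0.0 returns exact quantiles taken from the data, so Q1 and Q3 can differ slightly from `np.percentile`, which interpolates. A column with no valid values is not handled.

```python
import json

COLUMN_STATISTIC = ["variable_name", "min", "Q1", "mean", "std", "Q3", "max", "null_value"]


def to_record(name, *values):
    """Pair a column name and its statistics with the output keys."""
    return dict(zip(COLUMN_STATISTIC, [str(name), *values]))


def column_stats(df, col):
    """Summarise one numeric column of a Spark DataFrame as a plain dict."""
    # Imported lazily so the pure-Python helpers load without PySpark installed.
    from pyspark.sql import functions as F

    c = F.col(col)
    valid = df.filter(c.isNotNull() & ~F.isnan(c))
    row = valid.agg(
        F.min(c).alias("min"),
        F.avg(c).alias("mean"),
        F.stddev_pop(c).alias("std"),  # population std, like np.std
        F.max(c).alias("max"),
    ).first()
    q1, q3 = valid.approxQuantile(col, [0.25, 0.75], 0.0)
    null_count = df.count() - valid.count()  # nulls and NaN only
    return to_record(col, row["min"], q1, row["mean"], row["std"], q3, row["max"], null_count)


def write_stats_json(records, path):
    """Dump the list of per-column records to a JSON file."""
    with open(path, "w") as fp:
        json.dump(records, fp)
```

With a list of column names (here a hypothetical `column_names`), usage might look like `write_stats_json([column_stats(data, name) for name in column_names], "json_result.json")`. The values Spark returns are native Python numbers, so no casting is needed before `json.dump`.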

Solution 1
0 Accepted 2018-12-04 10:33:37
