[英]Parse json RDD into dataframe with Pyspark
I am new to Pyspark.我是 Pyspark 的新手。 From the code below I want to create a spark dataframe.从下面的代码中,我想创建一个火花 dataframe。 It is difficult to parse it the correct way.很难以正确的方式解析它。
How to parse it in a dataframe the right way?如何以正确的方式在 dataframe 中解析它?
How can I parse it and get the following output?如何解析它并获得以下 output?
/ / //
Desired output:所需的 output:
date_added| price| +--------------------+--------------------+ | 2020-11-01| 10000|
The code:编码:
conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
'symbol': 'BTC',
'convert':'JPY'
}
headers = {
'Accepts': 'application/json',
'X-CMC_PRO_API_KEY': '***********************',
}
session = Session()
session.headers.update(headers)
try:
response = session.get(url, params=parameters)
json_rdd = sc.parallelize([response.text])
#data = json.loads(response.text)
#print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
print(e)
sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()
The output dataframe: output dataframe:
| data| status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
JSON schema: JSON 架构:
root
|-- data: struct (nullable = true)
| |-- BTC: struct (nullable = true)
| | |-- circulating_supply: long (nullable = true)
| | |-- cmc_rank: long (nullable = true)
| | |-- date_added: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- is_active: long (nullable = true)
| | |-- is_fiat: long (nullable = true)
| | |-- last_updated: string (nullable = true)
| | |-- max_supply: long (nullable = true)
| | |-- name: string (nullable = true)
| | |-- num_market_pairs: long (nullable = true)
| | |-- platform: string (nullable = true)
| | |-- quote: struct (nullable = true)
| | | |-- JPY: struct (nullable = true)
| | | | |-- last_updated: string (nullable = true)
| | | | |-- market_cap: double (nullable = true)
| | | | |-- percent_change_1h: double (nullable = true)
| | | | |-- percent_change_24h: double (nullable = true)
| | | | |-- percent_change_7d: double (nullable = true)
| | | | |-- price: double (nullable = true)
| | | | |-- volume_24h: double (nullable = true)
| | |-- slug: string (nullable = true)
| | |-- symbol: string (nullable = true)
| | |-- tags: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- total_supply: long (nullable = true)
|-- status: struct (nullable = true)
| |-- credit_count: long (nullable = true)
| |-- elapsed: long (nullable = true)
| |-- error_code: long (nullable = true)
| |-- error_message: string (nullable = true)
| |-- notice: string (nullable = true)
| |-- timestamp: string (nullable = true)
It looks like you've parsed it correctly.看起来您已经正确解析了它。 You can access the nested elements using the dot notation:您可以使用点符号访问嵌套元素:
json_df.select(
F.col('data.BTC.date_added').alias('date_added'),
F.col('data.BTC.quote.JPY.price').alias('price')
)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.