
Parse JSON RDD into a dataframe with PySpark

I am new to PySpark. I want to create a Spark dataframe from the JSON response fetched in the code below, but I am having trouble parsing it correctly.

  1. How do I parse it into a dataframe the right way?

  2. How can I parse it to get the following output?

    Desired output:

    +--------------------+--------------------+
    |          date_added|               price|
    +--------------------+--------------------+
    |          2020-11-01|               10000|
    +--------------------+--------------------+

The code:

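from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
from requests import Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects

conf = SparkConf().setAppName('rates').setMaster("local")
sc = SparkContext(conf=conf)

url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/quotes/latest'
parameters = {
    'symbol': 'BTC',
    'convert': 'JPY'
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': '***********************',
}

session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    # Wrap the raw JSON text in a one-element RDD so Spark can infer the schema
    json_rdd = sc.parallelize([response.text])
    #data = json.loads(response.text)
    #print(data)
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
    raise  # json_rdd is undefined if the request failed, so stop here

sqlContext = SQLContext(sc)
json_df = sqlContext.read.json(json_rdd)
json_df.show()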

The output dataframe:

+--------------------+--------------------+
|                data|              status|
+--------------------+--------------------+
|[[18557275, 1, 20...|[1, 18, 0,,, 2020...|
+--------------------+--------------------+

JSON schema:

root
 |-- data: struct (nullable = true)
 |    |-- BTC: struct (nullable = true)
 |    |    |-- circulating_supply: long (nullable = true)
 |    |    |-- cmc_rank: long (nullable = true)
 |    |    |-- date_added: string (nullable = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- is_active: long (nullable = true)
 |    |    |-- is_fiat: long (nullable = true)
 |    |    |-- last_updated: string (nullable = true)
 |    |    |-- max_supply: long (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- num_market_pairs: long (nullable = true)
 |    |    |-- platform: string (nullable = true)
 |    |    |-- quote: struct (nullable = true)
 |    |    |    |-- JPY: struct (nullable = true)
 |    |    |    |    |-- last_updated: string (nullable = true)
 |    |    |    |    |-- market_cap: double (nullable = true)
 |    |    |    |    |-- percent_change_1h: double (nullable = true)
 |    |    |    |    |-- percent_change_24h: double (nullable = true)
 |    |    |    |    |-- percent_change_7d: double (nullable = true)
 |    |    |    |    |-- price: double (nullable = true)
 |    |    |    |    |-- volume_24h: double (nullable = true)
 |    |    |-- slug: string (nullable = true)
 |    |    |-- symbol: string (nullable = true)
 |    |    |-- tags: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- total_supply: long (nullable = true)
 |-- status: struct (nullable = true)
 |    |-- credit_count: long (nullable = true)
 |    |-- elapsed: long (nullable = true)
 |    |-- error_code: long (nullable = true)
 |    |-- error_message: string (nullable = true)
 |    |-- notice: string (nullable = true)
 |    |-- timestamp: string (nullable = true)
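
Reading the schema, the two fields from the desired output sit at data.BTC.date_added and data.BTC.quote.JPY.price. As a rough sketch outside Spark, the same paths can be followed through the dict that json.loads returns (the commented-out line in the code). The get_path helper and sample_text are hypothetical: sample_text is a cut-down stand-in for response.text that reuses the values from the desired output, not a real API response.

```python
import json
from functools import reduce


def get_path(doc, dotted_path):
    """Follow a dotted path such as 'data.BTC.date_added' through nested dicts."""
    return reduce(lambda node, key: node[key], dotted_path.split("."), doc)


# Hypothetical stand-in for response.text, trimmed to the fields of interest.
sample_text = json.dumps({
    "data": {
        "BTC": {
            "date_added": "2020-11-01",
            "quote": {"JPY": {"price": 10000.0}},
        }
    },
    "status": {"error_code": 0},
})

data = json.loads(sample_text)
date_added = get_path(data, "data.BTC.date_added")
price = get_path(data, "data.BTC.quote.JPY.price")
print(date_added, price)
```

The dotted strings are the same ones a Spark column reference would use, so this is a quick way to check a path before running a job.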

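It looks like you've parsed it correctly. Each struct level in the schema becomes one segment of a dotted column path, so you can select the nested fields directly and alias them to flat column names:

from pyspark.sql import functions as F

json_df.select(
    F.col('data.BTC.date_added').alias('date_added'),
    F.col('data.BTC.quote.JPY.price').alias('price')
).show()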
