
Convert a dataframe containing a list of dictionaries to several rows in PySpark

I have a dataframe with two columns that each hold a list of dictionaries (an array of structs). The schema I created for this data structure is:

        from pyspark.sql.types import ArrayType, StringType, StructField, StructType

        # One row per symbol; each array column holds a list of structs
        tick_by_tick_schema = StructType([
            StructField('localSymbol', StringType()),
            StructField('tickByTicks', ArrayType(StructType([
                StructField('price', StringType()),
                StructField('size', StringType()),
                StructField('specialConditions', StringType()),
            ]))),
            StructField('domBids', ArrayType(StructType([
                StructField('price_bid', StringType()),
                StructField('size_bid', StringType()),
                StructField('marketMaker_bid', StringType()),
            ])))
        ])

My dataframe looks like this:

+-----------+----------------+----------------------------------------------------------------------------------------+
|localSymbol|tickByTicks     |domBids                                                                                 |
+-----------+----------------+----------------------------------------------------------------------------------------+
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|
+-----------+----------------+----------------------------------------------------------------------------------------+

What I want to get is this:

+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+
|localSymbol|tickByTicks     |domBids                                                                                 |price_bid|marketMaker_bid|price|
+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8     |CHX            |32.99|
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8     |MEMX           |32.99|
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.8     |NYSENAT        |32.99|
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.79    |NSDQ           |32.99|
|ALKT       |[{32.99, 100, }]|[{32.8, 1, CHX}, {32.8, 1, MEMX}, {32.8, 1, NYSENAT}, {32.79, 1, NSDQ}, {32.69, 1, BYX}]|32.69    |BYX            |32.99|
+-----------+----------------+----------------------------------------------------------------------------------------+---------+---------------+-----+

I tried this, but obviously it doesn't work xD

df = self.tick_by_tick_data_processed.select(f.col('localSymbol'),f.col('tickByTicks'),f.col('domBids'))\
    .withColumn('price_bid', f.explode(f.col('tickByTicks.price'))) \
    .withColumn('marketMaker_bid', f.explode(f.col('domBids.marketMaker_bid'))) \
    .withColumn('price_bid', f.explode(f.col('domBids.price_bid')))
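
For anyone wondering what goes wrong here, below is a plain-Python sketch (no Spark needed) of what these three chained calls most likely do. It assumes that `explode` emits one output row per array element and that `withColumn` overwrites a column whose name already exists. The `explode` helper is a hypothetical stand-in written for this illustration, not the PySpark function, and the sample row is copied from the table above.

```python
# Plain-Python model of the three chained explodes in the attempt.
# Assumptions: explode() yields one row per array element, and
# withColumn() replaces a column that already has the same name.

row = {
    "localSymbol": "ALKT",
    "tickByTicks": [{"price": "32.99", "size": "100", "specialConditions": ""}],
    "domBids": [
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "CHX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "MEMX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "NYSENAT"},
        {"price_bid": "32.79", "size_bid": "1", "marketMaker_bid": "NSDQ"},
        {"price_bid": "32.69", "size_bid": "1", "marketMaker_bid": "BYX"},
    ],
}


def explode(rows, new_col, array_col, field):
    """Hypothetical helper: copy each row once per element of row[array_col],
    storing that element's `field` under `new_col` (overwriting if present)."""
    return [{**r, new_col: item[field]} for r in rows for item in r[array_col]]


rows = [row]
rows = explode(rows, "price_bid", "tickByTicks", "price")              # 1 row
rows = explode(rows, "marketMaker_bid", "domBids", "marketMaker_bid")  # 5 rows
rows = explode(rows, "price_bid", "domBids", "price_bid")              # 25 rows

# Every market maker is now paired with every bid price, and the tick price
# written by the first step has been overwritten by the third.
pairs = {(r["price_bid"], r["marketMaker_bid"]) for r in rows}
print(len(rows))
print(("32.69", "CHX") in pairs)
```

Under these assumptions the two `domBids` fields are exploded independently, so they are cross-joined rather than kept side by side, and the tick price is lost because `price_bid` is reused as a column name.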

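This might work: explode each array into its own column of structs, then select the struct fields you need.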

import pyspark.sql.functions as f

# Option 1: explode both arrays, then expand every field of each struct
df = self.tick_by_tick_data_processed \
    .withColumn('tick', f.explode(f.col('tickByTicks'))) \
    .withColumn('dom', f.explode(f.col('domBids'))) \
    .select('localSymbol', 'tick.*', 'dom.*')

# OR

# Option 2: explode both arrays, then select only the desired fields
df = self.tick_by_tick_data_processed \
    .withColumn('tick', f.explode(f.col('tickByTicks'))) \
    .withColumn('dom', f.explode(f.col('domBids'))) \
    .select('localSymbol',
            f.col('tick.price').alias('tick_price'),
            f.col('dom.marketMaker_bid').alias('marketMaker_bid'),
            f.col('dom.price_bid').alias('price_bid'))
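
To see why exploding the whole struct keeps `price_bid` and `marketMaker_bid` aligned, here is a minimal plain-Python sketch of the second variant. It is a model under the assumption that `explode` emits one row per array element, not Spark code: `explode_structs` is a hypothetical helper, and the sample row is taken from the question.

```python
# Plain-Python model of "explode the struct, then pick its fields".
# Assumption: explode() yields one output row per array element.

row = {
    "localSymbol": "ALKT",
    "tickByTicks": [{"price": "32.99", "size": "100", "specialConditions": ""}],
    "domBids": [
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "CHX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "MEMX"},
        {"price_bid": "32.8", "size_bid": "1", "marketMaker_bid": "NYSENAT"},
        {"price_bid": "32.79", "size_bid": "1", "marketMaker_bid": "NSDQ"},
        {"price_bid": "32.69", "size_bid": "1", "marketMaker_bid": "BYX"},
    ],
}


def explode_structs(rows, new_col, array_col):
    """Hypothetical helper: copy each row once per element of row[array_col],
    storing the whole element (all of its fields) under `new_col`."""
    return [{**r, new_col: item} for r in rows for item in r[array_col]]


def flatten(rows):
    """Mirror the final select(): keep only the desired fields."""
    return [
        {
            "localSymbol": r["localSymbol"],
            "tick_price": r["tick"]["price"],
            "marketMaker_bid": r["dom"]["marketMaker_bid"],
            "price_bid": r["dom"]["price_bid"],
        }
        for r in rows
    ]


exploded = explode_structs(explode_structs([row], "tick", "tickByTicks"), "dom", "domBids")
result = flatten(exploded)
for r in result:
    print(r)

# A row whose array is empty produces no output rows in this model.
empty = {**row, "domBids": []}
dropped = explode_structs(explode_structs([empty], "tick", "tickByTicks"), "dom", "domBids")
print(len(dropped))
```

Because each `dom` value is a single struct, its fields always come from the same bid. The row count is the product of the two array lengths, so a row with several ticks would repeat every bid once per tick. In PySpark itself, `explode` drops rows whose array is null or empty; `explode_outer` from `pyspark.sql.functions` keeps them and fills the new column with null. To match the desired output in the question exactly, the final `select` would also need to keep `tickByTicks` and `domBids` and alias `tick.price` as `price`.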
