How to join dynamically named columns into dictionary?

Given these data frames:

Venue|Date    | 08 | 10 |
Hotel|20190101| 15 | 03 |
Beach|20190101| 93 | 45 |

Venue|Date    | 07 | 10 | 
Beach|20190101| 30 | 5  |
Hotel|20190103| 05 | 15 |

How can I merge (full join) the two tables into something like the following, without manually looping through each row of both tables?

 {"Venue":"Hotel", "Date":"20190101", "08":{ "IncomingCount":15 }, "10":{ "IncomingCount":03 } },
 {"Venue":"Beach", "Date":"20190101", "07":{ "OutgoingCount":30 }, "08":{ "IncomingCount":93 }, "10":{ "IncomingCount":45, "OutgoingCount":5 } },
 {"Venue":"Hotel", "Date":"20190103", "07":{ "OutgoingCount":05 }, "10":{ "OutgoingCount":15 } }

The conditions are:

  1. The Venue and Date columns act as the join keys.
  2. The other columns, named with numbers, are created dynamically.
  3. If a dynamic column does not exist for a row, it is excluded (or included with None as its value).

This is what I have so far:

import pandas as pd
import numpy as np

dd1 = {'venue': ['hotel', 'beach'], 'date':['20190101', '20190101'], '08': [15, 93], '10':[3, 45]}
dd2 = {'venue': ['beach', 'hotel'], 'date':['20190101', '20190103'], '07': [30, 5], '10':[5, 15]}

df1 = pd.DataFrame(data=dd1)
df2 = pd.DataFrame(data=dd2)

df1.columns = [f"IncomingCount:{x}" if x not in ['venue', 'date'] else x for x in df1.columns]
df2.columns = [f"OutgoingCount:{x}" if x not in ['venue', 'date'] else x for x in df2.columns ]

ll_dd = pd.merge(df1, df2, on=['venue', 'date'], how='outer').to_dict('records')
ll_dd = [{k:v for k,v in dd.items() if not pd.isnull(v)} for dd in ll_dd]


[{'venue': 'hotel',
  'date': '20190101',
  'IncomingCount:08': 15.0,
  'IncomingCount:10': 3.0},
 {'venue': 'beach',
  'date': '20190101',
  'IncomingCount:08': 93.0,
  'IncomingCount:10': 45.0,
  'OutgoingCount:07': 30.0,
  'OutgoingCount:10': 5.0},
 {'venue': 'hotel',
  'date': '20190103',
  'OutgoingCount:07': 5.0,
  'OutgoingCount:10': 15.0}]
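
One possible way to get from this flat output to the nested shape in the question is to split each prefixed key back into its label and column name. This is only a sketch in plain Python; the helper name nest_counts and the ":" separator convention are assumptions that match the renaming done above, not part of the original attempt.

```python
def nest_counts(records, key_cols=("venue", "date"), sep=":"):
    """Turn 'IncomingCount:08': 15.0 style entries into '08': {'IncomingCount': 15.0}."""
    nested = []
    for rec in records:
        out = {}
        for name, value in rec.items():
            if name in key_cols:
                # Join keys are copied through unchanged
                out[name] = value
                continue
            # 'IncomingCount:08' -> label 'IncomingCount', column '08'
            label, column = name.split(sep, 1)
            out.setdefault(column, {})[label] = value
        nested.append(out)
    return nested


# Two of the flat records produced by the merge above
flat = [
    {"venue": "hotel", "date": "20190101",
     "IncomingCount:08": 15.0, "IncomingCount:10": 3.0},
    {"venue": "beach", "date": "20190101",
     "IncomingCount:08": 93.0, "IncomingCount:10": 45.0,
     "OutgoingCount:07": 30.0, "OutgoingCount:10": 5.0},
]
nested = nest_counts(flat)
```

Applied to ll_dd, this would still loop over the merged records in Python, but only once and only after pandas has done the join.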

It's pretty fiddly, but it can be done with Spark's create_map function.

Basically, divide the columns into four groups: keys (Venue, Date), common (10), incoming only (08), outgoing only (07).

Then create a mapper per group (except keys), mapping only what's available in that group. Apply the mapping, drop the old column, and rename the mapped column to the old name.

Lastly, convert all rows to dicts (via the DataFrame's rdd) and collect.

from pyspark.sql import SparkSession
from pyspark.sql.functions import create_map, col, lit

spark = SparkSession.builder.appName('hotels_and_beaches').getOrCreate()

incoming_counts = spark.createDataFrame([('Hotel', 20190101, 15, 3), ('Beach', 20190101, 93, 45)], ['Venue', 'Date', '08', '10']).alias('inc')
outgoing_counts = spark.createDataFrame([('Beach', 20190101, 30, 5), ('Hotel', 20190103, 5, 15)], ['Venue', 'Date', '07', '10']).alias('out')

df = incoming_counts.join(outgoing_counts, on=['Venue', 'Date'], how='full')

outgoing_cols = {c for c in outgoing_counts.columns if c not in {'Venue', 'Date'}}
incoming_cols = {c for c in incoming_counts.columns if c not in {'Venue', 'Date'}}

common_cols = outgoing_cols.intersection(incoming_cols)

outgoing_cols = outgoing_cols.difference(common_cols)
incoming_cols = incoming_cols.difference(common_cols)

for c in common_cols:
    # Columns present in both frames get both counts in one map
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)

for c in incoming_cols:
    # Columns that only exist in the incoming frame
    df = df.withColumn(
        c + '_new', create_map(
            lit('IncomingCount'), col('inc.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)

for c in outgoing_cols:
    # Columns that only exist in the outgoing frame
    df = df.withColumn(
        c + '_new', create_map(
            lit('OutgoingCount'), col('out.{}'.format(c)),
        )
    ).drop(c).withColumnRenamed(c + '_new', c)

result = df.coalesce(1).rdd.map(lambda r: r.asDict()).collect()

Result:

[{'Venue': 'Hotel', 'Date': 20190101, '10': {'OutgoingCount': None, 'IncomingCount': 3}, '08': {'IncomingCount': 15}, '07': {'OutgoingCount': None}}, {'Venue': 'Hotel', 'Date': 20190103, '10': {'OutgoingCount': 15, 'IncomingCount': None}, '08': {'IncomingCount': None}, '07': {'OutgoingCount': 5}}, {'Venue': 'Beach', 'Date': 20190101, '10': {'OutgoingCount': 5, 'IncomingCount': 45}, '08': {'IncomingCount': 93}, '07': {'OutgoingCount': 30}}]
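
The collected rows keep None for counts that are missing on one side, which matches the "included with None" variant of the third condition. If the "excluded" variant is preferred, a small post-processing pass in plain Python could prune them. This is a sketch only; prune_missing is a hypothetical helper name, and it assumes each map column comes back from asDict() as a Python dict, as in the output above.

```python
def prune_missing(record):
    """Drop None counts, and drop a column entirely if its map ends up empty."""
    pruned = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Keep only the counts that are actually present
            kept = {k: v for k, v in value.items() if v is not None}
            if kept:
                pruned[key] = kept
        elif value is not None:
            pruned[key] = value
    return pruned


# First row of the result shown above
row = {'Venue': 'Hotel', 'Date': 20190101,
       '10': {'OutgoingCount': None, 'IncomingCount': 3},
       '08': {'IncomingCount': 15},
       '07': {'OutgoingCount': None}}
cleaned = prune_missing(row)
# cleaned == {'Venue': 'Hotel', 'Date': 20190101,
#             '10': {'IncomingCount': 3}, '08': {'IncomingCount': 15}}
```

It could be applied to the whole list with [prune_missing(r) for r in result].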

The final result desired by the OP is a list of dictionaries, in which all rows that share the same Venue and Date are merged together.

# Creating the DataFrames
df_Incoming = sqlContext.createDataFrame([('Hotel','20190101',15,3),('Beach','20190101',93,45)],('Venue','Date','08','10'))
# |Venue|    Date| 08| 10|
# |Hotel|20190101| 15|  3|
# |Beach|20190101| 93| 45|
df_Outgoing = sqlContext.createDataFrame([('Beach','20190101',30,5),('Hotel','20190103',5,15)],('Venue','Date','07','10'))
# |Venue|    Date| 07| 10|
# |Beach|20190101| 30|  5|
# |Hotel|20190103|  5| 15|

The idea is to create a dictionary from each row and store all rows of the DataFrame as dictionaries in one big list. As a final step, we merge the dictionaries that have the same Venue and Date.

Since all rows in the DataFrame are stored as Row() objects, we use the collect() function to return all records as a list of Row(). To illustrate the output:

[Row(Venue='Hotel', Date='20190101', 08=15, 10=3), Row(Venue='Beach', Date='20190101', 08=93, 10=45)]

But since we want a list of dictionaries, we can use a list comprehension to convert the rows into one:

list_Incoming = [row.asDict() for row in df_Incoming.collect()]
[{'10': 3, 'Date': '20190101', 'Venue': 'Hotel', '08': 15}, {'10': 45, 'Date': '20190101', 'Venue': 'Beach', '08': 93}]

But the numeric columns need to be in the form "08":{ "IncomingCount":15 } instead of "08":15, so we use dictionary comprehensions to convert them into this form:

list_Incoming = [ {k:v if k in ['Venue','Date'] else {'IncomingCount':v} for k,v in dict_element.items()} for dict_element in list_Incoming]
[{'10': {'IncomingCount': 3}, 'Date': '20190101', 'Venue': 'Hotel', '08': {'IncomingCount': 15}}, {'10': {'IncomingCount': 45}, 'Date': '20190101', 'Venue': 'Beach', '08': {'IncomingCount': 93}}]

We do the same for OutgoingCount:

list_Outgoing = [row.asDict() for row in df_Outgoing.collect()]
list_Outgoing = [ {k:v if k in ['Venue','Date'] else {'OutgoingCount':v} for k,v in dict_element.items()} for dict_element in list_Outgoing]
[{'10': {'OutgoingCount': 5}, 'Date': '20190101', 'Venue': 'Beach', '07': {'OutgoingCount': 30}}, {'10': {'OutgoingCount': 15}, 'Date': '20190103', 'Venue': 'Hotel', '07': {'OutgoingCount': 5}}]
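
Since the incoming and outgoing conversions differ only in the label, the two comprehensions could be folded into one helper. The following is a minimal sketch, not the answer's own code; wrap_counts and its key_cols parameter are hypothetical names, and the input is assumed to be the list of plain dicts obtained from row.asDict().

```python
def wrap_counts(records, label, key_cols=("Venue", "Date")):
    """Wrap every non-key value as {label: value}, leaving key columns untouched."""
    return [
        {k: (v if k in key_cols else {label: v}) for k, v in rec.items()}
        for rec in records
    ]


# Plain dicts as produced by [row.asDict() for row in df.collect()]
rows_in = [{"Venue": "Hotel", "Date": "20190101", "08": 15, "10": 3}]
rows_out = [{"Venue": "Beach", "Date": "20190101", "07": 30, "10": 5}]

wrapped_in = wrap_counts(rows_in, "IncomingCount")
wrapped_out = wrap_counts(rows_out, "OutgoingCount")
```

This keeps the set of key columns in one place, which helps if the join keys ever change.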

Final step: now that we have created the required lists of dictionaries, we need to merge them on the basis of Venue and Date.

from copy import deepcopy
def merge_lists(list_Incoming, list_Outgoing):
    # create dictionary from list_Incoming:
    dict1 = {(record['Venue'], record['Date']): record  for record in list_Incoming}

    #compare elements in list_Outgoing to those on list_Incoming:

    result = {}
    for record in list_Outgoing:
        ckey = record['Venue'], record['Date']
        new_record = deepcopy(record)
        if ckey in dict1:
            for key, value in dict1[ckey].items():
                if key in ('Venue', 'Date'):
                    # Do not merge these keys
                    continue
                # Dict's "setdefault" finds a key/value, and if it is missing
                # creates a new one with the second parameter as value
                new_record.setdefault(key, {}).update(value)

        result[ckey] = new_record

    # Add values from list_Incoming that were not matched in list_Outgoing:
    for key, value in dict1.items():
        if key not in result:
            result[key] = deepcopy(value)

    return list(result.values())

res = merge_lists(list_Incoming, list_Outgoing)
[{'10': {'OutgoingCount': 5, 'IncomingCount': 45},
  'Date': '20190101',
  'Venue': 'Beach',
  '08': {'IncomingCount': 93},
  '07': {'OutgoingCount': 30}},

 {'10': {'OutgoingCount': 15},
  'Date': '20190103',
  'Venue': 'Hotel',
  '07': {'OutgoingCount': 5}},

 {'10': {'IncomingCount': 3},
  'Date': '20190101',
  'Venue': 'Hotel',
  '08': {'IncomingCount': 15}}]
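
The same merge can also be sketched more compactly by accumulating everything into one dict keyed by (Venue, Date), which avoids deepcopy and handles any number of input lists. This is an alternative under the same assumptions as above (each non-key value is already wrapped as a small dict); merge_by_key is a hypothetical name, not part of the original answer.

```python
def merge_by_key(*record_lists, key_cols=("Venue", "Date")):
    """Merge lists of wrapped records so that rows sharing the key columns become one dict."""
    merged = {}
    for records in record_lists:
        for rec in records:
            key = tuple(rec[c] for c in key_cols)
            # Start a fresh record holding only the key columns on first sight
            target = merged.setdefault(key, {c: rec[c] for c in key_cols})
            for name, counts in rec.items():
                if name in key_cols:
                    continue
                # Combine e.g. {'IncomingCount': 45} and {'OutgoingCount': 5}
                target.setdefault(name, {}).update(counts)
    return list(merged.values())


incoming = [
    {'Venue': 'Hotel', 'Date': '20190101', '08': {'IncomingCount': 15}, '10': {'IncomingCount': 3}},
    {'Venue': 'Beach', 'Date': '20190101', '08': {'IncomingCount': 93}, '10': {'IncomingCount': 45}},
]
outgoing = [
    {'Venue': 'Beach', 'Date': '20190101', '07': {'OutgoingCount': 30}, '10': {'OutgoingCount': 5}},
    {'Venue': 'Hotel', 'Date': '20190103', '07': {'OutgoingCount': 5}, '10': {'OutgoingCount': 15}},
]
merged = merge_by_key(incoming, outgoing)
```

The inputs are left unmodified because each output record is built from new dicts; rows appear in the order their key was first seen.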
