
Pandas: Pythonic way to find rows with matching values in multiple columns (hierarchical conditions)

Sorry for the somewhat unclear title; I could not find a succinct way to describe the question. Hopefully the description below helps clarify it. Any clarifying edit to the title is welcome.

I am trying to create a networkx flow diagram from a pandas dataframe. The dataframe records how an order flows through multiple firms. Most of the rows in the dataframe are connected and the connections are manifested in multiple columns. Sample data is as below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'event_type': ['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
                   'event_id': ['110', '120', '200', '210', '220', '300', '310'],
                   'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
                   'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan]})

The dataframe looks like below:

  Company event_type event_id prior_event_id route_id
0       A        new      110            NaN      NaN
1       A      route      120            110      foo
2       B    receive      200            NaN      foo
3       B    execute      210            120      NaN
4       B      route      220            210      bar
5       C    receive      300            NaN      bar
6       C    execute      310            300      NaN

The order goes through three companies: A, B and C. Within each company, a later event can be linked to its source event through the event_id - prior_event_id pair, but that method does not work for records that belong to different companies. Rows 1 and 2, for instance, can only be matched through the single column route_id. The linking mechanism I'm trying to recreate is therefore hierarchical: I only use route_id to match if the event_id - prior_event_id pair doesn't yield anything.

The picture below may help illustrate the linking mechanism: [image: sample diagram]

My solution is quite clunky:

# Make every event unique so as to not confound the linking
df['event_sub'] = df.groupby(df.event_type).cumcount()+1 
df['event'] = df.event_type + ' ' + df.event_sub.astype(str) 

# Find the match based on first matching criterion
replace_dict_event = dict(df[['event_id', 'event']].values)
df['source'] = df['prior_event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )
df['target'] = df['event_id'].apply(lambda x: replace_dict_event.get(x) if replace_dict_event.get(x) else np.nan )

# From last step, find the match based on second matching criterion for the unmatched rows 
replace_dict_rtd = dict(df[df.event_type == 'route'][['route_id', 'event']].values)
df.loc[df.event_type == 'receive', 'source'] = df[df.event_type == 'receive']['route_id'].apply(lambda x: replace_dict_rtd.get(x))
df

I essentially used apply twice to get the match step by step. I wonder if there is a cleaner, more Pythonic way to do it.

My result is shown below: [image]

And the networkx diagram I created from this:

[image: flow]

You have two different types of links: a) links that are defined by matching prior_event_id and event_id, and b) links that are defined by route_id. Using two different sets of commands to extract the two different types of relations is Pythonic (or just plain good coding practice).

That being said, since you are dealing with tabular data, it would probably be better to use merges (specifically inner joins) to extract your links, rather than dictionary lookups with apply. Databases for tabular data are optimized for that sort of query, whereas your lookups will be much slower for large data sets.

#!/usr/bin/env python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

if __name__ == '__main__':

    df = pd.DataFrame({'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
                       'event_type':['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
                       'event_id': ['110', '120', '200', '210', '220', '300', '310'],
                       'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
                       'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan]}
    )

    # --------------------------------------------------------------------------------
    # a) links established by matching event_id with prior_event_id
    df2 = pd.merge(df, df, left_on='event_id', right_on='prior_event_id', how='inner')

    #       Company_x event_type_x event_id_x prior_event_id_x route_id_x Company_y event_type_y event_id_y prior_event_id_y route_id_y
    # 0         A          new        110              NaN        NaN         A        route        120              110        foo
    # 1         A        route        120              110        foo         B      execute        210              120        NaN
    # 2         B      execute        210              120        NaN         B        route        220              210        bar
    # 3         C      receive        300              NaN        bar         C      execute        310              300        NaN

    # --------------------------------------------------------------------------------
    # b) links established by matching route_id

    # remove events without route ids
    valid = df['route_id'].notna()
    df3 = df[valid]  # boolean-mask selection; df['valid'] would look for a column named 'valid'

    #   Company event_type event_id prior_event_id route_id
    # 1       A      route      120            110      foo
    # 2       B    receive      200            NaN      foo
    # 4       B      route      220            210      bar
    # 5       C    receive      300            NaN      bar

    # join on route_id
    df4 = pd.merge(df3, df3, on='route_id', how='inner')

    #   Company_x event_type_x event_id_x prior_event_id_x route_id Company_y event_type_y event_id_y prior_event_id_y
    # 0         A        route        120              110      foo         A        route        120              110
    # 1         A        route        120              110      foo         B      receive        200              NaN
    # 2         B      receive        200              NaN      foo         A        route        120              110
    # 3         B      receive        200              NaN      foo         B      receive        200              NaN
    # 4         B        route        220              210      bar         B        route        220              210
    # 5         B        route        220              210      bar         C      receive        300              NaN
    # 6         C      receive        300              NaN      bar         B        route        220              210
    # 7         C      receive        300              NaN      bar         C      receive        300              NaN

    # remove cases where a company was matched to itself
    valid = df4['Company_x'] != df4['Company_y']
    df5 = df4[valid]

    #       Company_x event_type_x event_id_x prior_event_id_x route_id Company_y event_type_y event_id_y prior_event_id_y
    # 1         A        route        120              110      foo         B      receive        200              NaN
    # 2         B      receive        200              NaN      foo         A        route        120              110
    # 5         B        route        220              210      bar         C      receive        300              NaN
    # 6         C      receive        300              NaN      bar         B        route        220              210
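
The two link tables above still need to be turned into a single edge list before they can be drawn. The following is only a sketch of one way to do that, not part of the original answer. It assumes that route events are always the source and receive events the target of a route_id link (as in the question's own code), so it joins route rows to receive rows directly instead of self-joining and filtering on Company. The names by_event, by_route, edges, link and the suffixes _src / _dst are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Company': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'event_type': ['new', 'route', 'receive', 'execute', 'route', 'receive', 'execute'],
    'event_id': ['110', '120', '200', '210', '220', '300', '310'],
    'prior_event_id': [np.nan, '110', np.nan, '120', '210', np.nan, '300'],
    'route_id': [np.nan, 'foo', 'foo', np.nan, 'bar', 'bar', np.nan],
})

# a) links where the left row's event_id equals the right row's prior_event_id.
#    Rows without a prior_event_id are dropped first so missing keys can never match.
by_event = pd.merge(df, df.dropna(subset=['prior_event_id']),
                    left_on='event_id', right_on='prior_event_id',
                    how='inner', suffixes=('_src', '_dst'))

# b) links where a 'route' row and a 'receive' row share a route_id
#    (assumption: route is the source, receive is the target).
routes = df[df['event_type'] == 'route'].dropna(subset=['route_id'])
receives = df[df['event_type'] == 'receive'].dropna(subset=['route_id'])
by_route = pd.merge(routes, receives, on='route_id',
                    how='inner', suffixes=('_src', '_dst'))

# Stack both link types into one edge list and remember where each edge came from.
cols = ['event_id_src', 'event_id_dst']
edges = pd.concat([by_event[cols].assign(link='event'),
                   by_route[cols].assign(link='route')],
                  ignore_index=True)
edges = edges.rename(columns={'event_id_src': 'source', 'event_id_dst': 'target'})

print(edges)
```

If networkx is available, an edge list of this shape can be passed to nx.from_pandas_edgelist(edges, 'source', 'target', create_using=nx.DiGraph) to obtain a directed graph for plotting.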
