
Iterate over list elements in Pandas dataframe column and match with values in a different dataframe

I have two dataframes. I want to iterate over the elements in each list in the Companies column of the first dataframe and match them against the company names in the second dataframe, but only if the date from the first dataframe occurs after the date in the second dataframe. I want two columns for the name matches and two columns for the date matches returned.

import pandas as pd

df = pd.DataFrame(columns=['Customer', 'Companies', 'Date'])
df = df.append({'Customer':'Gold', 'Companies':['Gold Ltd', 'Gold X', 'Gold De'], 'Date':'2019-01-07'}, ignore_index=True)
df = df.append({'Customer':'Micro', 'Companies':['Microf', 'Micro Inc', 'Micre'], 'Date':'2019-02-10'}, ignore_index=True)


Customer    Companies                     Date
0   Gold    [Gold Ltd, Gold X, Gold De] 2019-01-07
1   Micro   [Microf, Micro Inc, Micre]  2019-02-10


df2 = pd.DataFrame(columns=['Companies', 'Date'])
df2 = df2.append({'Companies':'Gold Ltd', 'Date':'2019-01-01'}, ignore_index=True)
df2 = df2.append({'Companies':'Gold X', 'Date':'2020-01-07'}, ignore_index=True)
df2 = df2.append({'Companies': 'Gold De', 'Date':'2018-07-07'}, ignore_index=True)
df2 = df2.append({'Companies':'Microf', 'Date':'2019-02-18'}, ignore_index=True)
df2 = df2.append({'Companies':'Micro Inc', 'Date':'2017-09-27'}, ignore_index=True)
df2 = df2.append({'Companies':'Micre', 'Date':'2018-12-11'}, ignore_index=True)

Companies         Date
0   Gold Ltd    2019-01-01
1   Gold X      2020-01-07
2   Gold De     2018-07-07
3   Microf      2019-02-18
4   Micro Inc   2017-09-27
5   Micre       2018-12-11


from datetime import datetime

def match_it(d1, d2):
    for companies in d1['Companies']:
        for company in companies:
            # Only consider companies that also appear in the second dataframe
            if d2['Companies'].str.contains(company).any():
                # Row(s) of d1 whose Companies list contains this company
                mask = d1.Companies.apply(lambda x: company in x)
                dff = d1[mask]
                date1 = datetime.strptime(dff['Date'].values[0], '%Y-%m-%d').date()
                date2 = datetime.strptime(d2[d2['Companies']==company]['Date'].values[0], '%Y-%m-%d').date()

                # Keep the match only if the customer's date is later
                if date2 < date1:
                    print(d2[d2['Companies']==company])
                    new_row = pd.Series([d2[d2['Companies']==company]['Date'], d2[d2['Companies']==company]['Companies']])
                    return new_row

Desired Output:

Customer    Companies                 Date       Name_1       Date_1      Name_2      Date_2    
Gold    [Gold Ltd, Gold X, Gold De] 2019-01-07   Gold Ltd   2019-01-01  Gold De      2018-07-07
Micro   [Microf, Micro Inc, Micre]  2019-02-10   Micro Inc  2017-09-27  Micre       2018-12-11

Start with a more pandasonic way to convert the Date columns in both DataFrames from string to datetime:

df.Date = pd.to_datetime(df.Date)
df2.Date = pd.to_datetime(df2.Date)

Then proceed as follows:

df3 = df.explode('Companies')                               # one row per (Customer, company)
df3 = df3.merge(df2, on='Companies', suffixes=('_x', ''))   # Date_x = customer date, Date = company date
df3 = df3[df3.Date_x > df3.Date].drop(columns='Date_x')     # keep only "later" matches
df3.rename(columns={'Companies': 'Name'}, inplace=True)
df3['idx'] = df3.groupby('Customer').cumcount()             # number the matches within each customer
df3 = df3.pivot(index='Customer', columns='idx')            # one row per customer, MultiIndex columns
df3 = df3.swaplevel(axis=1)                                 # put the idx level first
df3 = df3.sort_index(axis=1, ascending=[True, False])       # Name before Date within each idx
cols = []
for i in range(1, df3.columns.size // 2 + 1):
    cols.extend(['Name_' + str(i), 'Date_' + str(i)])       # flatten to Name_1, Date_1, Name_2, ...
df3.columns = cols
result = df.merge(df3, how='left', left_on='Customer', right_index=True)

The result is just as you want.

To understand the details, run each instruction separately and print the intermediate result. It is better to see the results for yourself than to read a description of them.
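For example, the first few stages can be inspected like this (a sketch; it assumes df and df2 from the question with the Date columns already converted as above, and the stage variable name is mine):

# Follow the pipeline stage by stage and print each intermediate frame
stage = df.explode('Companies')                            # one row per (Customer, company)
print(stage, '\n')
stage = stage.merge(df2, on='Companies', suffixes=('_x', ''))
print(stage, '\n')                                         # Date_x = customer date, Date = company date
stage = stage[stage.Date_x > stage.Date].drop(columns='Date_x')
print(stage)                                               # only rows where the customer date is later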

Caution: explode is a relatively new function, added in Pandas version 0.25. If you have an older version of Pandas, start by upgrading it.
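A quick way to check which version you are running (the upgrade command is the usual pip one and depends on your environment):

import pandas as pd
print(pd.__version__)          # explode requires pandas 0.25 or newer
# If it is older, upgrade from the shell, e.g.:
#   pip install --upgrade pandas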

Edit following the comment as of 03:25:19Z

df1 can have more columns.

To test it, I added an Xxx column to df1. The only change required in this case is to keep these additional columns from being copied into df3. To do this, the first instruction should be appended with:

.drop(columns=['Xxx'])

(in the general case, replace 'Xxx' with the actual list of additional columns).
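In other words, assuming the extra column is literally named Xxx, the first line of the solution becomes:

# Drop the extra column(s) right after exploding, so they are not carried into df3
df3 = df.explode('Companies').drop(columns=['Xxx'])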

To check the case of a different number of output columns, I changed the Date for the Gold X company in df2 to 2019-01-06, so that this company is also included in the output.
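For reference, that change can be made with a single assignment (a sketch, assuming df2's Date column has already been converted to datetime):

# Move Gold X's date before the customer's 2019-01-07, so it now counts as a match
df2.loc[df2['Companies'] == 'Gold X', 'Date'] = pd.Timestamp('2019-01-06')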

For your data, with the above changes, the result is:

  Customer                    Companies       Date   Xxx     Name_1     Date_1  Name_2     Date_2   Name_3     Date_3
0     Gold  [Gold Ltd, Gold X, Gold De] 2019-01-07  Xxx1   Gold Ltd 2019-01-01  Gold X 2019-01-06  Gold De 2018-07-07
1    Micro   [Microf, Micro Inc, Micre] 2019-02-10  Xxx2  Micro Inc 2017-09-27   Micre 2018-12-11      NaN        NaT

So, as you can see:

  • The result also contains the added column (Xxx).
  • The output also contains Name_3 and Date_3 columns.
  • Since only 2 matches were found for the second row of df1, these columns contain NaN and NaT there (the Pandas counterparts of None).
