
count number of rows in one df depending on column in another df

Ok so I have a first dataframe df1:

|timestamp                |ip         |
|-------------------------|-----------|
|2022-01-06 11:58:53+00:00|1.1.1.5.   |
|2022-01-08 03:56:35+00:00|10.10.10.24|
|2022-01-09 22:29:30+00:00|3.3.3.89.  |
|2022-03-08 22:37:52+00:00|8.8.8.88.  |

And a second dataframe, df2:

|timestamp                |other|
|-------------------------|-----|
|2022-01-07 22:08:59+00:00|other|
|2022-01-07 23:08:59+00:00|other|
|2022-01-09 17:04:09+00:00|other|
|2022-03-05 17:04:09+00:00|other|

And I would like to count how many rows there are in df2 between each pair of consecutive timestamps in df1, meaning:

|timestamp                |ip         |count|
|-------------------------|-----------|-----|
|2022-01-06 11:58:53+00:00|1.1.1.5    |NaN  |
|2022-01-08 03:56:35+00:00|10.10.10.24|2    |
|2022-01-09 22:29:30+00:00|3.3.3.89   |1    |
|2022-03-08 22:37:52+00:00|8.8.8.88   |1    |

What I tried first is to create another column in df1 holding the previous timestamp:

df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})

which gives me:

|timestamp                |ip         |timestamp_b4             |
|-------------------------|-----------|-------------------------|
|2022-01-06 11:58:53+00:00|1.1.1.5    |2022-03-08 22:37:52+00:00|
|2022-01-08 03:56:35+00:00|10.10.10.24|2022-01-06 11:58:53+00:00|
|2022-01-09 22:29:30+00:00|3.3.3.89   |2022-01-08 03:56:35+00:00|
|2022-03-08 22:37:52+00:00|8.8.8.88   |2022-01-09 22:29:30+00:00|

and then do some sort of

s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())

But unfortunately it does not work, since pandas can only compare identically-labeled Series objects.

Is there a good pandas/pythonic way to do that?

Thanks

Here is one approach. Note that columns from df1 are retained in the final output df:

Starting with this df1 that has an additional column:

                   timestamp           ip another_col
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4 

(df1.merge(df2, on='timestamp', how='outer')          # combine both frames
    .sort_values('timestamp')                         # chronological order
    .assign(c1=df1.loc[~df1['ip'].isna()]['ip'],      # df1 rows keep their ip
            c2=lambda x: x['c1'].bfill())             # backfill that ip onto the df2 rows
    .assign(count=lambda x: x.groupby('c2').apply('count')
                             .reset_index(drop=True)['timestamp'] - 1)
    .drop(['other', 'c1', 'c2'], axis=1)
    .dropna()
    .astype({'count': 'int32'}))

                   timestamp           ip another_col  count
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1      0
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2      2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3      1
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4      1

Note that another_col is retained in the output.

This approach merges the two frames, sorts by timestamp, and then creates another column, c2, which copies the df1 ip and backfills it across the intervening df2 rows. From there, the rows are grouped by that key and counted. In other words, the backfill lets each df1 row act as a grouping key for the df2 timestamps that precede it. After that, the frame is trimmed back down to match the output requirements.

Also note that with this approach the dataframes need to be indexed 0-n, as they currently are in my example.
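As a rough sketch, the same merge-and-backfill idea can be written step by step on the question's sample data (assuming the ip values are unique so they can serve as the grouping key; the stray trailing dots on the ips are dropped here):

```python
import pandas as pd

df1 = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2022-01-06 11:58:53+00:00', '2022-01-08 03:56:35+00:00',
        '2022-01-09 22:29:30+00:00', '2022-03-08 22:37:52+00:00']),
    'ip': ['1.1.1.5', '10.10.10.24', '3.3.3.89', '8.8.8.88'],
})
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2022-01-07 22:08:59+00:00', '2022-01-07 23:08:59+00:00',
        '2022-01-09 17:04:09+00:00', '2022-03-05 17:04:09+00:00']),
    'other': ['other'] * 4,
})

# Outer-merge and sort chronologically; rows that came from df2
# have NaN in 'ip'.
merged = df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp')

# Backfill the df1 ip so each df2 row is labelled with the *next*
# df1 row, then count the df2 rows in each group.
merged['key'] = merged['ip'].bfill()
counts = merged[merged['ip'].isna()].groupby('key').size()

# Map the per-interval counts back onto df1; the first row has no
# preceding interval, so it gets NaN.
out = df1.assign(count=df1['ip'].map(counts))
print(out)
```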

import numpy as np

def time_compare(df1, df2):
  return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values)
                 & (df1['timestamp'].values[i] > df2['timestamp'].values))
          for i in range(len(df1.timestamp))]  # note: at i == 0 this wraps to the last timestamp

df2.join(pd.Series(time_compare(df1, df2), name='Count'))

Weird, I can't post the dataframe output as usual:

index  timestamp                  other  Count
0      2022-01-07 22:08:59+00:00  other  0
1      2022-01-07 23:08:59+00:00  other  2
2      2022-01-09 17:04:09+00:00  other  1
3      2022-03-05 17:04:09+00:00  other  1
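A self-contained sketch of this idea, handling the i == 0 case explicitly instead of letting the index wrap around to the last timestamp, and using the sample timestamps from the question:

```python
import numpy as np
import pandas as pd

# boundary timestamps (df1) and event timestamps (df2) from the question
df1_ts = pd.Series(pd.to_datetime([
    '2022-01-06 11:58:53+00:00', '2022-01-08 03:56:35+00:00',
    '2022-01-09 22:29:30+00:00', '2022-03-08 22:37:52+00:00']))
df2_ts = pd.Series(pd.to_datetime([
    '2022-01-07 22:08:59+00:00', '2022-01-07 23:08:59+00:00',
    '2022-01-09 17:04:09+00:00', '2022-03-05 17:04:09+00:00']))

def time_compare(bounds, events):
    """Count events strictly between each pair of consecutive bounds."""
    counts = [0]  # the first row has no lower bound
    for i in range(1, len(bounds)):
        lo, hi = bounds.values[i - 1], bounds.values[i]
        counts.append(int(np.sum((events.values > lo) & (events.values < hi))))
    return counts

print(time_compare(df1_ts, df2_ts))
```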

Try this; it's an example of what you can do to find a solution:

import pandas as pd

table1 = {
    'timestamp': ['2022-01-06 11:58:53+00:00', '2022-01-08 03:56:35+00:00',
                  '2022-01-09 22:29:30+00:00', '2022-03-08 22:37:52+00:00'],
    'ip': ['1.1.1.5.', '10.10.10.24', '3.3.3.89.', '8.8.8.88.']
}
df1 = pd.DataFrame(table1)

table2 = {
    'timestamp': ['2022-01-07 23:08:59+00:00', '2022-01-07 22:08:59+00:00',
                  '2022-03-05 17:04:09+00:00', '2022-01-09 17:04:09+00:00'],
    'other': ['other', 'other', 'other', 'other']
}
df2 = pd.DataFrame(table2)

print('\n\n-------------df1-----------\n\n')
print(df1)
print('\n\n-------------df2-----------\n\n')
print(df2)

def func(line):
    # df1 rows whose timestamp shares this row's year-month prefix
    cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case=False)]
    temp = max(line.name - 1, 0)  # index of the previous df1 row

    try:
        cont = [cont['timestamp'].iloc[temp], line[0]]
    except IndexError:
        cont = [line[0], line[0]]

    # df2 timestamps in the same year-month
    cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case=False)]

    repetitions = 0
    for x in cont2:
        # compare the day-of-month substrings against the interval bounds
        if (int(cont[0][8:10]) <= int(x[8:10]) <= int(cont[1][8:10])
                and int(x[8:10]) <= int(line[0][8:10])):
            repetitions += 1
    return repetitions
    

print(f'\n\n-------------BREAK-----------\n\n')

df1['count'] = df1.apply(func, axis = 1)

print(df1)

OK, so in the end here's what I did, based on @Drakax's answer.

First I created a column with the previous timestamps:

df1 = df1.assign(previous_deconnection=df1.timestamp.shift(1)).fillna({'previous_deconnection': df1.timestamp})

Then I set the first row's value:

df1.loc[df1.index[0], 'previous_deconnection'] = pd.to_datetime('2022-01-01 00:00:00+00:00')

Then I applied this function to each row of df1:

def time_compare(a, b):
  return len(b[(b['timestamp'] >= a['previous_deconnection']) & (b['timestamp'] <= a['timestamp'])])

df1['Count'] = df1.apply(lambda row: time_compare(row, df2), axis=1)
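Put end to end with the question's sample data (the '2022-01-01' lower bound is an arbitrary seed for the first row, as above), this runs as:

```python
import pandas as pd

df1 = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2022-01-06 11:58:53+00:00', '2022-01-08 03:56:35+00:00',
        '2022-01-09 22:29:30+00:00', '2022-03-08 22:37:52+00:00']),
    'ip': ['1.1.1.5', '10.10.10.24', '3.3.3.89', '8.8.8.88'],
})
df2 = pd.DataFrame({
    'timestamp': pd.to_datetime([
        '2022-01-07 22:08:59+00:00', '2022-01-07 23:08:59+00:00',
        '2022-01-09 17:04:09+00:00', '2022-03-05 17:04:09+00:00']),
    'other': ['other'] * 4,
})

# previous timestamp per row; the first row gets NaT
df1 = df1.assign(previous_deconnection=df1['timestamp'].shift(1))
# seed the first row with an explicit lower bound
df1.loc[df1.index[0], 'previous_deconnection'] = pd.Timestamp('2022-01-01 00:00:00+00:00')

def time_compare(a, b):
    # count df2 rows falling between the previous and current df1 timestamps
    return len(b[(b['timestamp'] >= a['previous_deconnection'])
                 & (b['timestamp'] <= a['timestamp'])])

df1['Count'] = df1.apply(lambda row: time_compare(row, df2), axis=1)
print(df1[['timestamp', 'ip', 'Count']])
```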
