简体   繁体   English

根据另一个df中的列计算一个df中的行数

[英]count number of rows in one df depending on column in another df

Ok so I have a first dataframe df1:好的,所以我有第一个数据框 df1:

|timestamp                |ip         |
|2022-01-06 11:58:53+00:00|1.1.1.5.   |
|2022-01-08 03:56:35+00:00|10.10.10.24|
|2022-01-09 22:29:30+00:00|3.3.3.89.  |
|2022-03-08 22:37:52+00:00|8.8.8.88.  |

And a second dataframe, df2:还有第二个数据框 df2:

|timestamp                |other|
|2022-01-07 22:08:59+00:00|other|
|2022-01-07 23:08:59+00:00|other|
|2022-01-09 17:04:09+00:00|other|
|2022-03-05 17:04:09+00:00|other|

And I would like to count how many rows there is in df2 depending on the 2 consecutive timestamps in df1, meaning:我想根据 df1 中的 2 个连续时间戳计算 df2 中有多少行,这意味着:

|timestamp                |ip         |count|
|2022-01-06 11:58:53+00:00|1.1.1.5    |NaN  |
|2022-01-08 03:56:35+00:00|10.10.10.24|2    |
|2022-01-09 22:29:30+00:00|3.3.3.89   |1    |
|2022-03-08 22:37:52+00:00|8.8.8.88   |1    |

What I tried is to first create another column in df1 with the previous timestamp with this:我尝试的是首先在 df1 中使用先前的时间戳创建另一列:

df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})

which gives me:这给了我:

|timestamp                |ip         |timestamp_b4             |
|2022-01-06 11:58:53+00:00|1.1.1.5    |2022-03-08 22:37:52+00:00|
|2022-01-08 03:56:35+00:00|10.10.10.24|2022-01-06 11:58:53+00:00|
|2022-01-09 22:29:30+00:00|3.3.3.89   |2022-01-08 03:56:35+00:00|
|2022-03-08 22:37:52+00:00|8.8.8.88   |2022-01-09 22:29:30+00:00|

and then do some sort of然后做某种

s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())

But unfortunately it does not work since pandas requires comparing identically-labeled objects.但不幸的是,它不起作用,因为 pandas 需要比较相同标记的对象。

Is there a good pandas/pythonic way to do that?有没有一种好的 pandas/pythonic 方法可以做到这一点?

Thanks谢谢

Here is one approach.这是一种方法。 Note that columns from df1 are retained in the final output df:请注意,来自 df1 的列保留在最终输出 df 中:

Starting with this df1 that has an additional column:从这个 df1 开始,它有一个额外的列:

                   timestamp           ip another_col
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4 

df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
    .assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
    .assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
    .drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})

                   timestamp           ip another_col  count
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1      0
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2      2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3      1
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4      1

Note that another_col is retained in the output.请注意, another_col保留在输出中。

This approach merges then sorts by timestamp then creates another column - c2 - that is used to copy the df1 timestamp and then backfill it against the df2 timestamps.这种方法合并然后按时间戳排序,然后创建另一列 - c2 - 用于复制 df1 时间戳,然后将其回填到 df2 时间戳。 From there the instances are grouped by df1 timestamp (reflected in the c2 column) and counted.从那里实例按 df1 时间戳(反映在 c2 列中)分组并计数。 In other words, the backfill of the df1 timestamp allows it to be used as a grouping key for counting the preceding df2 timestamps.换句话说,df1 时间戳的回填允许它用作分组键来计算前面的 df2 时间戳。 After that the df is trimmed back down to match the output requirements.之后,df 被修剪回以匹配输出要求。

Also note that with this approach the dataframes need to be indexed 0-n like they currently are in my example.另请注意,使用这种方法,数据帧需要像我的示例中当前那样被索引为 0-n。

def time_compare(df1,df2):
  return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values) & (df1['timestamp'].values[i] > df2['timestamp'].values)) for i in range(len(df1.timestamp))]

df2.join(pd.Series(time_compare(df1,df2), name='Count'))

Weird I can't post the dataframe output as usual:奇怪的是我不能像往常一样发布数据帧输出:

index指数 timestamp时间戳 other其他 Count数数
0 0 2022-01-07 22:08:5900:00 2022-01-07 22:08:5900:00 other其他 0 0
1 1 2022-01-07 23:08:5900:00 2022-01-07 23:08:5900:00 other其他 2 2
2 2 2022-01-09 17:04:0900:00 2022-01-09 17:04:0900:00 other其他 1 1
3 3 2022-03-05 17:04:0900:00 2022-03-05 17:04:0900:00 other其他 1 1

Try this, it's an example of what you can do to find the solution试试这个,这是你可以做些什么来找到解决方案的一个例子

import pandas as pd
table1 = {
    'timestamp':['2022-01-06 11:58:53+00:00','2022-01-08 03:56:35+00:00',
                 '2022-01-09 22:29:30+00:00','2022-03-08 22:37:52+00:00'],
    'other':['other','other','other','other']
              }
df1 = pd.DataFrame(table1)

table2 = {
    'timestamp':['2022-01-07 23:08:59+00:00','2022-01-07 22:08:59+00:00',
                 '2022-03-05 17:04:09+00:00','2022-01-09 17:04:09+00:00'],
    'ip':['1.1.1.5.','10.10.10.24','3.3.3.89.','8.8.8.88.']
    
              }

df2 = pd.DataFrame(table2)

print(f'\n\n-------------df1-----------\n\n')
print(df2)
print(f'\n\n-------------df2-----------\n\n')
print(df1)

listdf1 = df1['timestamp'].values.tolist()
def func(line):
    cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case = False)]
    temp = line.name - 1
    if temp == -1:
        temp = 0

    try :
        cont = [cont['timestamp'].iloc[temp],line[0]]
    except:
        cont = [line[0],line[0]]

    cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case = False)]
    
    repetitions = 0
    for x in cont2:

        if int(x[8:10]) >= int(cont[0][8:10]) and int(x[8:10]) <= int(cont[1][8:10]) and int(x[8:10]) <= int(line[0][8:10]):
            repetitions += 1
    return repetitions
    

print(f'\n\n-------------BREAK-----------\n\n')

df1['count'] = df1.apply(func, axis = 1)

print(df1)

Ok so in the end, here's what I did.好的,最后,这就是我所做的。 I used @Drakax answer.我用@Drakax 回答。

I created a column with the previous timestamps我用以前的时间戳创建了一个列

df1 = df1.assign(previous_deconnection=df1.timestamp.shift(1)).fillna({'previous_deconnection': df1.timestamp})

then I set the first row value,然后我设置第一行值,

df1['previous_deconnection'].iloc[0]=pd.to_datetime('2022-01-01 00:00:00+00:00')

then I apply this function to each row of df1然后我将此函数应用于 df1 的每一行

def time_compare(a,b):  
  return len(b[((b['timestamp'] >= a['previous_deconnection']) & (b['timestamp'] <= a['timestamp']))])

df1['Count'] = df1.apply(lambda row: time_compare(row, df2), axis=1)

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM