Count rows in one df based on a column in another df

Question

OK, so I have a first DataFrame, df1:

|timestamp                |ip         |
|2022-01-06 11:58:53+00:00|1.1.1.5.   |
|2022-01-08 03:56:35+00:00|10.10.10.24|
|2022-01-09 22:29:30+00:00|3.3.3.89.  |
|2022-03-08 22:37:52+00:00|8.8.8.88.  |

And a second DataFrame, df2:

|timestamp                |other|
|2022-01-07 22:08:59+00:00|other|
|2022-01-07 23:08:59+00:00|other|
|2022-01-09 17:04:09+00:00|other|
|2022-03-05 17:04:09+00:00|other|

I want to count how many rows of df2 fall between each pair of consecutive timestamps in df1, which means:

|timestamp                |ip         |count|
|2022-01-06 11:58:53+00:00|1.1.1.5    |NaN  |
|2022-01-08 03:56:35+00:00|10.10.10.24|2    |
|2022-01-09 22:29:30+00:00|3.3.3.89   |1    |
|2022-03-08 22:37:52+00:00|8.8.8.88   |1    |

What I tried was to first create another column in df1 holding the previous timestamp:

df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})

This gives me:

|timestamp                |ip         |timestamp_b4             |
|2022-01-06 11:58:53+00:00|1.1.1.5    |2022-03-08 22:37:52+00:00|
|2022-01-08 03:56:35+00:00|10.10.10.24|2022-01-06 11:58:53+00:00|
|2022-01-09 22:29:30+00:00|3.3.3.89   |2022-01-08 03:56:35+00:00|
|2022-03-08 22:37:52+00:00|8.8.8.88   |2022-01-09 22:29:30+00:00|

And then do something like:

s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())

But unfortunately it doesn't work, because pandas can only compare identically-labeled objects.

Is there a good pandas/pythonic way to do this?

Thanks

Answer 1

Here is one approach. Note that the columns from df1 are kept in the final output df.

Starting from this df1, which has an extra column:

                   timestamp           ip another_col
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4 

df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
    .assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
    .assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
    .drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})

                   timestamp           ip another_col  count
0  2022-01-06 11:58:53+00:00     1.1.1.5.       val_1      0
1  2022-01-08 03:56:35+00:00  10.10.10.24       val_2      2
2  2022-01-09 22:29:30+00:00    3.3.3.89.       val_3      1
3  2022-03-08 22:37:52+00:00    8.8.8.88.       val_4      1

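Note that another_col is kept in the output.

This approach merges the two frames and sorts by timestamp, then creates another column, c2, which copies the df1 timestamps and backfills them onto the df2 rows. From there the rows are grouped by df1 timestamp (reflected in column c2) and counted. In other words, backfilling the df1 timestamps lets them serve as the grouping key for counting the df2 timestamps that precede them. After that, the df is trimmed back to match the required output.

Also note that with this approach the DataFrames need to be indexed 0-n, as they currently are in my example.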
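The same "attach each df2 row to the next df1 timestamp, then count" idea can be sketched with `pd.merge_asof(..., direction="forward")` in place of the outer merge plus backfill. This is a different technique from the one in the answer, shown only as an illustration. The sample data is copied from the question, the helper column name `next_df1` is made up for this sketch, and the timestamps are assumed to be parsed as datetimes:

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-06 11:58:53+00:00",
        "2022-01-08 03:56:35+00:00",
        "2022-01-09 22:29:30+00:00",
        "2022-03-08 22:37:52+00:00",
    ]),
    "ip": ["1.1.1.5", "10.10.10.24", "3.3.3.89", "8.8.8.88"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-07 22:08:59+00:00",
        "2022-01-07 23:08:59+00:00",
        "2022-01-09 17:04:09+00:00",
        "2022-03-05 17:04:09+00:00",
    ]),
    "other": ["other"] * 4,
})

# Copy the df1 timestamps into a helper column (hypothetical name) so they
# survive the merge; merge_asof requires both sides sorted on the key.
right = df1[["timestamp"]].assign(next_df1=df1["timestamp"]).sort_values("timestamp")

# For every df2 row, find the first df1 timestamp at or after it.
matched = pd.merge_asof(
    df2.sort_values("timestamp"), right, on="timestamp", direction="forward"
)

# Count the df2 rows attached to each df1 timestamp and map the counts back.
counts = matched["next_df1"].value_counts()
result = df1.assign(count=df1["timestamp"].map(counts).fillna(0).astype(int))
print(result)
```

Unlike the answer above, this sketch does not depend on a 0-n index. A df2 row later than every df1 timestamp gets no match and is not counted, and a df2 timestamp exactly equal to a df1 timestamp is counted on that df1 row.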

Answer 2

import numpy as np
import pandas as pd

def time_compare(df1, df2):
    t1, t2 = df1['timestamp'].values, df2['timestamp'].values
    # For i == 0, t1[i-1] wraps around to the last df1 timestamp, so the first count is 0 when df1 is sorted.
    return [np.sum((t1[i-1] < t2) & (t1[i] > t2)) for i in range(len(t1))]

df2.join(pd.Series(time_compare(df1, df2), name='Count'))

Oddly, I can't post the DataFrame output the way I usually do:

index	timestamp	other	Count
0	2022-01-07 22:08:59+00:00	other	0
1	2022-01-07 23:08:59+00:00	other	2
2	2022-01-09 17:04:09+00:00	other	1
3	2022-03-05 17:04:09+00:00	other	1
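For larger frames, the per-row comparison can be replaced by a binary search. The following is a minimal sketch, not part of the original answer, assuming df1 is already sorted by timestamp and both columns hold parsed datetimes. It uses `np.searchsorted` and counts df2 rows in the half-open interval (previous, current], whereas the list comprehension above uses strict inequalities on both ends:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-06 11:58:53+00:00",
        "2022-01-08 03:56:35+00:00",
        "2022-01-09 22:29:30+00:00",
        "2022-03-08 22:37:52+00:00",
    ]),
    "ip": ["1.1.1.5", "10.10.10.24", "3.3.3.89", "8.8.8.88"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-07 22:08:59+00:00",
        "2022-01-07 23:08:59+00:00",
        "2022-01-09 17:04:09+00:00",
        "2022-03-05 17:04:09+00:00",
    ]),
    "other": ["other"] * 4,
})

# .values converts the tz-aware timestamps to UTC datetime64 NumPy arrays.
t1 = df1["timestamp"].values          # assumed sorted ascending
t2 = np.sort(df2["timestamp"].values)

# pos[i] = number of df2 timestamps that are <= df1 timestamp i.
pos = np.searchsorted(t2, t1, side="right")

# The difference between consecutive positions is the number of df2 rows
# in (previous, current]; the first row has no predecessor, so it is NaN.
result = df1.assign(count=pd.Series(pos, index=df1.index).diff())
print(result)
```

The counts are attached to df1, as the question asks, and the first row is NaN rather than 0 because it has no previous timestamp to compare against.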

Answer 3

Try this. It is an example of what you could do to find a solution:

import pandas as pd
table1 = {
    'timestamp':['2022-01-06 11:58:53+00:00','2022-01-08 03:56:35+00:00',
                 '2022-01-09 22:29:30+00:00','2022-03-08 22:37:52+00:00'],
    'other':['other','other','other','other']
              }
df1 = pd.DataFrame(table1)

table2 = {
    'timestamp':['2022-01-07 23:08:59+00:00','2022-01-07 22:08:59+00:00',
                 '2022-03-05 17:04:09+00:00','2022-01-09 17:04:09+00:00'],
    'ip':['1.1.1.5.','10.10.10.24','3.3.3.89.','8.8.8.88.']
    
              }

df2 = pd.DataFrame(table2)

print(f'\n\n-------------df1-----------\n\n')
print(df2)
print(f'\n\n-------------df2-----------\n\n')
print(df1)

listdf1 = df1['timestamp'].values.tolist()
def func(line):
    cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case = False)]
    temp = line.name - 1
    if temp == -1:
        temp = 0

    try :
        cont = [cont['timestamp'].iloc[temp],line[0]]
    except:
        cont = [line[0],line[0]]

    cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case = False)]
    
    repetitions = 0
    for x in cont2:

        if int(x[8:10]) >= int(cont[0][8:10]) and int(x[8:10]) <= int(cont[1][8:10]) and int(x[8:10]) <= int(line[0][8:10]):
            repetitions += 1
    return repetitions
    

print(f'\n\n-------------BREAK-----------\n\n')

df1['count'] = df1.apply(func, axis = 1)

print(df1)

Answer 4

OK, in the end, this is what I did. I used @Drakax's answer.

I created a column with the previous timestamp:

df1 = df1.assign(previous_deconnection=df1.timestamp.shift(1)).fillna({'previous_deconnection': df1.timestamp})

Then I set the value for the first row:

df1.loc[df1.index[0], 'previous_deconnection'] = pd.to_datetime('2022-01-01 00:00:00+00:00')

Then I applied this function to every row of df1:

def time_compare(a,b):  
  return len(b[((b['timestamp'] >= a['previous_deconnection']) & (b['timestamp'] <= a['timestamp']))])

df1['Count'] = df1.apply(lambda row: time_compare(row, df2), axis=1)
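Put together, the steps above can be run end to end on the sample data from the question. This is a sketch under a few assumptions: the timestamps are parsed with `pd.to_datetime`, the 2022-01-01 sentinel for the first row is the arbitrary value used above, and `Series.between` (inclusive on both ends by default) stands in for the explicit `>=` / `<=` mask:

```python
import pandas as pd

df1 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-06 11:58:53+00:00",
        "2022-01-08 03:56:35+00:00",
        "2022-01-09 22:29:30+00:00",
        "2022-03-08 22:37:52+00:00",
    ]),
    "ip": ["1.1.1.5", "10.10.10.24", "3.3.3.89", "8.8.8.88"],
})
df2 = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-01-07 22:08:59+00:00",
        "2022-01-07 23:08:59+00:00",
        "2022-01-09 17:04:09+00:00",
        "2022-03-05 17:04:09+00:00",
    ]),
    "other": ["other"] * 4,
})

# Previous timestamp per row; the first row has none, so fill it with an
# early sentinel (arbitrary choice taken from the answer).
sentinel = pd.Timestamp("2022-01-01 00:00:00+00:00")
df1 = df1.assign(previous_deconnection=df1["timestamp"].shift(1).fillna(sentinel))

def time_compare(row, other):
    # Count rows of `other` whose timestamp lies in [previous, current].
    mask = other["timestamp"].between(row["previous_deconnection"], row["timestamp"])
    return int(mask.sum())

df1["Count"] = df1.apply(lambda row: time_compare(row, df2), axis=1)
print(df1)
```

Because the function runs once per row of df1 and scans all of df2 each time, this is simple to read but scales with the product of the two lengths.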

4 solutions

Solution 1
1 2022-06-01 16:12:33

Solution 2
1 2022-06-01 16:49:47

Solution 3
0 2022-06-01 14:17:27

Solution 4
0 Accepted 2022-06-02 15:10:38
