Compare rows with conditions and generate a new dataframe in Pandas

I have a very large DataFrame with this structure:

Timestamp    Val1

Here you can see a real sample:

    Timestamp           Temp         
0   1622471518.92911    36.443       
1   1622471525.034114   36.445       
2   1622471531.148139   37.447      
3   1622471537.284337   36.449      
4   1622471543.622588   43.345      
5   1622471549.734765   36.451      
6   1622471556.2518     36.454      
7   1622471562.361368   41.461     
8   1622471568.472718   42.468   
9   1622471574.826475   36.470

What I want to do is compare the Temp column with itself, and if the increase is higher than "X" (for example 4) and the time between the two readings is lower than "Y" (for example 180 min), then I save some data about them.

Right now I'm using two nested for loops, but this takes too much time, and pandas usually offers a way to avoid this.

This is my code:

import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60  # minutes -> seconds
temps = df['Temp'].values
times = df['Timestamp'].values

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()

            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break

# Then I save the results in a DataFrame and perform other actions

Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it is what I want (or I don't know how to use it).

Thank you very much.

Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:

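import numpy as np

# Pairwise differences via broadcasting: entry [i, j] holds value[j] - value[i]
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]

# True where temps[j] exceeds temps[i] by more than maxim
# and times[j] - times[i] is below cap_time (in seconds)
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)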

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j], 
                            times[i], times[j], 
                            times_diff[i, j]])
            
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
 [36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
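To make the broadcasting step above more concrete, here is a minimal sketch on a made-up three-element array (the array `a` and the name `diff` are purely illustrative and not part of the original code). It shows that subtracting `a[:, None]` from `a` yields a square matrix whose entry `[i, j]` is `a[j] - a[i]`, which is the layout that `temps_diff` and `times_diff` rely on.

```python
import numpy as np

# Hypothetical toy data, only meant to show the shape of the result.
a = np.array([1.0, 3.0, 6.0])

# a has shape (3,), a[:, None] has shape (3, 1); the subtraction broadcasts
# both to shape (3, 3), so that diff[i, j] == a[j] - a[i].
diff = a - a[:, None]

print(diff.shape)   # (3, 3)
print(diff[0, 2])   # a[2] - a[0] = 5.0
print(diff[2, 0])   # a[0] - a[2] = -5.0
```

Entries above the diagonal therefore compare a later element against an earlier one, which corresponds to the `j > i` pairs visited by the nested loops.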

To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:

import numpy as np

n = len(temps)

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]

condition = np.logical_and(temps_diff > maxim, 
                           times_diff < cap_time)
            
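# Stack five n x n layers: temps[i], temps[j], times[i], times[j], times_diff[i, j]
results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])

# Apply the same mask to every layer, then reshape so each match becomes one row
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T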
array([[ 3.64430000e+01,  4.33450000e+01,  1.62247152e+09,
         1.62247154e+09,  2.46934779e+01],
       ...
       [ 3.64540000e+01,  4.24680000e+01,  1.62247156e+09,
         1.62247157e+09,  1.22209179e+01],
       ... 
      ])

As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1, so matching pairs with j < i are included as well.
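If only the pairs with `j > i` are wanted, one possible variation (a sketch, not part of the original answer) is to keep just the strict upper triangle of the mask with `np.triu` and read the matching index pairs off with `np.nonzero`. The sample values are copied from the question; the DataFrame name `pairs` and its column names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample values from the question; maxim and cap_time as defined there.
temps = np.array([36.443, 36.445, 37.447, 36.449, 43.345,
                  36.451, 36.454, 41.461, 42.468, 36.470])
times = np.array([1622471518.92911, 1622471525.034114, 1622471531.148139,
                  1622471537.284337, 1622471543.622588, 1622471549.734765,
                  1622471556.2518, 1622471562.361368, 1622471568.472718,
                  1622471574.826475])
maxim, cap_time = 4, 180 * 60  # cap_time in seconds

# Entry [i, j] holds value[j] - value[i].
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = (temps_diff > maxim) & (times_diff < cap_time)

# Keep only the strict upper triangle, i.e. pairs with j > i,
# mirroring the inner loop that starts at i + 1.
condition = np.triu(condition, k=1)

# Row and column indices of all remaining True entries, in row-major order.
i_idx, j_idx = np.nonzero(condition)

# Hypothetical column names; adjust to whatever the later processing expects.
pairs = pd.DataFrame({
    "temp_in": temps[i_idx],
    "temp_out": temps[j_idx],
    "time_in": times[i_idx],
    "time_out": times[j_idx],
    "minutes": times_diff[i_idx, j_idx] / 60,
})
print(pairs.head())
```

Note that this still builds full n x n arrays, so memory use grows quadratically with the number of rows; for a very large DataFrame the data may need to be processed in chunks.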
