Compare rows with conditions and generate a new dataframe in Pandas
I have a very large dataframe with this structure:
Timestamp Val1
Here is a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself: if one value is higher than another by "X", for example 4, and the time between them is less than "Y", for example 180 min, then I save some data about the pair.
Right now I'm using two for loops, one inside the other, but this takes too much time, and pandas usually has an option to avoid this.
This is my code:
import datetime as dt

cap_time, maxim = 180, 4
cap_time = cap_time * 60  # minutes -> seconds
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save it in a dataframe and perform other actions
Can Pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff(), but I'm not sure it is what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np

# temps and times are the numeric NumPy arrays from the question
temps_diff = temps - temps[:, None]  # [i, j] -> temps[j] - temps[i]
times_diff = times - times[:, None]  # [i, j] -> times[j] - times[i], in seconds
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
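To illustrate what the broadcasting step computes, here is a minimal sketch on a three-element array. The values in `a` are made up for illustration and are not from the question. Entry `[i, j]` of the difference matrix is `a[j] - a[i]`:

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])  # illustrative values only

# a has shape (3,) and a[:, None] has shape (3, 1);
# broadcasting expands both to shape (3, 3) before subtracting
diff = a - a[:, None]

print(diff)
# [[ 0.  3.  8.]
#  [-3.  0.  5.]
#  [-8. -5.  0.]]
```

The entries above the diagonal hold the "later minus earlier" differences, which is why the loops only need to look at `j > i`.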
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to select the results you want:
import numpy as np

n = len(temps)

temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)

results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])

results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01, 4.33450000e+01, 1.62247152e+09,
1.62247154e+09, 2.46934779e+01],
...
[ 3.64540000e+01, 4.24680000e+01, 1.62247156e+09,
1.62247157e+09, 1.22209179e+01],
...
])
As you can see, the resulting numbers are the same as above, although this time the results array contains more rows: because we didn't use the shortcut of starting the inner loop at i+1, pairs with j < i that satisfy the condition are included as well.
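If only pairs with j > i are wanted, as in the loop version, one option is to blank out the lower triangle of the mask with `np.triu` and read off the matching index pairs with `np.nonzero`. The following is a sketch under assumptions, not part of the approach above: it uses the sample data from the question, treats the thresholds as a temperature difference of 4 and a window of 180 minutes expressed in seconds, and the column names of the resulting dataframe (`temp_in`, `temp_out`, `time_in`, `time_out`, `seconds`) are invented for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question
times = np.array([1622471518.92911, 1622471525.034114, 1622471531.148139,
                  1622471537.284337, 1622471543.622588, 1622471549.734765,
                  1622471556.2518, 1622471562.361368, 1622471568.472718,
                  1622471574.826475])
temps = np.array([36.443, 36.445, 37.447, 36.449, 43.345,
                  36.451, 36.454, 41.461, 42.468, 36.470])

maxim = 4            # assumed: threshold on the temperature difference
cap_time = 180 * 60  # assumed: time window of 180 minutes, in seconds

temps_diff = temps - temps[:, None]  # [i, j] -> temps[j] - temps[i]
times_diff = times - times[:, None]  # [i, j] -> times[j] - times[i]
condition = (temps_diff > maxim) & (times_diff < cap_time)

# Keep only the strict upper triangle, i.e. pairs with j > i,
# which mirrors the inner loop starting at i+1
condition = np.triu(condition, k=1)

# Row and column indices of all matching pairs, in row-major order
i_idx, j_idx = np.nonzero(condition)

# Hypothetical column names, chosen for this example only
pairs = pd.DataFrame({
    "temp_in": temps[i_idx],
    "temp_out": temps[j_idx],
    "time_in": times[i_idx],
    "time_out": times[j_idx],
    "seconds": times_diff[i_idx, j_idx],
})
print(pairs)
```

Note that this keeps every matching pair, whereas the loop in the question stops at the first match for each i; if timestamps are unique, something like `pairs.drop_duplicates(subset="time_in")` could be applied afterwards to keep only the first match per starting row. Also keep in mind that the intermediate n×n arrays grow quadratically with the number of rows, which may matter for a very large dataframe.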