Grouping rows by proximity of floats in a Python pandas DataFrame
It feels so straightforward, but I haven't found the answer to my question yet: how does one group by proximity, or closeness, of two floats in pandas?
OK, I could do this the loopy way, but my data is big, and I hope I can expand my pandas skills with your help and do this elegantly.
I have a column of times in nanoseconds in my DataFrame. I want to group these into small clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters; there will be a massive number of very small clusters. I thought I could, e.g., introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth, so that groupby becomes straightforward thereafter. Something like:
index | t (ns) | cluster |
---|---|---|
71 | 1524957248.4375 | 1 |
72 | 1524957265.625 | 1 |
699 | 14624846476.5625 | 2 |
700 | 14624846653.125 | 2 |
701 | 14624846661.287 | 2 |
1161 | 25172864926.5625 | 3 |
1160 | 25172864935.9375 | 3 |
Thanks for your help!
Assuming you want to create the "cluster" column from the index, based on the proximity of successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
Or, using the "t (ns)" column instead of the index:
# gaps inside a cluster reach ~177 ns in the sample, so the threshold must be larger than that
thresh = 1000
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
Output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
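For anyone who wants to try this end to end, here is a minimal, self-contained sketch built on the sample rows from the question. The name `max_gap` and its value of 1000 ns are my own assumptions for illustration; pick whatever gap separates "same cluster" from "next cluster" in your data. The sort step is also an addition, since `diff()` only compares neighbouring rows.

```python
import pandas as pd

# Sample rows copied from the question; the index holds the original row labels.
df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]},
    index=[71, 72, 699, 700, 701, 1161, 1160],
)

# Assumed threshold: the largest gap (in ns) still counted as the same cluster.
max_gap = 1000

# Sort by time so that diff() measures the gap to the nearest earlier value.
ordered = df.sort_values("t (ns)")

# A new cluster starts wherever the gap to the previous row exceeds max_gap;
# the running sum of those "starts" gives a cluster id, shifted to begin at 1.
ordered["cluster"] = ordered["t (ns)"].diff().gt(max_gap).cumsum().add(1)

# With the label column in place, groupby works as usual.
summary = ordered.groupby("cluster")["t (ns)"].agg(["count", "min", "max"])
print(summary)
```

Because everything is vectorised, this should scale to a large number of small clusters without an explicit Python loop.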
You can 'round' the t (ns) column by floor-dividing it by a threshold value and then looking at the differences between successive results:
df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)
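One thing to watch with floor division: the bin edges are fixed multiples of the divisor, so two values that are very close but sit on either side of an edge end up in different clusters. The sketch below uses made-up values (not from the question) to show the effect, and compares it with a gap-based labelling using an assumed 1000 ns threshold.

```python
import pandas as pd

# Hypothetical times: the first two are only 20 ns apart but straddle 2e8.
t = pd.Series([199_999_990.0, 200_000_010.0, 900_000_000.0], name="t (ns)")

bin_width = 10e7  # 1e8 ns, the same divisor as in the snippet above

# Floor-division labelling: a new cluster whenever the bin number increases.
by_floor = (t // bin_width).diff().gt(0).cumsum().add(1)

# Gap-based labelling: a new cluster whenever the gap exceeds 1000 ns (assumed).
by_gap = t.diff().gt(1000).cumsum().add(1)

print(pd.DataFrame({"t (ns)": t, "by_floor": by_floor, "by_gap": by_gap}))
```

Here the floor-division version splits the first two rows, while the gap-based version keeps them together. Whether that matters depends on how far apart your clusters are relative to the divisor.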
Or you can experiment with the number of bins into which you split your data:
bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(
        df['t (ns)'], bins=bins
    ).cat.rename_categories(range(1, bins + 1))
)
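As a variation on the same idea, `pd.cut` can return integer bin numbers directly via `labels=False`, which avoids renaming the categories afterwards. This is only a sketch on the question's sample values; `n_bins` is a name I picked, and equal-width bins only line up with the clusters here because the three groups happen to be evenly spread. With an unknown number of clusters it is probably less suitable than the gap-based approach.

```python
import pandas as pd

# Sample times copied from the question.
df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]}
)

n_bins = 3  # assumed to be known in advance for this sketch

# labels=False makes pd.cut return 0-based bin positions; add 1 to start at 1.
df["bin"] = pd.cut(df["t (ns)"], bins=n_bins, labels=False) + 1

print(df)
```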