
Grouping rows by proximity of floats in a python pandas dataframe

It feels so straightforward, but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?

OK, I could do this the loopy way, but my data is big, and I hope I can expand my pandas skills with your help and do this elegantly:

I have a column of times in nanoseconds in my DataFrame. I want to group these into small clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters; there will be a massive number of very small ones. I thought I could, e.g., introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second, and so forth, so that groupby becomes straightforward thereafter. Something like:

      t (ns)             cluster
71    1524957248.4375    1
72    1524957265.625     1
699   14624846476.5625   2
700   14624846653.125    2
701   14624846661.287    2
1161  25172864926.5625   3
1160  25172864935.9375   3

Thanks for your help!

Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:

thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)

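Or, using the "t (ns)" column (sort by it first if the rows are not already in time order):

# the threshold is in the column's units (ns); 1000 is an illustrative choice.
# With thresh = 1 every row of the sample would start its own cluster,
# because gaps between neighbours in a cluster are roughly 8 to 180 ns.
thresh = 1000
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)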

Output:

            t (ns)  cluster
71    1.524957e+09        1
72    1.524957e+09        1
699   1.462485e+10        2
700   1.462485e+10        2
701   1.462485e+10        2
1161  2.517286e+10        3
1160  2.517286e+10        3
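
For a self-contained check, the sketch below rebuilds the sample rows from the question, sorts them by time (diff only compares neighbouring rows, so time order is assumed), and labels clusters by gap size. The 1000 ns threshold is an assumption chosen to suit this sample, not a value given in the question.

```python
import pandas as pd

# Sample rows from the question; the index holds the original row labels.
df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]},
    index=[71, 72, 699, 700, 701, 1161, 1160],
)

# Assumed threshold: a gap larger than this (in ns) starts a new cluster.
thresh = 1000

# diff() compares each row with the previous one, so sort by time first.
df = df.sort_values("t (ns)")
df["cluster"] = df["t (ns)"].diff().gt(thresh).cumsum().add(1)

# With the label in place, an ordinary groupby summarises each cluster.
summary = df.groupby("cluster")["t (ns)"].agg(["count", "min", "max"])
print(summary)
```

The right threshold depends on the data: it should be larger than the widest gap inside a cluster and smaller than the narrowest gap between clusters.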

You can 'round' the t (ns) column by floor-dividing it by a threshold value and then looking at the differences between successive rows:

df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)
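
One thing to watch with fixed-width bins is that two values can be very close and still fall on opposite sides of a bin edge. The sketch below uses made-up values (not from the question) and an assumed width of 1e8 to compare floor-division labels with gap-based labels on the same series.

```python
import pandas as pd

# Hypothetical values: the first two are 2 ns apart but straddle a multiple of 1e8.
t = pd.Series([199_999_999.0, 200_000_001.0, 900_000_000.0])

width = 1e8  # assumed bin width / gap threshold

# Fixed-bin labelling: a new cluster starts whenever the bin number increases.
by_bins = (t // width).diff().gt(0).cumsum().add(1)

# Gap-based labelling: a new cluster starts whenever the gap exceeds the width.
by_gaps = t.diff().gt(width).cumsum().add(1)

print(by_bins.tolist())  # the two close values get different labels
print(by_gaps.tolist())  # the two close values share a label
```

If clusters are far apart compared with the bin width, as in the question's sample, both approaches give the same labels.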

Or you can experiment with the number of equal-width bins (clusters) into which you organize your data:

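import pandas as pd

bins = 3  # number of equal-width bins to try
df[['t (ns)']].assign(
    # cut into equal-width intervals, then relabel them 1..bins
    bins=pd.cut(df['t (ns)'], bins=bins)
    .cat.rename_categories(list(range(1, bins + 1)))
)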
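
As a variation, pd.cut can return integer bin codes directly via labels=False, which skips the category renaming. This is a minimal sketch on the question's sample values. The bin count of 3 is an assumption that happens to match this sample, and it would have to be known or guessed for real data.

```python
import pandas as pd

# Sample times from the question.
df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]},
    index=[71, 72, 699, 700, 701, 1161, 1160],
)

n_bins = 3  # assumed number of equal-width bins

# labels=False yields 0-based integer bin codes; add 1 for 1-based labels.
df["cluster"] = pd.cut(df["t (ns)"], bins=n_bins, labels=False) + 1
print(df)
```

Equal-width bins span the whole value range, so this suits a small, known number of well-separated groups better than a large number of tiny clusters.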
