Grouping rows by float proximity in a Python pandas DataFrame

Question

It feels so straightforward, but I haven't found an answer to my question yet. How does one group by proximity, or closeness, of floats in pandas?

OK, I could do this with a loop, but my data is big, and I hope I can expand my pandas skills with your help and do this elegantly:

I have a column of times in nanoseconds in my DataFrame. I want to group these into small clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters; there will be a massive number of very small ones. I thought I could, e.g., introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second, and so forth, so that groupby becomes straightforward thereafter. Something like:

                t (ns)  cluster
71     1524957248.4375        1
72      1524957265.625        1
699   14624846476.5625        2
700    14624846653.125        2
701    14624846661.287        2
1161  25172864926.5625        3
1160  25172864935.9375        3

Thanks for your help!

Answer 1

Assuming you want to create the "cluster" column from the index, based on the proximity of successive values, you could use:

thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)

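Using the "t (ns)" column instead (the threshold must be larger than the biggest gap inside a cluster, about 177 ns in the sample data, and smaller than the gaps between clusters):

# maximum gap in ns between neighbours of the same cluster;
# a value of 1 would put every row of the sample data in its own cluster
thresh = 1000
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)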

Output:

            t (ns)  cluster
71    1.524957e+09        1
72    1.524957e+09        1
699   1.462485e+10        2
700   1.462485e+10        2
701   1.462485e+10        2
1161  2.517286e+10        3
1160  2.517286e+10        3
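
For anyone who wants to run this end to end, here is a minimal self-contained sketch built on the sample values from the question. The helper name `label_clusters` is hypothetical, the `thresh` of 1000 ns is an assumption chosen to fit the sample data, and the sort step is my addition, since `diff()` only measures gaps between neighbours if the values are ordered. It also assumes the index is unique.

```python
import pandas as pd


def label_clusters(times: pd.Series, thresh: float) -> pd.Series:
    """Label values 1, 2, 3, ...; a gap larger than `thresh` starts a new cluster."""
    ordered = times.sort_values()
    # True wherever the gap to the previous (sorted) value exceeds thresh.
    new_cluster = ordered.diff().gt(thresh)
    # The running count of such gaps is a 0-based cluster id; add 1 to start at 1.
    labels = new_cluster.cumsum().add(1)
    # Restore the original row order (assumes a unique index).
    return labels.reindex(times.index)


df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]},
    index=[71, 72, 699, 700, 701, 1161, 1160],
)
df["cluster"] = label_clusters(df["t (ns)"], thresh=1000)  # assumed threshold

# Once the label column exists, groupby is straightforward.
summary = df.groupby("cluster")["t (ns)"].agg(["count", "min", "max"])
print(summary)
```

Because everything is vectorized, this should scale to large frames without an explicit Python loop.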

Answer 2

You can 'round' the t (ns) column by floor-dividing its values by a threshold and then checking where the result changes:

df[['t (ns)']].assign(
    # 10E7 is 1e8, so rows whose times fall in the same 1e8 ns bin share a quotient
    cluster=(df['t (ns)'] // 10E7)
    .diff().gt(0).cumsum().add(1)
)
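
One caveat with fixed-width bins: values that are very close can still land on opposite sides of a bin edge, and distant values can share a bin. A small sketch with made-up numbers (the values, the `width` of 100, and the gap threshold of 50 are purely illustrative assumptions) compares floor division with gap-based labelling:

```python
import pandas as pd

# Hypothetical values: 99.0 and 101.0 are only 2 apart but straddle the bin edge at 100.
t = pd.Series([10.0, 12.0, 99.0, 101.0, 250.0])
width = 100  # assumed bin width, playing the role of the divisor in the snippet above

# Floor division puts 99.0 in the same bin as 10.0 and 12.0, and 101.0 in the next one.
binned = (t // width).diff().gt(0).cumsum().add(1)

# Gap-based labelling keeps 99.0 and 101.0 together and separates them from 12.0.
gapped = t.diff().gt(50).cumsum().add(1)

print(pd.DataFrame({"t": t, "binned": binned, "gapped": gapped}))
```

So floor division is fine when clusters are far apart relative to the bin width, as in the question's sample, but the labels depend on where the bin edges happen to fall.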

Or you can experiment with the number of clusters into which you try to organize your data:

bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(df['t (ns)'], bins=bins)
    .cat.rename_categories(range(1, bins + 1))
)
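
As a runnable variant of the binning idea, here is a sketch on the question's sample values. It passes `labels=` to `pd.cut` instead of renaming the categories afterwards; the variable name `n_bins` is my own. Note that `pd.cut` makes equal-width bins, so it only reproduces the clusters here because the three sample clusters happen to be spread evenly over the range. With an unknown, large number of clusters this is probably not what you want.

```python
import pandas as pd

df = pd.DataFrame(
    {"t (ns)": [1524957248.4375, 1524957265.625, 14624846476.5625,
                14624846653.125, 14624846661.287, 25172864926.5625,
                25172864935.9375]},
    index=[71, 72, 699, 700, 701, 1161, 1160],
)

n_bins = 3  # assumed; has to be chosen by hand
# labels= names the equal-width intervals 1..n_bins directly.
df["bins"] = pd.cut(df["t (ns)"], bins=n_bins, labels=list(range(1, n_bins + 1)))
print(df)
```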
