
Filtering non-'cohorts' from dataset

I am sure this topic has been researched before, but I am not sure what it is called or which techniques I should look into, hence why I am here. I am running this mainly in Python and Pandas, but it is not limited to those languages/technologies.

As an example, let's pretend I have this dataset:

| PID | A    | B    | C    |
| --- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 |
| 400 | 0.97 | 0.61 | 0.30 |
| 251 | 0.01 | 0.97 | 0.29 |
| 414 | 0.25 | 0.04 | 0.83 |
| 706 | 0.37 | 0.32 | 0.33 |
| 65  | 0.78 | 0.62 | 0.25 |
| 533 | 0.24 | 0.15 | 0.88 |

PID is a unique ID for that row. A, B and C are some factors (normalized for this example). This dataset could be players in a sports league over its history, it could be products in an inventory, or it could be voter data. The specific context isn't important.

Now let's say I have some input data:

| A    | B    | C    |
| ---- | ---- | ---- |
| 0.81 | 0.75 | 0.17 |

This input shares the same factors as the original dataset (A, B, C). What I want to do is find the rows that are similar to my input data (the "cohorts"). What is the best way to approach this?

I thought of clustering, using a kNN algorithm, but the problem is that the number of cohorts is not fixed. You could have a unique input and few/no "cohorts", or you could have a very common input and hundreds of "cohorts".

The solution I tried next was Euclidean distance. For this dataset and input I would do something like:

import pandas as pd

my_cols = ['A', 'B', 'C']

inputdata = pd.Series([0.81, 0.75, 0.17], index=my_cols)

# df = pandas DataFrame with the dataset above (columns PID, A, B, C)

# Euclidean distance from each row's factors to the input vector
df['Dist'] = (df[my_cols] - inputdata).pow(2).sum(axis=1).pow(0.5)

This would create a new column on the dataset like:

| PID | A    | B    | C    | Dist |
| --- | ---- | ---- | ---- | ---- |
| 508 | 0.85 | 0.51 | 0.05 | 0.27 |
| 400 | 0.97 | 0.61 | 0.30 | 0.25 |
| 251 | 0.01 | 0.97 | 0.29 | 0.84 |
| 414 | 0.25 | 0.04 | 0.83 | 1.12 |
| 706 | 0.37 | 0.32 | 0.33 | 0.63 |
| 65  | 0.78 | 0.62 | 0.25 | 0.16 |
| 533 | 0.24 | 0.15 | 0.88 | 1.09 |

You can then keep only the rows whose distance falls below some threshold:

cohorts = df[df['Dist'] <= THRESHOLD]

The issues then become: (1) How do you determine the best threshold? and (2) If I add a 4th factor ("D") into the dataset and the Euclidean calculation, it seems to "break" the results, in that the cohorts no longer make intuitive sense when I look at them.

So my question is: what are techniques or better ways to filter/select "cohorts" (those rows similar to an input row)?

Thank you

Here is an algorithm I came up with myself through logical thinking and some basic statistics. It uses the mean of the values and the mean of your input data to find the closest matches based on the standard deviation, using pd.merge_asof:

import pandas as pd

factors = ['A', 'B', 'C']

# df = the original dataset; input_data = a one-row DataFrame with columns A, B, C

# add the row-wise mean of the factors and sort by it (merge_asof requires sorted keys)
df = df.assign(avg=df[factors].mean(axis=1)).sort_values('avg')
input_data = input_data.assign(avg=input_data[factors].mean(axis=1)).sort_values('avg')

# match each row to the input row with the nearest mean,
# but only if it lies within one standard deviation of the data's means
dfn = pd.merge_asof(
    df,
    input_data,
    on='avg',
    direction='nearest',
    tolerance=df['avg'].std()
)
dfn
   PID   A_x   B_x   C_x       avg   A_y   B_y   C_y
0  706  0.37  0.32  0.33  0.340000   NaN   NaN   NaN
1  414  0.25  0.04  0.83  0.373333   NaN   NaN   NaN
2  251  0.01  0.97  0.29  0.423333   NaN   NaN   NaN
3  533  0.24  0.15  0.88  0.423333   NaN   NaN   NaN
4  508  0.85  0.51  0.05  0.470000   NaN   NaN   NaN
5   65  0.78  0.62  0.25  0.550000  0.81  0.75  0.17
6  400  0.97  0.61  0.30  0.626667  0.81  0.75  0.17

You're facing a clustering problem, so your K-means intuition was right.

clustering

But, as you mentioned, K-means is a parametric approach, so you need to determine the right K. There is an automated way of finding the best K with respect to cluster quality (shape, stability, homogeneity), named the elbow method: https://www.scikit-yb.org/en/latest/api/cluster/elbow.html
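
A minimal sketch of the elbow idea using plain scikit-learn (the KElbowVisualizer from the linked page automates and plots this); `df` is assumed to be the question's dataframe and the K range is illustrative:

from sklearn.cluster import KMeans

# feature matrix from the question's dataframe (PID assumed to be the index)
X = df[['A', 'B', 'C']].to_numpy()

# inertia (within-cluster sum of squares) for each candidate K;
# the "elbow" is the K after which inertia stops dropping sharply
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 7)}
print(inertias)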

Then, you can use another clustering approach (in fact the right clustering algorithm depends on the meaning of your features); for example, you can use a density-based approach with DBSCAN (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).
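
A minimal DBSCAN sketch under the same assumptions; eps and min_samples are illustrative values that would need tuning:

from sklearn.cluster import DBSCAN

X = df[['A', 'B', 'C']].to_numpy()
labels = DBSCAN(eps=0.3, min_samples=2).fit_predict(X)
# label -1 marks "noise" rows that fall in no dense region, i.e. potential non-cohorts
print(labels)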

Thus, you'll need to identify the best clustering algorithm for your problem: https://machinelearningmastery.com/clustering-algorithms-with-python/

With this solution you'll fit your clustering algorithm on your training set (the one you call the "cohort" set), and then use the model to predict the cluster of your "non-cohort" samples.
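
For a centroid-based model such as K-Means, that fit-then-predict workflow might look like the sketch below; the cluster count 3 is a placeholder to be replaced by the elbow result:

from sklearn.cluster import KMeans

X = df[['A', 'B', 'C']].to_numpy()
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# assign the new input row to one of the fitted clusters
cluster_id = km.predict([[0.81, 0.75, 0.17]])[0]

# the "cohorts" are the existing rows sharing that cluster
cohorts = df[km.labels_ == cluster_id]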

statistical cohorts

In some fields, like marketing, you'll also find methods for creating clusters (cohorts) based on numerical attributes and descriptive statistics.

The best example is the RFM segmentation method, which is a really smart way of doing clustering while keeping the resulting clusters highly intelligible: https://towardsdatascience.com/know-your-customers-with-rfm-9f88f09433bc

Using this approach you'll build your features on your entire set of data, and then derive the resulting segments from the feature values.

My understanding is that you want the distance for each column to be accounted for independently, but collected together in the final result.

To get that independent accounting, you can measure how different the members of a column are using its standard deviation σ (whimsical set of explanations).

To collect together the final result, you can filter your dataframe iteratively, removing rows that fall outside the wanted range. This also successively reduces the processing time, though the savings will be negligible unless you have a great deal of data.

If adding your fourth column causes no data to be sufficiently close, this could indicate:

  • your test data is really not close to any of the source data and is a unique entry
  • your data is not normally distributed (if more data is available, you can test this with scipy.stats.normaltest; see the sketch just after this list)
  • your columns are not independent (i.e. they need more specialized statistical handling)
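
A minimal normality check with scipy.stats.normaltest, assuming enough rows are available (the test needs at least 8 samples per column):

from scipy import stats

for col in ['A', 'B', 'C']:
    stat, p = stats.normaltest(df[col])
    # a small p-value (e.g. < 0.05) suggests the column is not normally distributed
    print(col, round(p, 4))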

If the second or third is the case, you should not use the normal standard deviation, but one from another distribution (list and more tests).

However, if your data is seemingly random, you can apply some factor and/or power (i.e. the variance) of the standard deviation in each column to get more or less accurate results.


initial dataframes

starting data (df)

PID       A     B     C
508.0  0.85  0.51  0.05
400.0  0.97  0.61   0.3
251.0  0.01  0.97  0.29
414.0  0.25  0.04  0.83
706.0  0.37  0.32  0.33
65.0   0.78  0.62  0.25
533.0  0.24  0.15  0.88

test data (test_data)

      A     B     C
0  0.81  0.75  0.17

df.std()

find the standard deviation of each column and collect it into a new dataframe

then assemble another dataframe with this

stdv = df.std()

PID
A    0.367145
B    0.316965
C    0.312219

# bounds one standard deviation below and above the test data, per column
test_df = pd.DataFrame()
test_df = test_df.append(test_data - stdv)   # lower bound (.append() is deprecated in newer pandas; pd.concat also works)
test_df = test_df.append(test_data + stdv)   # upper bound
test_df.index = ["low", "high"]

test_df

             A         B         C
low   0.442855  0.433035 -0.142219
high  1.177145  1.066965  0.482219

Results

iterate over the columns, filtering out rows outside the wanted range (pandas Series.between() can do this for you!)

# keep only rows where every column falls inside its [low, high] band
for x in df:
    df = df[df[x].between(test_df[x]["low"], test_df[x]["high"])]

resulting df

PID       A     B     C
508.0  0.85  0.51  0.05
400.0  0.97  0.61   0.3
65.0   0.78  0.62  0.25
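
If that one-standard-deviation band proves too strict or too loose, the factor/power idea mentioned earlier can be applied when building the bounds; the 1.5 below is purely illustrative:

scale = 1.5  # widen the band; use stdv.pow(2) (the variance) or a smaller factor to tighten it
test_df = pd.DataFrame([test_data.iloc[0] - scale * stdv,
                        test_data.iloc[0] + scale * stdv],
                       index=["low", "high"])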

As you do not really know the number of clusters (cohorts) or their structure, I believe the OPTICS algorithm would suit you best. It finds a group of points that are packed together (using Euclidean distance) and expands from them to build a cluster. Then it is easy to find which cluster a new point belongs (or does not belong) to. It is similar to DBSCAN, but does not assume similar density among the clusters. The sklearn library includes an implementation of OPTICS.
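
A minimal sketch with scikit-learn's OPTICS, under the same feature-matrix assumption as before (min_samples is illustrative):

from sklearn.cluster import OPTICS

X = df[['A', 'B', 'C']].to_numpy()
labels = OPTICS(min_samples=2).fit_predict(X)   # -1 marks rows assigned to no cluster
print(labels)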
