Using Python to generate clusters of data?

I'm working on a Python function that should generate data drawn from a Gaussian distribution, but I'm stuck.

import numpy.random as rnd
import numpy as np

def genData(co1, co2, M):
  X = rnd.randn(2, 2M + 1)
  t = rnd.randn(1, 2M + 1)
  numpy.concatenate(X, co1)
  numpy.concatenate(X, co2)
  return(X, t)

I want two clusters of size M: cluster 1 centered at co1 and cluster 2 centered at co2. X should hold the data points I'm going to plot, and t the target values (1 for cluster 1, 2 for cluster 2), so that I can color the points by cluster.

In that case, t is an array of 2M values, each 1 or 2, and X is size 2M * 1, where t[i] is 1 if X[i] is in cluster 1 and 2 if X[i] is in cluster 2.

I figured the best way to start is to generate the array using NumPy's random module. What I'm confused about is how to center the points on each cluster.


Would the best way be to generate a cluster of size M and then add co1 to each of its points? How would I make it random, though, and make sure t[i] gets the right color?

I'm using this function to graph the data:

def graphData():
    co1 = (0.5, -0.5)
    co2 = (-0.5, 0.5)
    M = 1000
    X, t = genData(co1, co2, M)
    colors = np.array(['r', 'b'])
    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], color = colors[t], s = 10)

For your purpose, I would use scikit-learn's sample generator make_blobs:

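import matplotlib.pyplot as plt
# The samples_generator submodule is gone in recent scikit-learn releases;
# make_blobs is imported from sklearn.datasets directly.
from sklearn.datasets import make_blobs

# One center and one standard deviation per cluster
centers = [(-5, -5), (5, 5)]
cluster_std = [0.8, 1]

X, y = make_blobs(n_samples=100, cluster_std=cluster_std, centers=centers, n_features=2, random_state=1)

# y holds the cluster index (0 or 1) of each row of X
plt.scatter(X[y == 0, 0], X[y == 0, 1], color="red", s=10, label="Cluster1")
plt.scatter(X[y == 1, 0], X[y == 1, 1], color="blue", s=10, label="Cluster2")
plt.legend()
plt.show()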

You can generate multi-dimensional clusters this way. X holds the data points, and y gives the cluster that each corresponding point in X belongs to.


This might be more than you need in this case, but in general I think it's better to rely on general, well-tested library code that can be reused in other cases as well.
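As a rough sketch of how make_blobs could be applied to the setup in the question, the snippet below uses the centers co1 and co2 and M points per cluster. The spread cluster_std=0.2 and the seed are assumptions chosen for illustration; the question does not specify them. Note that make_blobs labels the clusters 0 and 1 (not 1 and 2), which happens to be exactly what is needed to index a two-element color array.

```python
import numpy as np
from sklearn.datasets import make_blobs

co1 = (0.5, -0.5)
co2 = (-0.5, 0.5)
M = 1000

# n_samples given as a list requests M points for each center.
# cluster_std=0.2 is an assumed spread, not a value from the question.
X, t = make_blobs(n_samples=[M, M], centers=[co1, co2],
                  cluster_std=0.2, random_state=0)

# Labels are 0 and 1, so they can index the color array directly.
colors = np.array(["r", "b"])
point_colors = colors[t]
```

The result can then be plotted with plt.scatter(X[:, 0], X[:, 1], color=point_colors, s=10), as in graphData.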

You can use something like the following code:

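import numpy as np
import matplotlib.pyplot as plt

center1 = (50, 60)
center2 = (80, 20)
distance = 20

# Cluster 1: x is uniform on [center, center + distance),
# y is normal with mean center and standard deviation distance
x1 = np.random.uniform(center1[0], center1[0] + distance, size=(100,))
y1 = np.random.normal(center1[1], distance, size=(100,))

# Cluster 2: same scheme around center2
x2 = np.random.uniform(center2[0], center2[0] + distance, size=(100,))
y2 = np.random.normal(center2[1], distance, size=(100,))

plt.scatter(x1, y1)
plt.scatter(x2, y2)
plt.show()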
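For comparison, the approach proposed in the question (draw M standard normal points and add the cluster center to each of them) can be sketched with NumPy alone. This is only a minimal sketch: the name gen_data and the std and seed parameters are hypothetical additions, and the default spread of 0.2 is an assumption. The points are shuffled together with their labels, so t[i] always matches X[i].

```python
import numpy as np

def gen_data(co1, co2, M, std=0.2, seed=None):
    """Return 2M shuffled 2-D points and their labels (1 or 2)."""
    rng = np.random.default_rng(seed)
    # Standard normal draws, scaled by std and shifted to each center.
    X1 = rng.standard_normal((M, 2)) * std + np.asarray(co1)
    X2 = rng.standard_normal((M, 2)) * std + np.asarray(co2)
    X = np.concatenate([X1, X2], axis=0)
    # Labels: 1 for the first cluster, 2 for the second.
    t = np.concatenate([np.full(M, 1), np.full(M, 2)])
    # Apply one permutation to both arrays so labels stay aligned.
    perm = rng.permutation(2 * M)
    return X[perm], t[perm]

X, t = gen_data((0.5, -0.5), (-0.5, 0.5), M=1000, seed=0)

# Labels are 1 and 2, so subtract 1 before indexing the color array.
colors = np.array(["r", "b"])
point_colors = colors[t - 1]
```

With labels 1 and 2, indexing a two-element color array as colors[t] would go out of range for cluster 2, which is why the sketch uses colors[t - 1].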

