
What does it mean to add Gaussian noise = 0.05 in scikit-learn make_circles()? How will it affect the data?

I am working on hyperparameter tuning of neural networks and going through examples. I came across this code in one example:

train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)

I understand that adding noise has a regularization effect on the data. The documentation says that it adds Gaussian noise. However, I could not understand what it means to add 0.05 noise to the data in the code above. How does this affect the data mathematically?

I tried the code below. I could see the values changing, but I could not figure out how. For example, how did the row 1 values of x in array 1 change into the corresponding row of array 2 (x_1 here) by adding noise=.05?

import numpy as np
import sklearn.datasets

np.random.seed(0)
x,y = sklearn.datasets.make_circles()
print(x[:5,:])

x_1,y_1 = sklearn.datasets.make_circles(noise= .05)
print(x_1[:5,:])

Output:

[[-9.92114701e-01 -1.25333234e-01]
 [-1.49905052e-01 -7.85829801e-01]
 [ 9.68583161e-01  2.48689887e-01]
 [ 6.47213595e-01  4.70228202e-01]
 [-8.00000000e-01 -2.57299624e-16]]

[[-0.66187208  0.75151712]
 [-0.86331995 -0.56582111]
 [-0.19574479  0.7798686 ]
 [ 0.40634757 -0.78263011]
 [-0.7433193   0.26658851]]

According to the documentation:

sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.

noise : double or None (default=None) Standard deviation of Gaussian noise added to the data.

Calling make_circles(noise=0.05) generates the points of the two circles and then adds a small random offset to each coordinate. The offsets follow a Gaussian distribution, also known as a normal distribution. A Gaussian distribution is described by a mean and a standard deviation. In this case the mean is 0 and noise=0.05 sets the standard deviation to 0.05.

Let's invoke the function, check its output, and see the effect of changing the noise parameter. I'll borrow liberally from this nice tutorial on generating scikit-learn dummy data.

Let's first call make_circles() with noise=0.0 and take a look at the data. I'll use a Pandas dataframe so we can see the data in a tabular way.
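As a rough check of that reading, the sketch below generates the same points with and without noise and looks at the difference. It assumes shuffle=False and a fixed random_state (both chosen here for illustration) so that row i of one array corresponds to row i of the other; with the default shuffle=True the rows of two separate calls do not line up, which is why the two printouts in the question cannot be compared row by row.

```python
import numpy as np
from sklearn.datasets import make_circles

# shuffle=False keeps the rows in the same order in both calls, so row i of
# the noisy array is row i of the clean array plus a random offset.
clean, _ = make_circles(n_samples=10000, shuffle=False, noise=None, random_state=0)
noisy, _ = make_circles(n_samples=10000, shuffle=False, noise=0.05, random_state=0)

offsets = noisy - clean        # the Gaussian offsets that were added
offset_mean = offsets.mean()   # expected to be close to 0
offset_std = offsets.std()     # expected to be close to 0.05

print(offsets[:3])
print("mean of offsets:", round(offset_mean, 4))
print("std of offsets: ", round(offset_std, 4))
```

Each coordinate gets its own independent offset, so a point typically moves by a few hundredths in x and in y, and only rarely by more than about 0.15 (three standard deviations).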

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd

n_samples = 100
noise = 0.00

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
#           x         y  label
# 0 -0.050232  0.798421      1
# 1  0.968583  0.248690      0
# 2 -0.809017  0.587785      0
# 3 -0.535827  0.844328      0
# 4  0.425779 -0.904827      0

You can see that make_circles returns data instances where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.

# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()


So it looks like it's creating two concentric circles, each with a different label.

Let's increase the noise to noise=0.05 and see the result:

n_samples = 100
noise = 0.05  # <--- The only change

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))

grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

It looks like noise is added to each of the x and y coordinates so that each point shifts around a little. The code for make_circles() shows that the implementation does exactly that:

def make_circles( ..., noise=None, ...):

    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
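The same step can be imitated by hand. The sketch below is not scikit-learn's code: it starts from noise-free points and adds offsets from its own generator (the name rng and the seed 42 are illustrative assumptions), then compares each point's distance from the origin before and after.

```python
import numpy as np
from sklearn.datasets import make_circles

sigma = 0.05
rng = np.random.RandomState(42)  # illustrative seed, not taken from scikit-learn

# Noise-free points; label 0 is the outer circle, label 1 the inner one.
X, y = make_circles(n_samples=2000, noise=None, random_state=0)

# Hand-written version of the noise step: one independent draw per coordinate.
X_noisy = X + rng.normal(loc=0.0, scale=sigma, size=X.shape)

# Distance of every point from the origin, before and after adding noise.
r_clean = np.linalg.norm(X, axis=1)
r_noisy = np.linalg.norm(X_noisy, axis=1)

for label in (0, 1):
    mask = y == label
    print(label, r_clean[mask].mean(), r_noisy[mask].mean(), r_noisy[mask].std())
```

Without noise every point sits exactly on its circle (radius 1.0 or 0.8 with the default factor). With noise=0.05 the radii scatter around those values with a spread of roughly 0.05, which is the thickening of the rings seen in the plot.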

So now we've seen two visualizations of the dataset with two values of noise. But two visualizations isn't cool. You know what's cool? Five visualizations, starting from no noise and then increasing the noise progressively by 10x. Here's a function that does it:

def make_circles_plot(n_samples, noise):

    assert n_samples > 0
    assert noise >= 0

    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)

    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}

    fig, ax = plt.subplots(figsize=(5, 5))

    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()

Calling the above function with different values of noise clearly shows that increasing this value makes the points move around more, i.e. it makes them more "noisy", exactly as we would expect intuitively.

for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)
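To put a number on "more noisy", here is a rough sketch that tries to recover each point's label from its radius alone. The helper radius_rule_accuracy and the threshold 0.9 (halfway between the two radii 0.8 and 1.0) are assumptions made for this illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_circles

def radius_rule_accuracy(noise, n_samples=2000, threshold=0.9, seed=0):
    """Fraction of points whose circle is recovered from the radius alone."""
    X, y = make_circles(n_samples=n_samples, noise=noise, random_state=seed)
    radius = np.linalg.norm(X, axis=1)
    predicted = (radius < threshold).astype(int)  # 1 = inner circle, 0 = outer
    return float((predicted == y).mean())

accuracy = {noise: radius_rule_accuracy(noise) for noise in [0.0, 0.01, 0.1, 1.0, 10.0]}
for noise, acc in accuracy.items():
    print(f"noise={noise:<5} accuracy={acc:.3f}")
```

With little or no noise the two circles are perfectly separable by radius. Once the standard deviation is comparable to the 0.2 gap between the circles they start to overlap, and for very large noise the rule does no better than guessing, because the original circle structure is swamped.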

