What does adding Gaussian noise = 0.05 mean in scikit-learn's make_circles()? How does it affect the data?

Question

I am studying hyperparameter tuning for neural networks and working through examples. In one of them I came across this code:

train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=.05)

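I understand that adding noise has a regularizing effect on the data, and the documentation says that the function adds Gaussian noise. However, I cannot work out what adding a noise of 0.05 to the data means in the code above. How does it affect the data mathematically?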

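I tried the code below. I can see that the values change, but I cannot figure out how, for example, the first row of x in array 1 is changed by noise=.05 to give the corresponding row in array 2 (i.e. x_1).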

import numpy as np
import sklearn.datasets

np.random.seed(0)
x, y = sklearn.datasets.make_circles()
print(x[:5, :])

x_1, y_1 = sklearn.datasets.make_circles(noise=.05)
print(x_1[:5, :])

Output:

[[-9.92114701e-01 -1.25333234e-01]
 [-1.49905052e-01 -7.85829801e-01]
 [ 9.68583161e-01  2.48689887e-01]
 [ 6.47213595e-01  4.70228202e-01]
 [-8.00000000e-01 -2.57299624e-16]]

[[-0.66187208  0.75151712]
 [-0.86331995 -0.56582111]
 [-0.19574479  0.7798686 ]
 [ 0.40634757 -0.78263011]
 [-0.7433193   0.26658851]]

Answer 1

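According to the documentation:

sklearn.datasets.make_circles(n_samples=100, *, shuffle=True, noise=None, random_state=None, factor=0.8)
Make a large circle containing a smaller circle in 2d. A simple toy dataset to visualize clustering and classification algorithms.

noise: double or None (default=None)
Standard deviation of Gaussian noise added to the data.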

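The call make_circles(noise=0.05) means that the points on the circles are perturbed by random values drawn from a Gaussian distribution (also known as the normal distribution). As you probably already know, a Gaussian distribution is described by a mean and a standard deviation. Here the noise has a mean of 0, and noise=0.05 sets its standard deviation to 0.05.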
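
To make this concrete, here is a minimal sketch (not part of the original example) that recovers the noise by subtracting a clean dataset from a noisy one. Passing shuffle=False and the same random_state to both calls is my own choice for the demo, so that row i of the noisy array corresponds to row i of the clean one. The two arrays printed in the question cannot be compared row by row, because each call shuffles the points into a different order.

```python
import numpy as np
from sklearn.datasets import make_circles

# shuffle=False keeps the points in the same order in both calls,
# so the arrays can be compared row by row (an assumption made for this demo).
X_clean, y_clean = make_circles(n_samples=1000, shuffle=False, noise=None, random_state=0)
X_noisy, y_noisy = make_circles(n_samples=1000, shuffle=False, noise=0.05, random_state=0)

# The difference is the Gaussian offset that was added to each coordinate.
eps = X_noisy - X_clean

print(eps[:3])                 # per-coordinate offsets of the first three points
print(eps.mean(), eps.std())   # mean should be near 0, std near 0.05
```

In other words, each noisy coordinate is the clean coordinate plus an independent draw from a normal distribution with mean 0 and standard deviation 0.05.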

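Let's call this function, inspect its output, and see what effect changing the noise parameter has. I will borrow heavily from this good tutorial on generating dummy data with scikit-learn.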

Let's first call make_circles() with noise=0.0 and look at the data. I will use a Pandas DataFrame so that we can view the data in tabular form.

from sklearn.datasets import make_circles
import matplotlib.pyplot as plt
import pandas as pd

n_samples = 100
noise = 0.00

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
print(df.head())
#           x         y  label
# 0 -0.050232  0.798421      1
# 1  0.968583  0.248690      0
# 2 -0.809017  0.587785      0
# 3 -0.535827  0.844328      0
# 4  0.425779 -0.904827      0

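You can see that make_circles returns data instances, where each instance is a point with two features, x and y, and a label. Let's plot them to see what they actually look like.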

# Collect the points together by label, either 0 or 1
grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

So it looks like it is creating two concentric circles, each with a different label.
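
As a quick numeric check of what the plot suggests, the sketch below (my addition, using random_state=0 only for reproducibility) computes each point's distance from the origin. With no noise, label 0 should sit on a circle of radius close to 1.0 and label 1 on a circle of radius close to 0.8, which is the default value of factor.

```python
import numpy as np
from sklearn.datasets import make_circles

features, labels = make_circles(n_samples=100, noise=0.0, random_state=0)

# Distance of every point from the origin.
radii = np.hypot(features[:, 0], features[:, 1])

# Print the smallest and largest radius found for each label.
for label in (0, 1):
    r = radii[labels == label]
    print(label, r.min(), r.max())
```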

Let's increase the noise to noise=0.05 and look at the result:

n_samples = 100
noise = 0.05  # <--- The only change

features, labels = make_circles(n_samples=n_samples, noise=noise)
df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))

grouped = df.groupby('label')

colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots(figsize=(7,7))
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
plt.title('Points')
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.grid()
plt.show()

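It looks like noise is added to each x and y coordinate, so that every point moves a little. When we inspect the code of make_circles(), we see that the implementation does exactly that: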

def make_circles( ..., noise=None, ...):

    ...
    if noise is not None:
        X += generator.normal(scale=noise, size=X.shape)
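
The excerpt can be imitated with plain NumPy. The sketch below is a simplified stand-in, not scikit-learn's actual code: the eight-point circle, the seed and the variable names are my own assumptions. It shows that every coordinate receives its own independent draw from N(0, noise**2).

```python
import numpy as np

rng = np.random.RandomState(0)  # stand-in for the `generator` in the excerpt
noise = 0.05

# Eight evenly spaced points on the unit circle (a simplified stand-in for X).
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
X = np.column_stack([np.cos(angles), np.sin(angles)])

# One independent Gaussian draw per coordinate, same shape as X.
offsets = rng.normal(scale=noise, size=X.shape)
X_noisy = X + offsets

# How far each point moved from its original position.
print(np.linalg.norm(X_noisy - X, axis=1))
```

Because both coordinates get independent noise with the same standard deviation, the distance a point moves follows a Rayleigh distribution, whose mean is 0.05 * sqrt(pi / 2), roughly 0.063 for noise=0.05.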

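So now we have seen two visualizations of the dataset, one for each of two noise values. But two visualizations aren't cool. You know what's cool? Five visualizations, with the noise increasing by a factor of 10 each time. Here is a function that does this: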

def make_circles_plot(n_samples, noise):

    assert n_samples > 0
    assert noise >= 0

    # Use make_circles() to generate random data points with noise.
    features, labels = make_circles(n_samples=n_samples, noise=noise)

    # Create a dataframe for later plotting.
    df = pd.DataFrame(dict(x=features[:,0], y=features[:,1], label=labels))
    grouped = df.groupby('label')
    colors = {0:'red', 1:'blue'}

    fig, ax = plt.subplots(figsize=(5, 5))

    for key, group in grouped:
        group.plot(ax=ax, kind='scatter', x='x', y='y', marker='.', label=key, color=colors[key])
    plt.title('Points with noise=%f' % noise)
    plt.xlim(-2, 2)
    plt.ylim(-2, 2)
    plt.grid()
    plt.tight_layout()
    plt.show()

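Calling the function above with different noise values clearly shows that increasing the value moves the points further, i.e. it makes them more "noisy", just as we would intuitively expect.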

for noise in [0.0, 0.01, 0.1, 1.0, 10.0]:
    make_circles_plot(500, noise)
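
To put a rough number on what the plots show, the sketch below estimates how often the noise pushes a point across to the other ring. The helper ring_mix_rate and the cut-off radius of 0.9 (the midpoint between the default inner radius 0.8 and the outer radius 1.0) are hypothetical choices of mine, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_circles

def ring_mix_rate(noise, n_samples=2000, seed=0):
    """Fraction of points that land on the wrong side of radius 0.9.

    Hypothetical helper: 0.9 is the midpoint between the default
    inner radius (0.8) and the outer radius (1.0).
    """
    X, y = make_circles(n_samples=n_samples, noise=noise, random_state=seed)
    radii = np.hypot(X[:, 0], X[:, 1])
    guessed_inner = radii < 0.9   # label 1 is the inner circle
    return np.mean(guessed_inner != (y == 1))

for noise in [0.0, 0.01, 0.05, 0.1, 1.0]:
    print(noise, ring_mix_rate(noise))
```

With noise=0.05 the gap of 0.1 to the cut-off is about two standard deviations, so only a small fraction of points should cross over. At noise=1.0 the radius carries almost no information about the label, so the rate should approach one half.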

Solution 1
4 2020-09-17 03:05:38
