
PyTorch (GPU) slower than CPU slower than Keras

I'm just getting started with PyTorch and I wanted to run through a few toy problems. In the following case, I'm noticing a significant difference in how much time it takes for the model to train once over and issue one batch of predictions.

This is the PyTorch implementation. On the GPU, it takes ~17 seconds on my machine. The same model on the CPU takes ~11 seconds.

import torch

class LR(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = torch.nn.Linear(2, 20)
        self.linear2 = torch.nn.Linear(20, 1)

    def forward(self, x):
        x = torch.nn.functional.relu(self.linear1(x))
        x = torch.nn.functional.relu(self.linear2(x))
        return x


def fit_torch(df_train, df_test):
    sampler_tr = torch.utils.data.SubsetRandomSampler(df_train.index)
    train = torch.utils.data.DataLoader(
        torch.tensor(df_train.values, dtype=torch.float),
        batch_size=batch_size, sampler=sampler_tr)

    sampler_te = torch.utils.data.SubsetRandomSampler(df_test.index)
    test = torch.utils.data.DataLoader(
        torch.tensor(df_test.values, dtype=torch.float),
        batch_size=batch_size, sampler=sampler_te)

    model = LR()
    model = model.to(device)

    loss = torch.nn.MSELoss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for _ in range(1000):
        for train_data in train:
            train_data = train_data.to(device)

            x_train = train_data[:, :2]
            y_train = train_data[:, 2]

            optim.zero_grad()

            pred = model(x_train)
            loss_val = loss(pred.squeeze(), y_train)

            loss_val.backward()
            optim.step()

    model.eval()
    with torch.no_grad():
        for test_data in test:
            test_data = test_data.to(device)

            pred = model(test_data[:, :2].float())
            break


This is the Keras implementation. It takes approximately 9 seconds to run.

import tensorflow as tf

def fit_tf(df_train, df_test):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(20, activation='relu'))
    model.add(tf.keras.layers.Dense(1, activation='relu'))

    model.compile(loss='mse', optimizer='adam')
    model.fit(
        df_train.values[:, :2],
        df_train.values[:, 2],
        batch_size=batch_size, epochs=1000, verbose=0)

    model.predict(df_test.iloc[:batch_size].values[:, :2])

The dataset and main functions.

import time

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

device = torch.device('cuda:0')
scaler = MinMaxScaler()

batch_size = 64

def create_dataset():
    dataset = []    
    random_x = np.random.randint(10, 1000, 1000)
    random_y = np.random.randint(10, 1000, 1000)

    for x, y in zip(random_x, random_y):
        dataset.append((x, y, 4 * x + 3 * y + 10))

    np.random.shuffle(dataset)
    df = pd.DataFrame(dataset)
    df = pd.DataFrame(scaler.fit_transform(df))

    return df

def __main__():
    df = create_dataset()
    df_train, df_test = train_test_split(df)

    start_time = time.time()
    fit_tf(df_train.reset_index(drop=True), df_test.reset_index(drop=True))
    print(time.time() - start_time)

PyTorch uses a dynamic computational graph by default, which is more flexible when you start to develop a neural network, since it gives more straightforward debug messages. TensorFlow, in contrast, builds a static computational graph, which is why you need to compile the model before using it. The compiler can optimize your model, but the tradeoff is that the neural network becomes difficult to debug. This may cause a minor difference in performance between the two frameworks, but it should not be a big deal.
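
As a rough illustration of that dynamic/static distinction (a sketch, not part of the original answer): in eager PyTorch every forward call executes operations immediately, while torch.jit.script can optionally compile a module into a static TorchScript graph that is built once and reused. The snippet below assumes the LR module defined in the question.

import torch

model = LR()
x = torch.randn(64, 2)

# Eager (default): ops execute immediately and the autograd graph is built on the fly.
eager_out = model(x)

# Scripted: torch.jit.script compiles the module into a static graph once;
# later calls reuse that compiled graph.
scripted = torch.jit.script(model)
scripted_out = scripted(x)

print(torch.allclose(eager_out, scripted_out))  # same numerics, different execution path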

Since your network is pretty small, the overhead of copying the model and each batch between CPU and GPU memory, and of initializing the CUDA subsystem, exceeds the benefit brought by the GPU. If you try a more complex neural network such as AlexNet, ResNet or even GoogLeNet, the benefit will be much more obvious.
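
One rough way to see how much of the gap is per-batch host-to-device copying (a sketch, not from the original answer; it reuses df_train, batch_size, device and the LR model from the question): move the whole training set to the GPU once, outside the loops, instead of calling .to(device) on every batch of every epoch.

def fit_torch_device_resident(df_train):
    # Copy the (small) dataset to the GPU a single time, up front.
    data = torch.tensor(df_train.values, dtype=torch.float, device=device)
    x_all, y_all = data[:, :2], data[:, 2]

    model = LR().to(device)
    loss = torch.nn.MSELoss()
    optim = torch.optim.Adam(model.parameters(), lr=0.001)

    model.train()
    for _ in range(1000):
        # Shuffle indices on the GPU so no host<->device transfers happen inside the loop.
        perm = torch.randperm(x_all.size(0), device=device)
        for start in range(0, x_all.size(0), batch_size):
            idx = perm[start:start + batch_size]
            optim.zero_grad()
            pred = model(x_all[idx]).squeeze()
            loss_val = loss(pred, y_all[idx])
            loss_val.backward()
            optim.step()

If the timing barely changes after this, the remaining cost is mostly per-kernel launch overhead, which a two-layer model this small cannot amortize, consistent with the point above.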
