A large dataset with many weights makes TensorFlow training extremely slow

Question

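I have a biology background and am currently experimenting with and learning machine learning in order to train on a microarray dataset I have, which consists of 140 cell lines with 54871 gene expressions each. Essentially, I have 140 rows, each made up of 54871 columns, where each value is the expression level of one gene in that cell line. In other words, it is a 140 * 54871 matrix. I have labeled each of the 140 rows (cell lines) as either group 1 or group 2, so that my code can learn to discriminate between them and predict which group a new 1 * 54871 input belongs to.

I split the dataset into two parts for training and testing. Here is my problem: with 54871 weights, one per gene expression, training is very slow. Every 1000 iterations, my cost function (mean squared error) only goes from 0.3057 to 0.3047, and that takes about 2-3 minutes. Also, as the iterations go on you can see it plateauing, so it looks like it will take a very long time before the cost function even reaches 0.1. I left it running overnight: it started at 0.3103, and when I woke up the MSE was 0.3014.

Is there anything I can do to speed up training? Or am I doing something wrong? Thanks!

Here is my code; sorry if it is a bit messy: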

import pandas as pd
import tensorflow as tf
import numpy

# load the CSV data sheet of all cell lines
input_data = pd.read_csv(
    'C:/Users/lalalalalalala.csv',
    index_col=[0, 1],
    header=0,
    na_values='---')
matrix_data = input_data.values  # .as_matrix() was removed in pandas 1.0

# user define cell lines of interest for supervised training
group1 = input(
    "Please enter cell lines that makes up the your cluster of interest with spaces in between(case sensitive):")
group_split1 = group1.split(sep=" ")

# assign a label to each cell line: input cluster = 1
#                                   rest of the cell lines = 0
# extract data of input group
# split into training and test sets
# the if/else expressions below handle the split when group1 has an odd number of entries
split = len(group_split1)
g1_train = input_data.loc[:, group_split1[0:int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)]]
g1_test = input_data.loc[:,
          group_split1[(int(split / 2) if len(group_split1) % 2 == 0 else (int(split / 2) + 1)):split]]
g2 = input_data.loc[:, [x for x in list(input_data) if x not in group_split1]]
split2 = g2.shape[1]
g2_train = g2.iloc[:, 0:int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)]
g2_test = g2.iloc[:, (int(split2 / 2) if len(group_split1) % 2 == 0 else (int(split2 / 2) + 1)):split2]

# amplify (tile) the group 1 data if it is much smaller than group 2:
amp1 = int((g2_train.shape[1] - split) / int(split / 2)) if g2_train.shape[1] >= split else 1  # if g1 is less than g2 amplify
g1_train = pd.DataFrame(numpy.tile(g1_train, (1, amp1)), index=g2_train.index)
amp2 = int((g2_test.shape[1] - split) / int(split / 2)) if g2_test.shape[1] >= split else 1
g1_test = pd.DataFrame(numpy.tile(g1_test, (1, amp2)), index=g2_test.index)
# join_axes was removed from pd.concat in pandas 1.0; reindex gives the same row alignment
regroup_train = pd.concat([g1_train, g2_train], axis=1).reindex(g1_train.index)
regroup_train = numpy.transpose(regroup_train.values)

regroup_test = pd.concat([g1_test, g2_test], axis=1).reindex(g1_test.index)
regroup_test = numpy.transpose(regroup_test.values)

# create labels
split3 = g1_train.shape[1]
labels_train = numpy.zeros(shape=[len(regroup_train), 1])
labels_train[0:split3] = 1

split4 = g1_test.shape[1]
labels_test = numpy.zeros(shape=[len(regroup_test), 1])
labels_test[0:split4] = 1

# change all nan to 0
regroup_train = numpy.nan_to_num(regroup_train)
regroup_test = numpy.nan_to_num(regroup_test)
labels_train = numpy.nan_to_num(labels_train)
labels_test = numpy.nan_to_num(labels_test)

#######################################################################################################################
#####################################################NEURAL NETWORK####################################################
#######################################################################################################################

# define variables
trainingtimes = 1000

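# create model
# take the feature count from the data rather than hard-coding it
n_genes = regroup_train.shape[1]
x = tf.placeholder(tf.float32, [None, n_genes])
w = tf.Variable(tf.zeros([n_genes, 1]))
b = tf.Variable(tf.zeros([1]))
# define the model: a sigmoid applied to a linear map (logistic regression)
y = tf.nn.sigmoid((tf.matmul(x, w) + b))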

# define correct training group
ytt = tf.placeholder(tf.float32, [None, 1])

# define cost function (mean squared error)
mse = tf.reduce_mean(tf.losses.mean_squared_error(y, ytt))

# train step
train_step = tf.train.GradientDescentOptimizer(learning_rate=0.3).minimize(mse)

sess = tf.InteractiveSession()
tf.global_variables_initializer().run()

for i in range(trainingtimes):
    sess.run(train_step, feed_dict={x: regroup_train, ytt: labels_train})
    if i % 100 == 0:
        print(sess.run(mse, feed_dict={x: regroup_train, ytt: labels_train}))

Answer 1

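There are a couple of key issues here. You are trying to define a 1-layer neural network, which sounds right for this problem. But your hidden layer is much larger than it should be. Try a smaller size for your weights, with numbers like 128, 256, or 512 (they do not need to be powers of 2).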
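To make those sizes concrete, here is a rough parameter count for a network with one hidden layer at the suggested widths. It assumes the 54871 input features stated in the question and a single sigmoid output; the helper name `count_parameters` is hypothetical and introduced only for illustration.

```python
def count_parameters(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden_layer = n_inputs * n_hidden + n_hidden
    output_layer = n_hidden * n_outputs + n_outputs
    return hidden_layer + output_layer


n_genes = 54871  # feature count stated in the question
param_counts = {h: count_parameters(n_genes, h) for h in (128, 256, 512)}
for h, n in param_counts.items():
    print(f"hidden={h:4d}  parameters={n:,}")
```

Even at 128 hidden units this comes to roughly 7 million parameters for only 140 samples.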

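Also, your input dimensionality is very high. I know someone working on a very similar cancer gene expression problem, with about 60,000 gene expressions and 10,000 samples. She used PCA to reduce the dimensionality of the data while keeping ~90% of the variance (she tried different values and found this to be about optimal).

This improved the results. Neural networks can overfit, so reducing the dimensionality with PCA is beneficial. In her experiments, this kind of 1-layer fully connected network also underperformed logistic regression and XGBoost.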
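The PCA step described above can be sketched as follows. This is a minimal NumPy version on synthetic data, not the code used in those experiments; the function names `pca_fit` and `pca_transform` and the `variance_to_keep` parameter are hypothetical. The projection is fitted on the training rows only and then reused for the test rows, so no information from the test set leaks into it.

```python
import numpy as np


def pca_fit(X_train, variance_to_keep=0.90):
    """Return the training mean, the leading principal axes that together
    explain at least `variance_to_keep` of the variance, and the fraction
    of variance actually retained."""
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    # Economy-size SVD: with far fewer samples than features, at most
    # n_samples components exist, so this stays cheap.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    cumulative = np.cumsum(explained)
    k = min(int(np.searchsorted(cumulative, variance_to_keep)) + 1, len(cumulative))
    return mean, Vt[:k], cumulative[k - 1]


def pca_transform(X, mean, components):
    """Project rows of X onto the fitted principal axes."""
    return (X - mean) @ components.T


# Synthetic stand-in for the expression matrix (NOT real data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(70, 2000))
X_test = rng.normal(size=(70, 2000))

mean, components, retained = pca_fit(X_train)
Z_train = pca_transform(X_train, mean, components)
Z_test = pca_transform(X_test, mean, components)
print(Z_train.shape, Z_test.shape, round(float(retained), 3))
```

In the question's code, `Z_train` would take the place of `regroup_train`, and the width of `x` and `w` would become `Z_train.shape[1]`. With 70 training rows there can be at most 69 informative components, so the model shrinks from tens of thousands of weights to a few dozen. If scikit-learn is available, `PCA(n_components=0.90)` selects components by explained variance in the same way.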

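A couple of other things she is doing on this problem might apply to you as well:

Multi-task learning turned out to improve results. She originally had 4 different neural networks (4 outputs given the same data). When she combined them into 1 neural network with 4 loss functions, the results improved on all 4.
Instead of PCA, you could use an autoencoder as an alternative dimensionality reduction technique. It is entirely possible to connect the autoencoder to this network and train the two with a combined loss function. I haven't actually tried this (yet), so I can only say that I would expect it to improve results in theory. The PCA approach will be quicker to test, so I would start there.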
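The multi-task idea in the first point can be sketched as one shared hidden layer feeding several sigmoid outputs, with the per-task losses summed into a single objective. This is a minimal NumPy sketch on synthetic data, not the setup from the experiments mentioned above; the layer sizes, learning rate, step count, and the choice of cross-entropy as the per-task loss are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic toy data (NOT real data): 40 samples, 20 features, 4 binary tasks.
n_samples, n_features, n_hidden, n_tasks = 40, 20, 8, 4
X = rng.normal(size=(n_samples, n_features))
Y = (X[:, :n_tasks] > 0).astype(float)  # task t depends on the sign of feature t

# One hidden layer shared by all tasks, then one sigmoid output per task.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_tasks))
b2 = np.zeros(n_tasks)


def forward(X):
    H = np.tanh(X @ W1 + b1)
    P = 1.0 / (1.0 + np.exp(-(H @ W2 + b2)))
    return H, P


def per_task_loss(P, Y):
    """Mean binary cross-entropy, computed separately for each task."""
    eps = 1e-12
    return -np.mean(Y * np.log(P + eps) + (1 - Y) * np.log(1 - P + eps), axis=0)


learning_rate = 0.1
_, P = forward(X)
initial_loss = per_task_loss(P, Y).sum()

for step in range(500):
    H, P = forward(X)
    # Gradient of the summed per-task losses with respect to the output logits.
    dZ2 = (P - Y) / n_samples
    dW2 = H.T @ dZ2
    db2 = dZ2.sum(axis=0)
    # Every task's error flows back into the same shared hidden layer.
    dZ1 = (dZ2 @ W2.T) * (1 - H ** 2)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2

_, P = forward(X)
final_loss = per_task_loss(P, Y).sum()
print("combined loss before:", initial_loss, "after:", final_loss)
```

The design point is that `W1` receives gradient from all four losses at once, so the shared representation has to be useful for every task. An autoencoder could be attached in the same way, by adding a reconstruction loss on a decoder output to the same sum.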
