What is a faster and more Pythonic way to read the CSV and make a data frame from it?
Input: A CSV with 50,000 rows, each row containing 910 columns with values 0/1.
Output: A data frame to run my CNN on.
I wrote code that reads the CSV line by line. For each line, I split the data into two parts, called neurons (900 columns) and labels (10 columns). Since these are lists, I convert them to NumPy arrays. As I go to the next line, I do the same thing and stack the arrays to eventually get the 4 conventional datasets:
x_train, x_test, y_train, y_test
My code works: I tested it on a small CSV with just 6 rows. But on the actual dataset of 50,000 rows it takes forever, after the array initialization, to convert the rows to a data frame.
So I was wondering whether there is a faster way to do this conversion, or whether it is okay to just wait.
Here is my code:
import numpy as np
import pandas as pd
import time
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils
from sklearn.model_selection import train_test_split
# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset_labelled.csv")
start_init = time.time()
# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
xvalues = np.zeros((900,), dtype=int)
yvalues = np.zeros((10,), dtype=int)
print("--- Arrays initialized in %s seconds ---" % (time.time() - start_init))
start_conversion = time.time()
for row in df.itertuples(index=False):
    # separate the neurons from the labels
    x = list(row[:900])
    y = list(row[900:])
    # convert the lists to numpy arrays
    x = np.array(x)
    y = np.array(y)
    xvalues = np.vstack((xvalues, x))
    yvalues = np.vstack((yvalues, y))
print("--- CSV rows converted to dataframe in %s seconds ---" % (time.time() - start_conversion))
start_split = time.time()
x_train, x_test, y_train, y_test = train_test_split(xvalues, yvalues, test_size=0.2)
print("--- Dataframe split into training and testing datasets in %s seconds ---" % (time.time() - start_split))
num_classes = y_test.shape[1]
num_neurons = x_train[0].shape[0]
# define baseline model
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(
        num_neurons,
        input_dim = num_neurons,
        kernel_initializer = 'normal',
        activation = 'relu'
    ))
    model.add(Dense(
        num_classes,
        kernel_initializer = 'normal',
        activation = 'softmax'
    ))
    # compile model
    model.compile(
        loss = 'categorical_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy'])
    return model
# build the model
model = baseline_model()
# fit the model
model.fit(x_train, y_train, validation_data = (x_test, y_test),
          epochs = 10, batch_size = 200, verbose = 2)
# final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Baseline error: %0.2f%%" % (100-scores[1]*100))
It is just stuck here:
Rachayitas-MacBook-Pro:bci_hp rachayitagiri$ python3 binarycnn.py
Using TensorFlow backend.
--- Arrays initialized in 2.4080276489257812e-05 seconds ---
Any suggestions will be appreciated! Thank you!
Edit: Added the console output as text instead of a picture. Thank you for the suggestion.
You probably won't be able to beat read_csv: it works out of the box and is likely better tested than any other solution out there.
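If parsing itself ever becomes the bottleneck, one option is to tell read_csv the column type up front. The following is only a minimal sketch: it assumes every column holds nothing but 0/1 values, and it uses a tiny in-memory CSV (the 5-column toy data and the name `df_small` are hypothetical) as a stand-in for the real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: a header row plus 3 rows of 0/1 values.
csv_text = "c0,c1,c2,c3,c4\n0,1,0,1,1\n1,1,0,0,1\n0,0,1,0,0\n"

# dtype=np.uint8 stores each value in one byte instead of the default int64.
df_small = pd.read_csv(io.StringIO(csv_text), dtype=np.uint8)

print(df_small.dtypes.unique())
print(df_small.shape)
```

Whether this makes a noticeable difference on a 50,000 x 910 file would need to be measured.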
From what I see, your problem is not with the read_csv function, but with the way you extract the information from the DataFrame. You could get xvalues and yvalues directly from the DataFrame instead of reading it line by line, which is very costly. DataFrames let you do that in a well-optimized way.
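To see why the row-by-row loop is so slow: np.vstack allocates a new array and copies everything accumulated so far on every call, so the total work grows roughly quadratically with the number of rows. The sketch below uses a hypothetical 200 x 910 frame (the names `toy`, `stacked` and `sliced` are made up for illustration) and checks that the loop and a single positional slice produce the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 200 rows, 910 columns of 0/1 values.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 2, size=(200, 910)))

# Row-by-row approach: every vstack call copies the whole accumulated array.
stacked = np.empty((0, 900), dtype=np.int64)
for row in toy.itertuples(index=False):
    stacked = np.vstack((stacked, np.array(row[:900])))

# Vectorised approach: one positional slice, no Python-level loop.
sliced = toy.iloc[:, :900].to_numpy()

print(np.array_equal(stacked, sliced))
```

Starting from an empty (0, 900) array also avoids carrying a leading row of zeros into the result.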
From what I understood, your X values are in the first 900 columns and the Y values come after that. Here's how I would go about it:
import pandas as pd
import numpy as np
import time
start_init = time.time()
df = pd.DataFrame(np.random.randint(0,100,size=(50000, 910)))
print("--- DataFrame initialized in %s seconds ---" % (time.time() - start_init))
start_conversion = time.time()
# iloc slices by position and excludes the end, so :900 is exactly 900 columns
# (loc[:, :900] is label-based and would also include the column labelled 900)
x = df.iloc[:, :900] # Here's where you get your x values, the first 900 values in each row
y = df.iloc[:, 900:] # And here you retrieve the y values
# All that's left is to convert them to NumPy arrays
xvalues = x.values
yvalues = y.values
print("--- Took data out of DataFrame in %s seconds ---" % (time.time() -
    start_conversion))
print(x.shape, y.shape)
This code prints the following:
--- DataFrame initialized in 0.6232161521911621 seconds ---
--- Took data out of DataFrame in 0.038640737533569336 seconds ---
(50000, 900) (50000, 10)
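One detail worth checking when adapting this: with default integer column labels, a label-based `.loc` slice includes its end label, while a position-based `.iloc` slice excludes its end position. A short sketch on a hypothetical 4-row frame with the same 900 + 10 column layout (the names `toy`, `by_label`, `by_position` and `labels` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same layout: 900 feature columns + 10 label columns.
toy = pd.DataFrame(np.zeros((4, 910), dtype=np.uint8))

by_label = toy.loc[:, :900]      # label slice: end label 900 is included
by_position = toy.iloc[:, :900]  # position slice: end position 900 is excluded
labels = toy.iloc[:, 900:]       # the remaining 10 label columns

print(by_label.shape, by_position.shape, labels.shape)
# prints: (4, 901) (4, 900) (4, 10)
```

The arrays from `by_position.to_numpy()` and `labels.to_numpy()` can then be passed to `train_test_split` as in the question.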