
What is a faster and more Pythonic way to read the CSV and make a data frame from it?

Input: A CSV with 50,000 rows, each row containing 910 columns with value 0 or 1.
Output: A data frame to run my CNN on.

I wrote code that reads the CSV line by line. For each line, I split the data into two parts: neurons (900 columns) and labels (10 columns). Since these are lists, I convert them to NumPy arrays. I do the same for every following line and stack the arrays, to eventually get the four conventional datasets:
x_train, x_test, y_train, y_test

My code works: I tested it on a small CSV with just 6 rows. But on the actual dataset of 50,000 rows it takes forever, after the array initialization, to convert the rows to a data frame.

So I was wondering whether there is a faster way to do this conversion, or whether it is okay to just wait.

Here is my code:

import numpy as np
import pandas as pd
import time
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset_labelled.csv")

start_init = time.time()

# np.int was removed in NumPy 1.24; the built-in int gives the same dtype
xvalues = np.zeros((900,), dtype=int)
yvalues = np.zeros((10,), dtype=int)

print("--- Arrays initialized in %s seconds ---" % (time.time() - start_init))

start_conversion = time.time()

for row in df.itertuples(index=False):
    # separate the neurons from the labels
    x = list(row[:900])
    y = list(row[900:])

    # convert the lists to numpy arrays
    x = np.array(x) 
    y = np.array(y)

    xvalues = np.vstack((xvalues, x))
    yvalues = np.vstack((yvalues, y))

print("--- CSV rows converted to dataframe in %s seconds ---" % (time.time() - start_conversion))

start_split = time.time()

x_train, x_test, y_train, y_test = train_test_split(xvalues, yvalues, test_size=0.2)

print("--- Dataframe split into training and testing datasets in %s seconds ---" % (time.time() - start_split))

num_classes = y_test.shape[1]
num_neurons = x_train[0].shape[0]

# define baseline model
def baseline_model():
    #create model
    model = Sequential()
    model.add(Dense(
        num_neurons, 
        input_dim = num_neurons,
        kernel_initializer = 'normal',
        activation = 'relu'
    ))
    model.add(Dense(
        num_classes,
        kernel_initializer = 'normal',
        activation = 'softmax'
        ))
    #compile model
    model.compile(
        loss = 'categorical_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy'])
    return model

# build the model
model = baseline_model()

# fit the model
model.fit(x_train, y_train, validation_data = (x_test, y_test),
    epochs = 10, batch_size = 200, verbose = 2)

# final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Baseline error: %0.2f%%" % (100-scores[1]*100))

It is just stuck here:

Rachayitas-MacBook-Pro:bci_hp rachayitagiri$ python3 binarycnn.py 
Using TensorFlow backend.
--- Arrays initialized in 2.4080276489257812e-05 seconds ---

Any suggestions will be appreciated! Thank you!

Edit: I have replaced the picture with the console output as text. Thank you for the suggestion.

You probably can't beat read_csv: it works out of the box and is likely better tested than any other solution out there.

From what I see, your problem is not with the read_csv function but with the way you extract the data from the DataFrame. Reading the DataFrame row by row is very costly. You can instead take xvalues and yvalues directly from the DataFrame, which handles this kind of selection in an optimized way.
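To illustrate the cost, here is a minimal sketch that runs the row-by-row approach from the question next to a direct column slice. The 200-row random frame is only a stand-in for the real CSV, and the names x_loop, y_loop, x_fast and y_fast are made up for this comparison. Each np.vstack call copies everything accumulated so far, so the loop's total work grows roughly quadratically with the number of rows.

```python
import numpy as np
import pandas as pd

# Small stand-in for the real data: 200 rows, 900 feature columns + 10 label columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 910)))

# Row-by-row approach from the question: every np.vstack call allocates a new
# array and copies all previously stacked rows into it.
x_loop = np.zeros((900,), dtype=int)
y_loop = np.zeros((10,), dtype=int)
for row in df.itertuples(index=False):
    x_loop = np.vstack((x_loop, np.array(row[:900])))
    y_loop = np.vstack((y_loop, np.array(row[900:])))

# Vectorized approach: one positional slice per block, no Python-level loop.
x_fast = df.iloc[:, :900].to_numpy()
y_fast = df.iloc[:, 900:].to_numpy()

# Compare the shapes and check that the data rows agree.
print(x_loop.shape, x_fast.shape)
print(np.array_equal(x_loop[1:], x_fast))
```

Note that the loop result has one more row than the sliced result: the all-zero arrays used for initialization stay in the stack as a first row, so the loop version also feeds one artificial sample into the training data.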

From what I understand, your X values are in the first 900 columns and the Y values come after them. Here's how I would go about it:

import pandas as pd
import numpy as np
import time


start_init = time.time()
df = pd.DataFrame(np.random.randint(0,100,size=(50000, 910)))
print("--- DataFrame initialized in %s seconds ---" % (time.time() - start_init))

start_conversion = time.time()

# .iloc slices by position and excludes the end, so x has exactly 900 columns
# (.loc[:, :900] is label-based and inclusive, which would return 901 columns)
x = df.iloc[:, :900]  # x values: the first 900 columns of each row
y = df.iloc[:, 900:]  # y values: the remaining 10 columns

# All that's left is to convert them to NumPy arrays
xvalues = x.to_numpy()
yvalues = y.to_numpy()

print("--- Took data out of DataFrame in %s seconds ---" % (time.time() - start_conversion))
print(x.shape, y.shape)

This code prints the following:

--- DataFrame initialized in 0.6232161521911621 seconds ---
--- Took data out of DataFrame in 0.038640737533569336 seconds ---
(50000, 900) (50000, 10)
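Applied to a CSV file rather than random data, the same idea might look like the sketch below. The helper name load_xy is hypothetical, and two things are assumptions about the file rather than facts from the question: that it has no header line (header=None) and that every value fits in an unsigned 8-bit integer (dtype=np.uint8, which is enough for 0/1 data). The in-memory CSV stands in for bci_dataset_labelled.csv.

```python
import io

import numpy as np
import pandas as pd


def load_xy(source, n_features=900):
    """Hypothetical helper: read a 0/1 CSV and split features from labels."""
    # Assumption: the file has no header row, so header=None keeps line 1 as data.
    # Assumption: all values are 0/1, so uint8 is sufficient and saves memory.
    df = pd.read_csv(source, header=None, dtype=np.uint8)
    data = df.to_numpy()
    # Positional slicing: first n_features columns are inputs, the rest are labels.
    return data[:, :n_features], data[:, n_features:]


# Tiny in-memory CSV standing in for the real file: 4 rows, 3 features + 2 labels.
csv_text = "0,1,1,1,0\n1,0,0,0,1\n1,1,0,1,0\n0,0,1,0,1\n"
xvalues, yvalues = load_xy(io.StringIO(csv_text), n_features=3)
print(xvalues.shape, yvalues.shape)
```

For the real file, the call would take the file path and keep the default n_features=900. The two arrays it returns can then go to train_test_split exactly as in the question.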


 