
What is a faster and more Pythonic way to read the CSV and make a data frame from it?

Input: A CSV with 50,000 rows, each row containing 910 columns of 0/1 values.
Output: A data frame to run my CNN on.

I wrote code that reads the CSV line by line. For each line, I split the data into two parts, called neurons (900 columns) and labels (10 columns). Since these are lists, I convert them to NumPy arrays. For each subsequent line I do the same thing and stack the arrays, to eventually get the four conventional datasets:
x_train, x_test, y_train, y_test

My code works: I tested it on a small CSV with just 6 rows. But when I run it on the actual dataset of 50,000 rows, it takes forever, after the array initialization, to convert the rows to a data frame.

So I was wondering whether there is a faster way to do this conversion, or whether it is okay to just wait.

Here is my code:

import numpy as np
import pandas as pd
import time
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.utils import np_utils
from sklearn.model_selection import train_test_split

# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset_labelled.csv")

start_init = time.time()

# np.int was removed in NumPy 1.24; the built-in int is the equivalent dtype
xvalues = np.zeros((900,), dtype=int)
yvalues = np.zeros((10,), dtype=int)

print("--- Arrays initialized in %s seconds ---" % (time.time() - start_init))

start_conversion = time.time()

for row in df.itertuples(index=False):
    # separate the neurons from the labels
    x = list(row[:900])
    y = list(row[900:])

    # convert the lists to numpy arrays
    x = np.array(x) 
    y = np.array(y)

    xvalues = np.vstack((xvalues, x))
    yvalues = np.vstack((yvalues, y))

print("--- CSV rows converted to dataframe in %s seconds ---" % (time.time() - start_conversion))

start_split = time.time()

x_train, x_test, y_train, y_test = train_test_split(xvalues, yvalues, test_size=0.2)

print("--- Dataframe split into training and testing datasets in %s seconds ---" % (time.time() - start_split))

num_classes = y_test.shape[1]
num_neurons = x_train[0].shape[0]

# define baseline model
def baseline_model():
    #create model
    model = Sequential()
    model.add(Dense(
        num_neurons, 
        input_dim = num_neurons,
        kernel_initializer = 'normal',
        activation = 'relu'
    ))
    model.add(Dense(
        num_classes,
        kernel_initializer = 'normal',
        activation = 'softmax'
        ))
    #compile model
    model.compile(
        loss = 'categorical_crossentropy',
        optimizer = 'adam',
        metrics = ['accuracy'])
    return model

# build the model
model = baseline_model()

# fit the model
model.fit(x_train, y_train, validation_data = (x_test, y_test),
    epochs = 10, batch_size = 200, verbose = 2)

# final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Baseline error: %0.2f%%" % (100-scores[1]*100))

It is just stuck here:

Rachayitas-MacBook-Pro:bci_hp rachayitagiri$ python3 binarycnn.py 
Using TensorFlow backend.
--- Arrays initialized in 2.4080276489257812e-05 seconds ---

Any suggestions will be appreciated! Thank you!

Edit: Replaced the picture of the console output with text. Thank you for the suggestion.

You probably can't beat read_csv: it works out of the box and is likely better tested than any other solution out there.

From what I see, your problem is not with the read_csv function, but with the way you extract the information from the DataFrame. You could get xvalues and yvalues directly from the DataFrame instead of reading it line by line, which is very costly. DataFrames let you do that in a well-optimized way.

From what I understood, your X values are in the first 900 columns and the Y values come after that. Here's how I would go about it:
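To make the cost concrete, here is a minimal sketch comparing the two approaches on a small random frame (the 200-row size and the seed are arbitrary assumptions, not the real dataset). Each np.vstack call allocates a new array and copies everything accumulated so far, so the total work grows roughly quadratically with the number of rows, while a single positional slice does the job in one step. Note also that starting the accumulator from an np.zeros row, as in the question, leaves an extra all-zero first row in the result; the sketch starts from an empty array to avoid that.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: 200 rows of 910 binary columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(200, 910)))

# Row-by-row approach: every np.vstack call copies the whole accumulated array.
stacked = np.empty((0, 900), dtype=int)
for row in df.itertuples(index=False):
    stacked = np.vstack((stacked, np.array(row[:900])))

# Vectorized approach: one positional slice, one conversion to NumPy.
sliced = df.iloc[:, :900].to_numpy()

# Both approaches produce the same 200 x 900 array.
print(stacked.shape, sliced.shape, np.array_equal(stacked, sliced))
```

The two results are identical; only the amount of copying differs.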

import pandas as pd
import numpy as np
import time


start_init = time.time()
df = pd.DataFrame(np.random.randint(0,100,size=(50000, 910)))
print("--- DataFrame initialized in %s seconds ---" % (time.time() - start_init))

start_conversion = time.time()

# iloc slices by position and excludes the end, so this is exactly the first 900 columns
# (label-based df.loc[:, :900] includes column 900 and would return 901 columns)
x = df.iloc[:, :900]
y = df.iloc[:, 900:]  # the remaining 10 columns are the y values

# All that's left is to convert them to NumPy arrays
xvalues = x.values
yvalues = y.values

print("--- Took data out of DataFrame in %s seconds ---" % (time.time() - start_conversion))
print(x.shape, y.shape)

I get the following output for this code:

--- DataFrame initialized in 0.6232161521911621 seconds ---
--- Took data out of DataFrame in 0.038640737533569336 seconds ---
(50000, 900) (50000, 10)
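Applied to an actual CSV rather than random data, the same idea might look like the sketch below. The helper name load_xy, the in-memory demo CSV, and the dtype=np.uint8 argument are assumptions made for illustration: the dtype only makes sense if every cell really is 0 or 1, and the sketch assumes the file has a header row, which is what a default pd.read_csv call expects.

```python
import io

import numpy as np
import pandas as pd


def load_xy(source, n_features=900):
    """Read a CSV and split it into feature and label arrays by column position.

    load_xy is a hypothetical helper, not part of pandas.
    """
    # Assumption: all cells are 0/1, so a compact unsigned 8-bit dtype is enough.
    df = pd.read_csv(source, dtype=np.uint8)
    x = df.iloc[:, :n_features].to_numpy()  # first n_features columns
    y = df.iloc[:, n_features:].to_numpy()  # everything after them
    return x, y


# Hypothetical in-memory CSV standing in for bci_dataset_labelled.csv:
# a header row plus 6 data rows of 910 binary values.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 2, size=(6, 910)))
buffer = io.StringIO(demo.to_csv(index=False))

xvalues, yvalues = load_xy(buffer)
print(xvalues.shape, yvalues.shape)  # (6, 900) (6, 10)
```

The resulting xvalues and yvalues can be passed to train_test_split exactly as in the question's code, with no per-row loop in between.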

