Faster way to iterate over CSV rows?

Question

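I have a CSV file of roughly 28,000 rows x 785 columns. I need to 1) separate out the column header, 2) put the first column of each row into a labels array, and 3) turn the remaining 784 columns of each row into a 28x28 matrix and, after converting the values to floats, append it to my images array.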

Is there a faster way to iterate over my CSV?

    images = np.array([])
    labels = np.array([])

    with open(filename) as training_file:
        reader = csv.reader(training_file, delimiter=',')
        header = np.array(next(reader))

        for row in reader:
            label = row[0] # get each row's label

            pixels = row[1:785] # get pixel values of each row
            pixels = np.array(pixels).astype(float) # transform pixel values to floats
            pixels = pixels.reshape(28,28) # turn into 28x28 matrix

            labels = np.append(labels, np.array(label)) # append to labels array
            images = np.append(images, np.array(pixels)) # append to images array

Answer 1

You can use pandas to read your CSV file.

import pandas as pd
csv_file = pd.read_csv('file.csv')

Each column is then accessible by its header name, as csv_file.name.

Depending on the size of the data, you can read the file in chunks:

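import pandas as pd
# With chunksize set, read_csv returns an iterator of DataFrames, not one DataFrame.
# chunksize=1 yields a single row per chunk; a larger value is usually faster.
csv_file = pd.read_csv('file.csv', chunksize=1)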

In any case, read the pandas documentation; I think it is the best way forward.
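As a rough sketch of how the pandas route could cover all three steps from the question, the following reads a tiny in-memory CSV standing in for the real file. The sample data, the helper name load_images, and the assumption that the label sits in the first column are illustrative and not part of the answer above.

```python
import io

import numpy as np
import pandas as pd


def load_images(source, side=28):
    """Read a CSV whose first column is the label and the rest are pixels."""
    frame = pd.read_csv(source)                   # header row becomes frame.columns
    header = frame.columns.to_numpy()
    labels = frame.iloc[:, 0].to_numpy()          # first column of every row
    pixels = frame.iloc[:, 1:].to_numpy(dtype=float)
    images = pixels.reshape(-1, side, side)       # one side x side matrix per row
    return header, labels, images


# Hypothetical stand-in for the real file: 3 rows, 1 label + 4 pixels (2x2 images).
sample = io.StringIO("label,p0,p1,p2,p3\n5,0,1,2,3\n0,4,5,6,7\n9,8,9,10,11\n")
header, labels, images = load_images(sample, side=2)
```

For the file in the question, the call would presumably be load_images(filename) with the default side of 28.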

Answer 2

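I think creating arrays is expensive. Appending to an array recreates it behind the scenes, which is also costly. You can allocate all the memory at once, for example: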

x = np.empty((28000,784))

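Then store each row of the file in the corresponding row of the array. Updating an array in place is very fast and highly optimized. When you are done, you can change the shape with x.shape = (28000,28,28). Note that an array's shape and its memory allocation are decoupled in numpy, so reshaping costs nothing (it only changes how the values are accessed; it does not move them). This means there is no reason to reshape each row before adding it to the array.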
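A minimal sketch of this preallocate-then-fill idea, using a three-row in-memory CSV as a hypothetical stand-in for the real file (the sample data and variable names other than x are assumptions made for illustration):

```python
import csv
import io

import numpy as np

# Hypothetical stand-in for the real file: 3 rows, 1 label + 4 pixels each.
text = "label,p0,p1,p2,p3\n5,0,1,2,3\n0,4,5,6,7\n9,8,9,10,11\n"
n_rows, n_pixels = 3, 4           # known up front here; 28000 and 784 in the question

x = np.empty((n_rows, n_pixels))  # one allocation, contents uninitialized
labels = np.empty(n_rows)

reader = csv.reader(io.StringIO(text))
header = next(reader)             # skip the column names
for i, row in enumerate(reader):
    labels[i] = float(row[0])
    x[i] = np.array(row[1:], dtype=float)  # overwrite row i in place

images = x.reshape(n_rows, 2, 2)  # a view: same memory, different shape
```

np.shares_memory(x, images) can be used to check that the reshape did not copy the data.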

Answer 3

Iterating takes almost no time. The problem is that you are using an extremely inefficient method to build the arrays.

Never do this in a loop with numpy.ndarray objects:

labels = np.append(labels, np.array(label)) # append to labels array
images = np.append(images, np.array(pixels)) # append to images array

Instead, make labels and images lists:

labels = []
images = []

Then, in your loop, append to the list objects (an efficient operation):

labels.append(np.array(label)) # append to labels list
images.append(np.array(pixels)) # append to images list

Finally, after the loop is done, convert the lists of arrays to arrays:

labels = np.array(labels)
images = np.array(images)

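Note that I'm not sure what shape you expect the final arrays to have; you may need to reshape the result. Your approach flattens the final array on every .append, since you don't specify an axis... if that is really what you want, then labels.ravel() will get you that in the end.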
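To make the shape difference concrete, here is a small sketch comparing the two approaches on three made-up 2x2 "images" (the sample data and the names rows, flat, collected, and stacked are illustrative assumptions):

```python
import numpy as np

# Three hypothetical 2x2 "images" standing in for the 28x28 ones.
rows = [np.arange(4.0).reshape(2, 2) + 10 * k for k in range(3)]

# np.append without an axis flattens its inputs and copies on every call.
flat = np.array([])
for pixels in rows:
    flat = np.append(flat, pixels)

# Appending to a list keeps each 2x2 block; convert once at the end.
collected = []
for pixels in rows:
    collected.append(pixels)
stacked = np.array(collected)
```

Here flat ends up one-dimensional with 12 values, while stacked has shape (3, 2, 2); stacked.ravel() recovers the flattened form if that is what is wanted.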

Answer 4

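As some have suggested:

Recreating arrays and repeatedly appending to them is computationally expensive. Instead, I created empty arrays up front. This made an already relatively fast computation even faster.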

    import csv
    import numpy as np

    with open(filename) as training_file:
        reader = csv.reader(training_file, delimiter=',')
        header = np.array(next(reader)) # column headers

        rows = list(reader) # read the remaining rows once; the reader cannot be rewound
        row_count = len(rows)

        images = np.empty((row_count, 784)) # preallocated, uninitialized float array
        labels = np.empty((row_count,)) # preallocated, uninitialized float array

        for i, row in enumerate(rows):
            labels[i] = float(row[0]) # store each row's label
            images[i] = np.array(row[1:785], dtype=float) # store pixel values of each row

    images = images.reshape(-1, 28, 28) # reshape once at the end

4 answers

Answer 1  1  2020-04-24 02:06:00
Answer 2  1  2020-04-24 02:25:54
Answer 3  0  2020-04-24 02:11:49
Answer 4  0  accepted  2020-04-24 03:12:05
