
How can I optimize this code to cope with a larger dataset?

I have a set of images of the same size, and I want to insert them into a DataFrame, with the rows being the names of the images and the columns being the pixels. They are all in the same directory.

  • I can already do this for a folder with a few images (as shown in the "Example for 7 images" link below), but when I try it on a dataset of 9912 images, the run ends with "Killed". How can I optimize this code to get all the images?
from matplotlib import image
import numpy as np
import pandas as pd
import glob


# One "file" column plus one column per flattened pixel value.
columns = ["file"]
for i in range(150528):
    columns.append("pixel" + str(i))

df = pd.DataFrame(columns=columns)
i = 0
for file in glob.glob('/home/nuno/resizepics/*.jpg'):
    imgarr = image.imread(file)   # load the image as a NumPy array
    imgarr = imgarr.flatten()     # 1-D array of pixel values

    df.loc[i, "file"] = file
    # Copy the pixel values into the row one cell at a time.
    for j in range(len(imgarr)):
        df.iloc[i, j + 1] = imgarr[j]

    i += 1

#print(df)


df.to_csv('pixels.csv')

Example for 7 images

If "killed" means it raises an error you can try using exeptions (try, except, else) and make it try again from the spot it stopped.如果“killed”意味着它会引发错误,您可以尝试使用exeptions (try, except, else)并让它从停止的地方重试。 You can also try to delay it a bit with time module because it works with large data.您也可以尝试使用time模块稍微延迟它,因为它适用于大数据。
