How to work with a Kaggle dataset in a zip file?

I am working on this Kaggle dataset from 'APTOS 2019 Blindness Detection', and the dataset is inside a zip file. I want to preprocess the dataset to feed it into a deep learning model.

My code looks like this:

train_dir = '../input/train_images'
train_labels = pd.read_csv('../input/train.csv')
train_labels['diagnosis'] = train_labels['diagnosis'].astype(str)

test_dir = '../input/test_images'

Then, to preprocess, I wrote:

from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(
    rotation_range=40, 
    width_shift_range=0.2, 
    height_shift_range=0.2, 
    shear_range=0.2, 
    zoom_range=0.2,
    horizontal_flip=True, 
    vertical_flip=True, 
    rescale=1./255,)

test_datagen = ImageDataGenerator(rescale = 1./255)

train_generator = train_datagen.flow_from_dataframe(
    train_labels[:3295], 
    directory=train_dir, 
    x_col='id_code', y_col='diagnosis', 
    target_size=(150, 150), 
    color_mode='rgb', 
    class_mode='categorical', 
    batch_size=32, 
    shuffle=True,)

validation_generator = test_datagen.flow_from_dataframe(
    train_labels[3295:], 
    directory=train_dir, 
    x_col='id_code', y_col='diagnosis', 
    target_size=(150, 150), 
    color_mode='rgb', 
    class_mode='categorical', 
    batch_size=32, 
    shuffle=True,)

But when I run the code, I get output saying:

Found 0 validated image filenames belonging to 0 classes.
Found 0 validated image filenames belonging to 0 classes.
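(A common cause of "Found 0 validated image filenames" is that flow_from_dataframe checks each x_col value against the files on disk, and in this dataset the id_code column has no ".png" extension, so every lookup fails. A minimal sketch of that validation logic in plain Python, with made-up filenames and a temporary directory standing in for train_images:)

```python
import os
import tempfile

# simulate a train_images directory containing .png files
img_dir = tempfile.mkdtemp()
for code in ["000c1434d8d7", "001639a390f0"]:
    open(os.path.join(img_dir, code + ".png"), "wb").close()

id_codes = ["000c1434d8d7", "001639a390f0"]  # values as read from the CSV

# roughly what the generator's validation does: keep rows whose file exists
valid = [c for c in id_codes if os.path.exists(os.path.join(img_dir, c))]
print(len(valid))  # 0 -- no extension, so nothing matches

valid = [c for c in id_codes if os.path.exists(os.path.join(img_dir, c + ".png"))]
print(len(valid))  # 2 -- appending '.png' makes the lookup succeed
```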

I have also tried unzipping the files, but it won't unzip; it fails with:

FileNotFoundError: [Errno 2] No such file or directory: 'train_images.zip'

# importing required modules
from zipfile import ZipFile

# specifying the zip file name
file_name = "../input/train_images.zip"

# opening the zip file in read mode
# (named zf rather than zip, to avoid shadowing the builtin)
with ZipFile(file_name, 'r') as zf:
    # extracting all the files
    print('Extracting all the files now...')
    zf.extractall()
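(A FileNotFoundError here usually means the archive lives at a different path than expected, so it helps to confirm the path before opening it. A self-contained sketch of that check, using a temporary directory and a tiny fabricated zip in place of the real Kaggle paths:)

```python
import os
import tempfile
import zipfile

# build a tiny zip in a temporary directory to stand in for train_images.zip
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "train_images.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("train_images/0001.png", b"fake image bytes")

# check the path before opening -- if this fails, list the parent
# directory to see where the archive actually is
assert os.path.exists(zip_path), f"archive missing: {zip_path}"

out_dir = os.path.join(workdir, "extracted")
with zipfile.ZipFile(zip_path, "r") as zf:
    zf.extractall(out_dir)

print(os.listdir(os.path.join(out_dir, "train_images")))  # ['0001.png']
```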

So can someone help me fix this problem? I would appreciate it.

You already have the images unzipped in the directory ../input/train_images.
Run this in your kernel:

from os import listdir
listdir('../input/train_images/')


Use ImageDataGenerator.flow_from_directory() to use the images in that directory with your generator.
Check the Keras docs: https://keras.io/preprocessing/image/#imagedatagenerator-methods
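(Note that flow_from_directory infers labels from one subfolder per class, e.g. train/0/xxx.png, train/1/yyy.png. If your labels live in a CSV instead, the images must first be sorted into such class folders. A hypothetical sketch of that step, with placeholder filenames and labels standing in for the real id_code/diagnosis rows:)

```python
import os
import shutil
import tempfile

root = tempfile.mkdtemp()
src = os.path.join(root, "train_images")
os.makedirs(src)

# stand-ins for real image files and for CSV rows (id_code -> diagnosis)
labels = {"img_a": "0", "img_b": "2"}
for name in labels:
    open(os.path.join(src, name + ".png"), "wb").close()

# copy each image into a subfolder named after its class label
dst = os.path.join(root, "train_by_class")
for name, cls in labels.items():
    cls_dir = os.path.join(dst, cls)
    os.makedirs(cls_dir, exist_ok=True)
    shutil.copy(os.path.join(src, name + ".png"), cls_dir)

print(sorted(os.listdir(dst)))  # ['0', '2']
```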

I got stuck on this on Kaggle today! It was the first time I had looked at an archived dataset.

Now I know people say: just do listdir('../input/') and you will see them! Or look at '../input/train_images/'. But all I found were the zip files and the CSVs!

So what I did was extract the zipped training and testing datasets to the Kaggle working directory.

This was for aerial-cactus-identification. The input directory looks like /input/aerial-cactus-identification/ and contains train.zip, test.zip, and train.csv (filenames + classes).

I went ahead and ran:

import os
import zipfile

Dataset = "train"

with zipfile.ZipFile("../input/aerial-cactus-identification/" + Dataset + ".zip", "r") as z:
    z.extractall(".")

print(os.listdir("../working/"))

And yup, it is extracted to the working directory. And the same thing for test.zip:

Dataset = "test"

with zipfile.ZipFile("../input/aerial-cactus-identification/" + Dataset + ".zip", "r") as z:
    z.extractall(".")

print(os.listdir("../working/"))
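(Since the same extract-and-list step is repeated for train and test, it can be folded into a small helper. A self-contained sketch, where fabricated zips in a temporary directory stand in for the real Kaggle archives:)

```python
import os
import tempfile
import zipfile

def extract_dataset(archive, dest="."):
    """Extract a zip archive into dest and return the member names."""
    with zipfile.ZipFile(archive, "r") as zf:
        zf.extractall(dest)
        return zf.namelist()

# demo: build fake train.zip / test.zip in a temp dir, then extract both
root = tempfile.mkdtemp()
for name in ("train", "test"):
    with zipfile.ZipFile(os.path.join(root, name + ".zip"), "w") as zf:
        zf.writestr(name + "/sample.png", b"")

extracted = []
for name in ("train", "test"):
    extracted += extract_dataset(os.path.join(root, name + ".zip"), root)

print(extracted)  # ['train/sample.png', 'test/sample.png']
```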

I had read the CSVs earlier:

import pandas as pd

traindf = pd.read_csv('../input/aerial-cactus-identification/train.csv', dtype=str)

testdf = pd.read_csv('../input/aerial-cactus-identification/sample_submission.csv', dtype=str)

So I just use flow_from_dataframe after extracting the archives:

train_generator = datagen.flow_from_dataframe(
    dataframe=traindf,
    directory="../working/train/",
    x_col="id",
    y_col="has_cactus",
    subset="training",
    batch_size=32,
    seed=42,
    shuffle=True,
    class_mode="binary",
    target_size=(150, 150))
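(One caveat: subset="training" only has an effect if the generator, datagen here, defined elsewhere in the notebook, was created with a validation_split argument; the rows are then partitioned deterministically by that fraction. A rough plain-Python illustration of such a fractional split, not Keras' exact internals:)

```python
def split_indices(n, validation_split=0.2):
    """Partition indices 0..n-1 into a training part and a
    validation part by a fixed fraction (deterministic)."""
    cut = int(n * (1 - validation_split))
    return list(range(cut)), list(range(cut, n))

train_idx, val_idx = split_indices(10, 0.2)
print(len(train_idx), len(val_idx))  # 8 2
```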

My notebook for it is public and is here.
