Converting a pandas dataframe into a torch Dataset

I have a pandas dataframe with the following structure:

path | sentence | speech | input_values | labels
audio1.mp3 | This is the first audio | [[0.0, 0.0, 0.0, ..., 0.0, 0.0]] | [[0.00005, ..., 0.0003]] | [23, 4, 6, 11, ..., 12]
audio2.mp3 | This is the second audio | [[0.0, 0.0, 0.0, ..., 0.0, 0.0]] | [[0.000044, ..., 0.00033]] | [23, 4, 6, 11, ..., 12]

The sentence column is the transcription of the audio, the speech column is the array representation of the audio, and labels is the numeric representation of each letter of the sentence based on a defined vocab list.

I'm fine-tuning a pre-trained ASR model, but when I pass the pandas df to the Trainer class and call .train() on it, it errors out (KeyError: 0). According to the documentation, Trainer only accepts torch.utils.data.Dataset or torch.utils.data.IterableDataset as the train_dataset/eval_dataset arguments. This is what my Trainer definition looks like:

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=ds_train, 
    eval_dataset=ds_test,
    tokenizer=processor.feature_extractor
)
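
A likely explanation for the KeyError: 0, assuming the data loader behind the Trainer fetches samples by integer position (dataset[0], dataset[1], ...): on a DataFrame, square brackets select a column, so 0 is looked up as a column name. The two-row frame below is a hypothetical stand-in for ds_train.

```python
import pandas as pd

# Hypothetical two-row frame standing in for ds_train.
df = pd.DataFrame({
    "path": ["audio1.mp3", "audio2.mp3"],
    "labels": [[23, 4, 6], [23, 4, 11]],
})

# A map-style dataset is indexed by position. On a DataFrame, df[0] is a
# column lookup, and there is no column named 0.
try:
    df[0]
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

# Selecting a row by position needs .iloc instead.
first_row = df.iloc[0]
print(error_name, first_row["path"])
```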

ds_train and ds_test are my training and validation dataframes respectively; I just split my main dataframe (80/20). How can I convert my pandas dataframes into the required Dataset type? I tried tailoring the data_collator class definition to a pandas df, but that predictably didn't work either. I'm assuming the train and eval datasets both call the data_collator class when you call .train() on the trainer?
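
A note on the last question above: in a plain PyTorch setup the dataset does not call the collator itself. The DataLoader pulls individual samples from the dataset and passes a list of them to the collate function, and the Trainer presumably wires data_collator in the same way. A minimal sketch with a bare DataLoader; ToyDataset, pad_collate, and the padding value -100 are assumptions made for illustration, not taken from the question.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyDataset(Dataset):
    """Hypothetical dataset returning one dict per sample."""

    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


def pad_collate(samples):
    # `samples` is a list of whatever __getitem__ returned, one per index.
    max_len = max(len(s["labels"]) for s in samples)
    # Right-pad every label sequence to the longest one in the batch.
    padded = [s["labels"] + [-100] * (max_len - len(s["labels"])) for s in samples]
    return {"labels": torch.tensor(padded)}


rows = [{"labels": [23, 4, 6]}, {"labels": [23, 4]}]
loader = DataLoader(ToyDataset(rows), batch_size=2, collate_fn=pad_collate)
batch = next(iter(loader))
```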

EDIT: I tried using Dataset.from_pandas(ds_train), but it couldn't convert the dataframe because I have columns holding two-dimensional arrays, and it can apparently only convert one-dimensional array values.
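
One possible workaround for the conversion error, assuming the two-dimensional cells are NumPy arrays: turn each cell into a nested Python list before calling Dataset.from_pandas. This is a sketch and has not been checked against the actual dataframe; arrays_to_lists and to_hf_dataset are hypothetical helpers, and the latter needs the datasets package installed.

```python
import numpy as np
import pandas as pd


def arrays_to_lists(df, columns):
    """Return a copy of df where array cells in `columns` become nested lists."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].apply(lambda cell: np.asarray(cell).tolist())
    return out


def to_hf_dataset(df, columns):
    # Hypothetical helper; imported lazily so the rest runs without `datasets`.
    from datasets import Dataset
    return Dataset.from_pandas(arrays_to_lists(df, columns))


# Toy frame with one 2-D array per cell, standing in for the real data.
speech = np.empty(1, dtype=object)
speech[0] = np.zeros((1, 4))
toy = pd.DataFrame({"path": ["audio1.mp3"], "speech": speech})

converted = arrays_to_lists(toy, ["speech"])
```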

It depends on how you will use your labels column. I don't know how your Trainer uses this data, but I suggest defining your own Dataset class ( https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#creating-a-custom-dataset-for-your-files ):

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, dataframe):
        # Keep each column as a Series for positional lookup.
        self.path = dataframe["path"]
        self.sentence = dataframe["sentence"]
        self.speech = dataframe["speech"]
        self.input_values = dataframe["input_values"]
        self.labels = dataframe["labels"]

    def __len__(self):
        # One sample per dataframe row.
        return len(self.path)

    def __getitem__(self, idx):
        # .iloc is positional, so idx runs from 0 to len(self) - 1.
        path = self.path.iloc[idx]
        sentence = self.sentence.iloc[idx]
        speech = self.speech.iloc[idx]
        input_values = self.input_values.iloc[idx]
        labels = self.labels.iloc[idx]
        return path, sentence, speech, input_values, labels
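
If the dataset is meant for the Trainer from the question, the collator will probably expect each sample as a dict rather than a tuple. A sketch under that assumption, keeping only input_values and labels; the class name DataFrameDataset and the default column selection are hypothetical, and the collator in use determines which keys are actually required.

```python
import pandas as pd
from torch.utils.data import Dataset


class DataFrameDataset(Dataset):
    """Hypothetical wrapper exposing dataframe rows as dicts."""

    def __init__(self, dataframe, columns=("input_values", "labels")):
        self.frame = dataframe
        self.columns = list(columns)

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        # .iloc is positional, so a non-contiguous index left over from a
        # train/test split does not matter here.
        row = self.frame.iloc[idx]
        return {col: row[col] for col in self.columns}


# Toy data standing in for the real dataframe.
df = pd.DataFrame({
    "path": ["audio1.mp3", "audio2.mp3"],
    "input_values": [[0.00005, 0.0003], [0.000044, 0.00033]],
    "labels": [[23, 4, 6], [23, 4, 11]],
})
ds_train = DataFrameDataset(df)
sample = ds_train[1]
```

Objects built this way could then be passed as train_dataset and eval_dataset, provided the collator accepts dicts with these keys.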
