How to use CSV data to train a Hugging Face model?
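I want to fine-tune a pretrained DistilBERT model (a transformer model based on the BERT architecture) that is available on Hugging Face. I did some data cleaning and preprocessing to produce CSV data and uploaded it to an S3 bucket.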
The code below is a train.py file based on the example provided here (https://github.com/aws-samples/finetune-deploy-bert-with-amazon-sagemaker-for-hugging-face).
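I have several CSV files that I want to use for training and testing. The script appears to load the data with the line below. Given that the CSV files are in an S3 location, how do I change it to read and use them?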
train_dataset = load_from_disk(args.training_dir)
"""
Training script for Hugging Face SageMaker Estimator
"""
import logging
import sys
import argparse
import os
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers import Trainer, TrainingArguments
from datasets import load_from_disk
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--eval_batch_size", type=int, default=64)
    parser.add_argument("--warmup_steps", type=int, default=500)
    parser.add_argument("--model_name", type=str)
    parser.add_argument("--tokenizer_name", type=str)
    parser.add_argument("--learning_rate", type=str, default=5e-5)

    # Data, model, and output directories
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--n_gpus", type=str, default=os.environ["SM_NUM_GPUS"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

    # load datasets
    train_dataset = load_from_disk(args.training_dir)
    test_dataset = load_from_disk(args.test_dir)

    # download model and tokenizer from model hub
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name)

    # define training args
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        per_device_eval_batch_size=args.eval_batch_size,
        warmup_steps=args.warmup_steps,
        evaluation_strategy="epoch",
        logging_dir=f"{args.output_data_dir}/logs",
        learning_rate=float(args.learning_rate),
    )

    # create Trainer instance
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )

    # train model
    trainer.train()
...
...
You can pass a remote S3 URL to the load_from_disk function. The relevant parameter is dataset_path, described below.
dataset_path (str): path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory from which the dataset will be loaded. Reference: https://huggingface.co/docs/datasets/v2.8.0/en/package_reference/main_classes#datasets.Dataset.load_from_disk
from datasets import load_from_disk
# load encoded_dataset from cloud storage
dataset = load_from_disk("s3://a-public-datasets/imdb/train", storage_options=storage_options)
print(len(dataset))
25000
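One caveat: load_from_disk reads a directory previously written by Dataset.save_to_disk, not raw CSV files. For CSV input, a possible alternative is the "csv" loader of datasets.load_dataset. The sketch below assumes SageMaker's default File input mode, in which the contents of each S3 channel are copied into the container and the SM_CHANNEL_TRAIN / SM_CHANNEL_TEST directories point at those local copies. The helper names find_csv_files and load_csv_channel are hypothetical and are not part of the original train.py.

```python
import glob
import os


def find_csv_files(channel_dir):
    """Return the CSV files found directly inside channel_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(channel_dir, "*.csv")))


def load_csv_channel(channel_dir):
    """Load every CSV file in channel_dir into a single datasets.Dataset."""
    # Imported lazily so find_csv_files can be used without `datasets` installed.
    from datasets import load_dataset

    csv_files = find_csv_files(channel_dir)
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {channel_dir}")
    # Given a list of files and split="train", load_dataset returns one Dataset
    # containing the rows of all files; "train" is only the default split name.
    return load_dataset("csv", data_files=csv_files, split="train")


# Hypothetical usage inside train.py, in place of the load_from_disk calls:
# train_dataset = load_csv_channel(args.training_dir)
# test_dataset = load_csv_channel(args.test_dir)
```

If the CSV files hold raw text rather than tokenized features, the resulting datasets would still need to be tokenized (for example with Dataset.map and the tokenizer) before being passed to Trainer.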
To pass the S3 session details, see the documentation: https://huggingface.co/docs/datasets/filesystems#amazon-s3
storage_options = {"anon": True} # for anonymous connection
# or use your credentials
storage_options = {"key": aws_access_key_id, "secret": aws_secret_access_key} # for private buckets
# or use a botocore session
import botocore.session
s3_session = botocore.session.Session(profile="my_profile_name")
storage_options = {"session": s3_session}
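The options above can be wrapped in a small helper so that the choice is made in one place. This is only a sketch: make_storage_options is a hypothetical name, and the empty-dict case relies on s3fs falling back to the default AWS credential chain (environment variables, config files, or an attached IAM role such as a SageMaker execution role) when no explicit credentials are given.

```python
def make_storage_options(anonymous=False, key=None, secret=None):
    """Build a storage_options dict for s3fs-backed access (hypothetical helper)."""
    if anonymous:
        # Public buckets: unsigned requests, no credentials needed.
        return {"anon": True}
    if key or secret:
        if not (key and secret):
            raise ValueError("key and secret must be provided together")
        # Private buckets with explicit credentials.
        return {"key": key, "secret": secret}
    # No explicit credentials: leave it to the default AWS credential chain.
    return {}


# Example: anonymous access to a public bucket.
storage_options = make_storage_options(anonymous=True)
```

Inside a SageMaker training job the execution role usually already grants access to the bucket, so the empty-dict case may be all that is needed there.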