I tried to run this code:
from tqdm.auto import tqdm
import os
from datasets import load_dataset

dataset = load_dataset('oscar', 'unshuffled_deduplicated_ar', split='train[:25%]')
text_data = []
file_count = 0
for sample in tqdm(dataset['train']):
    sample = sample['text'].replace('\n', ' ')
    text_data.append(sample)
    if len(text_data) == 10_000:
        # once we hit the 10K mark, save to file
        filename = f'/data/text/oscar_ar/text_{file_count}.txt'
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w', encoding='utf-8') as fp:
            fp.write('\n'.join(text_data))
        text_data = []
        file_count += 1
# after saving in 10K chunks, we will have ~2082 leftover samples, we save those now too
with open(f'data/text/oscar_ar/text_{file_count}.txt', 'w', encoding='utf-8') as fp:
    fp.write('\n'.join(text_data))
and I get the following PermissionError:
I've tried changing the permissions on this directory and running Jupyter with sudo privileges, but it still doesn't work.
You are opening:
with open(f'data/text/oscar_ar/text_{file_count}.txt')
But you are writing:
filename = f'/Dane/text/oscar_ar/text_{file_count}.txt'
And your screenshot says:
filename = f'/date/text/oscar_ar/text_{file_count}.txt'
You have to choose between data, /date and /Dane :)
Also, it seems you should remove the leading / in /data/text/oscar_ar/text_{file_count}.txt.
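Explanation: When you put a slash ( / ) at the beginning of a path, the path is absolute: it is resolved from the root of the filesystem, the top level. Without the leading slash, the path is relative and is resolved from your current working directory. Creating directories directly under the root typically requires elevated permissions, which is a likely source of the PermissionError.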
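To illustrate the difference, here is a minimal sketch and not the asker's exact code. It assumes the relative directory data/text/oscar_ar is the intended target. The names OUTPUT_DIR, chunk_path and save_chunk are hypothetical helpers introduced only for this example. Building every file name from one base directory means the write inside the loop and the final write cannot point at different locations.

```python
from pathlib import Path, PurePosixPath

# A leading slash makes a POSIX path absolute (resolved from the filesystem root).
print(PurePosixPath("/data/text/oscar_ar").is_absolute())  # True
# Without it, the path is relative to the current working directory.
print(PurePosixPath("data/text/oscar_ar").is_absolute())   # False

# Assumption: the relative directory is the one that was intended.
OUTPUT_DIR = Path("data/text/oscar_ar")


def chunk_path(file_count, base_dir=OUTPUT_DIR):
    """Return the path of the chunk file with the given index (hypothetical helper)."""
    return Path(base_dir) / f"text_{file_count}.txt"


def save_chunk(lines, file_count, base_dir=OUTPUT_DIR):
    """Write one chunk, creating the target directory first if it is missing."""
    path = chunk_path(file_count, base_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

In the loop from the question, both the 10K-chunk write and the final leftover write could then call save_chunk(text_data, file_count), so only one directory is ever involved.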