I'm looking to train a model on ~100,000 text files. Pandas was running into memory issues, so I decided to move to Dask.
I'm trying to read the files into a Dask DataFrame that already stores the file paths. With a pandas DataFrame I could simply do the following:
ddf['rawtext'] = [open(file, 'rt').read() for file in ddf['filepath']]
But on a Dask DataFrame this raises a NotImplementedError.
Is there a way to efficiently read text files into Dask?
What you can do in pandas you can do in Dask using map or map_partitions:
def read_them(df):
    # df here is a single pandas partition, not the full Dask DataFrame
    df['rawtext'] = [open(file, 'rt').read() for file in df['filepath']]
    return df
ddf2 = ddf.map_partitions(read_them)
OR
ddf2 = ddf.assign(
    rawtext=ddf.filepath.map(lambda x: open(x, 'rt').read())
)
The first option takes more characters, but it feels simpler and more closely matches your original code. Whatever (row-wise) processing you want to do to your text next, you can still do in the same function.
In pandas you just need to supply the path to a single file and it will handle the I/O operations for you; there is no need to open each file and collect the contents into a list.
Dask, like many other big-data frameworks, can accept a directory of files and read them in one go.
From the docs:
!ls data/*.csv | head
data/2000-01-01.csv
data/2000-01-02.csv
data/2000-01-03.csv
data/2000-01-04.csv
data/2000-01-05.csv
data/2000-01-06.csv
data/2000-01-07.csv
data/2000-01-08.csv
data/2000-01-09.csv
data/2000-01-10.csv
dd.read_csv('data/2000-*-*.csv')
In your case I would assume it to be
dd.read_csv('data/*.txt')