I'm looking to train a model on ~100,000 text files. Pandas was running into memory issues, so I decided to move to Dask.
I'm trying to read the files into a Dask DataFrame that already stores the file paths. With a pandas DataFrame I could simply do the following:
ddf['rawtext'] = [open(file, 'rt').read() for file in ddf['filepath']]
But on a Dask DataFrame this raises a NotImplementedError.
Is there a way to efficiently read text files into Dask?
What you can do in pandas you can do in Dask using map or map_partitions:
def read_them(df):
    # df here is a single pandas partition, not the full Dask DataFrame
    df['rawtext'] = [open(file, 'rt').read() for file in df['filepath']]
    return df
ddf2 = ddf.map_partitions(read_them)
OR
ddf2 = ddf.assign(
    rawtext=ddf.filepath.map(lambda x: open(x, 'rt').read())
)
The first option takes more characters, but it feels simpler and more closely matches your original code. Whatever (row-wise) processing you want to do to your text next, you can still do in the same function.
In pandas you just need to supply the path to a single file and it will handle the I/O operations for you; there is no need to open each file and collect the contents into a list.
Dask, like many other big-data frameworks, can accept a directory of files and read them in one go.
From the docs:
!ls data/*.csv | head
data/2000-01-01.csv
data/2000-01-02.csv
data/2000-01-03.csv
data/2000-01-04.csv
data/2000-01-05.csv
data/2000-01-06.csv
data/2000-01-07.csv
data/2000-01-08.csv
data/2000-01-09.csv
data/2000-01-10.csv
dd.read_csv('data/2000-*-*.csv')
In your case I would assume it to be
dd.read_csv('data/*.txt')