I am browsing the tutorials and the documentation of PyArrow. I see some redundancies; for example, when reading a Parquet dataset (or folder) I could use either:
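import pyarrow.parquet  # submodules are not loaded by a bare "import pyarrow"
import pyarrow.dataset

type1 = pyarrow.parquet.ParquetDataset("Pqfolder/", use_legacy_dataset=False)
# or
type2 = pyarrow.dataset.dataset('Pqfolder/', format='parquet')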
What are pyarrow.parquet and pyarrow.dataset? Are they modules of the pyarrow package? Where do I find the docs? It looks like pyarrow.dataset is explained in https://arrow.apache.org/docs/python/api/dataset.html and pyarrow.parquet in https://arrow.apache.org/docs/python/parquet.html, so I wonder why it is not pyarrow.api.dataset...
From what I understand, the pyarrow.dataset API also lets you filter the data with the scanner method, while with pyarrow.parquet I can only filter when I read the file(s), through the filters argument; after that I can only read without filtering. Also, filtering is richer thanks to expressions. So, what's the point of having pyarrow.parquet if it can only do a subset of what pyarrow.dataset does (using a different notation)?
The issue here is that I worked all this out by guessing and by trial and error. Is this the standard way to learn a new library, or did I miss some docs? I think I am missing some basics of software design. Could anyone point me to a reference about this?
I'm not sure where pyarrow.api.dataset would come from; the api/ segment in the docs path only separates the API reference documentation from the higher-level user documentation. So you actually want https://arrow.apache.org/docs/python/dataset.html.
The Arrow project is working on improving documentation. pyarrow.parquet precedes pyarrow.dataset by quite a bit, and is being reworked to delegate to pyarrow.dataset internally. (You could think of pyarrow.dataset as generalizing pyarrow.parquet.ParquetDataset to non-Parquet files, and potentially to things that aren't files at all.) pyarrow.parquet also has the 'lower level' functions to just read a Parquet file, much like pyarrow.csv does for CSV.