
Making sense out of PyArrow

I am browsing the tutorials and the documentation of PyArrow. I see some redundancies. For example, when reading a Parquet dataset (or folder) I could use either of the following:

import pyarrow.dataset
import pyarrow.parquet

type1 = pyarrow.parquet.ParquetDataset("Pqfolder/", use_legacy_dataset=False)
# or
type2 = pyarrow.dataset.dataset("Pqfolder/", format="parquet")

What are pyarrow.parquet and pyarrow.dataset? Are they modules of the pyarrow package? Where do I find the docs? It looks like pyarrow.dataset is explained in https://arrow.apache.org/docs/python/api/dataset.html and pyarrow.parquet in https://arrow.apache.org/docs/python/parquet.html. So I wonder why it is not pyarrow.api.dataset ...

From what I understand, pyarrow.dataset also lets you filter the data with the scanner method, whereas with pyarrow.parquet I can only filter at read time by passing filters; after that I can only read without filtering. Filtering is also richer with pyarrow.dataset thanks to expressions. So what is the point of having pyarrow.parquet if it can only do a subset of what pyarrow.dataset does (using a different notation)?

The issue is that I worked all this out by guessing and by trial and error. Is this the standard way to learn a new library, or did I miss some docs? I think I am missing some basics in software design, so I would appreciate a pointer to a reference on this.

I'm not sure where pyarrow.api.dataset would come from; the /api/ part of the docs path only separates the API reference documentation from the higher-level user documentation, and it is not part of the module name. So the page you actually want is https://arrow.apache.org/docs/python/dataset.html.

The Arrow project is working on improving documentation. pyarrow.parquet precedes pyarrow.dataset by quite a bit, and is being reworked to delegate to pyarrow.dataset internally. (You could think of pyarrow.dataset as generalizing pyarrow.parquet.ParquetDataset to non-Parquet files, and potentially to things that aren't files at all.) pyarrow.parquet also has the 'lower level' functions to just read a Parquet file, much like pyarrow.csv does for CSV.


 