
Querying a pyarrow dataset from a GCP bucket is extremely slow

I am using a pyarrow dataset to query a Parquet file in a GCP bucket; the code is straightforward:

import pyarrow.dataset as ds
import duckdb

# Register the Parquet data as pyarrow datasets. DuckDB can scan Python
# variables holding Arrow datasets by name via its replacement scans,
# which is how the SQL below can refer to "lineitem" directly.
lineitem = ds.dataset("gs://duckddelta/lineitem", format="parquet")
lineitem_partition = ds.dataset("gs://duckddelta/delta2", format="parquet", partitioning="hive")

con = duckdb.connect()

def Query(request):
    # Cloud Function entry point: the SQL text arrives in the JSON body under "name".
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    # df.to_json() already returns a JSON string, so it should not be
    # wrapped in json.dumps() again (that double-encodes the payload).
    return df.to_json(orient="records"), 200, {'Content-Type': 'application/json'}

Then I call that function with a SQL query:

SQL = '''
SELECT
    l_returnflag,
    l_linestatus,
    SUM(l_quantity) AS sum_qty,
    SUM(l_extendedprice) AS sum_base_price,
    SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    AVG(l_quantity) AS avg_qty,
    AVG(l_extendedprice) AS avg_price,
    AVG(l_discount) AS avg_disc,
    COUNT(*) AS count_order
FROM
    lineitem
GROUP BY 1, 2
ORDER BY 1, 2;
'''
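
For reference, the deployed function is invoked over HTTP with the SQL text in the JSON body, matching request.get_json().get('name') in the handler above. A minimal sketch, assuming the requests library and a placeholder function URL:

import requests

# Hypothetical Cloud Function URL; substitute your own project's endpoint.
url = "https://us-central1-<project>.cloudfunctions.net/Query"
resp = requests.post(url, json={"name": SQL})
print(resp.json())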

I know that local SSD storage is faster, but I am seeing a massive difference.

The query takes 4 seconds when the file is saved on my laptop, 54 seconds when run from a Google Cloud Function in the same region, and 3 minutes when I run it in Colab. It seems to me there is a bottleneck somewhere in Google Cloud Functions; I was expecting better performance.

Edit for more context: the file is 1.2 GB, the region is us-central1 (Iowa), and the Cloud Function is gen 2 with 8 GB of memory and 8 vCPUs.
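
One way to separate raw download throughput from query execution is to time a full scan of the dataset on its own. A rough sketch (note that Table.nbytes reports the decompressed in-memory Arrow size, not the 1.2 GB on-disk Parquet size):

import time
import pyarrow.dataset as ds

start = time.time()
lineitem = ds.dataset("gs://duckddelta/lineitem", format="parquet")
table = lineitem.to_table()  # forces reading all the data from the bucket
print(f"scan took {time.time() - start:.1f}s, {table.nbytes / 1e9:.2f} GB in memory")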

Sorry, it turns out this is a possible bug in DuckDB: apparently DuckDB is not multi-threaded in this particular scenario. See https://github.com/duckdb/duckdb/issues/4525
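
For anyone hitting the same thing, a minimal sketch to inspect and set DuckDB's thread count (though, per the linked issue, raising it may not help here while the scan itself is single-threaded):

import duckdb

con = duckdb.connect()
con.execute("PRAGMA threads=8")  # explicitly request 8 threads
print(con.execute("SELECT current_setting('threads')").fetchone())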
