I am using a PyArrow dataset to query a Parquet file in GCP; the code is straightforward:
import pyarrow.dataset as ds
import duckdb
import json

# Register the Parquet datasets; DuckDB can scan them by variable name.
lineitem = ds.dataset("gs://duckddelta/lineitem", format="parquet")
lineitem_partition = ds.dataset("gs://duckddelta/delta2", format="parquet", partitioning="hive")
con = duckdb.connect()

def Query(request):
    SQL = request.get_json().get('name')
    df = con.execute(SQL).df()
    # to_json() already returns a JSON string, so don't wrap it in json.dumps()
    # (that would double-encode the payload).
    return df.to_json(orient="records"), 200, {'Content-Type': 'application/json'}
Then I call that function with a SQL query:
SQL = '''
SELECT
l_returnflag,
l_linestatus,
SUM(l_quantity) AS sum_qty,
SUM(l_extendedprice) AS sum_base_price,
SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
AVG(l_quantity) AS avg_qty,
AVG(l_extendedprice) AS avg_price,
AVG(l_discount) AS avg_disc,
COUNT(*) AS count_order
FROM
lineitem
GROUP BY 1,2
ORDER BY 1,2 ;
'''
I know that local SSD storage is faster, but I am seeing a massive difference. The query takes 4 seconds when the file is saved on my laptop, 54 seconds when run from a Google Cloud Function in the same region, and 3 minutes when I run it in Colab. It seems to me there is a bottleneck somewhere in Google Cloud Functions; I was expecting better performance.
Edit for more context: the file is 1.2 GB, the region is us-central1 (Iowa), and it is a gen 2 Cloud Function with 8 GB of memory and 8 CPUs.
Sorry, it turns out to be a possible bug in DuckDB: apparently DuckDB is not multi-threaded in this particular scenario. See https://github.com/duckdb/duckdb/issues/4525