
Reading from BigQuery into a Pandas DataFrame and performance issues

I am reading around 4 million rows from BigQuery into a DataFrame, and it no longer seems to work. Since I cannot pinpoint what changed, is there anything I can change in the code to make it more performant?

My code is the following:

from google.cloud import bigquery

client = bigquery.Client()

def get_df_categories(table_name):
    # Pull all rows of the three columns and materialize them
    # as a single DataFrame in one shot.
    query = """
    select cat, ref, engine from `{table_name}`
    """.format(table_name=table_name)
    df = client.query(query).to_dataframe()
    return df

It is better to read the table via the list_rows method in batches. That way you read a fixed number of rows per request (and can process the resulting batches in multiple threads), so you see output much sooner and can handle heavy data loads in a systematic manner. You can also pass which fields you want in the output via selected_fields; this plays the same role as the column list in the select clause of your SQL query. The client documentation will help you get started: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
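A minimal sketch of that approach, assuming the three columns are all STRING (adjust the SchemaField types to match your actual schema) and a hypothetical page size of 100,000 rows:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

def get_df_categories(table_name):
    # selected_fields restricts the read to these columns, like the
    # column list in a SELECT clause; types must match the table schema.
    fields = [
        bigquery.SchemaField("cat", "STRING"),
        bigquery.SchemaField("ref", "STRING"),
        bigquery.SchemaField("engine", "STRING"),
    ]
    rows = client.list_rows(
        table_name,          # "project.dataset.table"
        selected_fields=fields,
        page_size=100_000,   # rows fetched per request
    )
    # to_dataframe_iterable() yields one DataFrame per page, so memory
    # stays bounded and the first batch is available almost immediately.
    frames = list(rows.to_dataframe_iterable())
    return pd.concat(frames, ignore_index=True)

Instead of concatenating everything at the end, each page's DataFrame could also be handed off to a thread pool for downstream processing, which is where the multithreading mentioned above comes in.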
