
Reading from BigQuery into a Pandas DataFrame and performance issues

I am reading around 4 million rows from BigQuery into a DataFrame, and it no longer seems to work. Since I cannot pinpoint what changed, is there anything I can change in the code to make it more performant?

My code is the following:

from google.cloud import bigquery

client = bigquery.Client()

def get_df_categories(table_name):
    # Pull all rows of the three columns and materialize them
    # as a single DataFrame in one shot.
    query = """
    select cat, ref, engine from `{table_name}`
    """.format(table_name=table_name)
    df = client.query(query).to_dataframe()
    return df

It is better to read the table via the list_rows method in batches. That way you read a fixed number of rows per request (and can process the resulting batches in multiple threads), so you see output much sooner and can handle heavy data loads in a systematic manner. You can also pass which fields you want in the output via selected_fields; this plays the same role as the column list in the select clause of your SQL query. The client documentation will help you get started: https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.client.Client.html
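A minimal sketch of that approach, assuming the three columns are all STRING (adjust the SchemaField types to match your actual schema) and a hypothetical page size of 100,000 rows:

import pandas as pd
from google.cloud import bigquery

client = bigquery.Client()

def get_df_categories(table_name):
    # selected_fields restricts the read to these columns, like the
    # column list in a SELECT clause; types must match the table schema.
    fields = [
        bigquery.SchemaField("cat", "STRING"),
        bigquery.SchemaField("ref", "STRING"),
        bigquery.SchemaField("engine", "STRING"),
    ]
    rows = client.list_rows(
        table_name,          # "project.dataset.table"
        selected_fields=fields,
        page_size=100_000,   # rows fetched per request
    )
    # to_dataframe_iterable() yields one DataFrame per page, so memory
    # stays bounded and the first batch is available almost immediately.
    frames = list(rows.to_dataframe_iterable())
    return pd.concat(frames, ignore_index=True)

Instead of concatenating everything at the end, each page's DataFrame could also be handed off to a thread pool for downstream processing, which is where the multithreading mentioned above comes in.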
