
How can I safely parameterize table/column names in BigQuery SQL?

I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country, etc.). Keeping them up to date requires deleting and replacing data for past days, because the day tables for Firebase events can change after they are created (see here and here). I keep them up to date this way to avoid querying the entire dataset, which is very expensive financially and computationally.

This deletion-and-replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in a table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I run this against. If I want to run tests, I would also have to duplicate the aforementioned queries for test tables.
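For illustration, here is roughly what one of these queries looks like with the Python client (the project/dataset/table names below are made up). The date can be passed safely as a query parameter, but the table name has to be pasted into the SQL string, which is exactly the part I cannot parameterize:

import datetime

from google.cloud import bigquery

client = bigquery.Client()

# The table name cannot be a query parameter, so it ends up in the SQL string.
table = "my-project.my_dataset.daily_event_counts"  # made-up name

sql = f"DELETE FROM `{table}` WHERE event_date >= @cutoff"

# The date, by contrast, CAN be passed as a real query parameter.
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("cutoff", "DATE", datetime.date(2019, 11, 21)),
    ]
)
client.query(sql, job_config=job_config).result()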

I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose that module by calling the as_string() method and using the result to query BigQuery. But the resulting syntax doesn't match, and I would need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text, to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or find a workaround using the Python client library. Any ideas on how I should go about doing this safely, without opening myself up to SQLi? I can't go into detail, but unfortunately I cannot store the tables in Postgres or any other database.

As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.

If you still need or want to validate your input parameters before building your query, I recommend using regular expressions to check the input strings. In Python you can use the re library.

Since I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, I created the basic example below showing how you could check a string with this library:

import re

# Candidate source strings: the second one is an injection attempt.
tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>, where both parts
# may contain only alphanumeric characters and dashes.
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")

In this example, only strings that match the pattern dataset.table (where both the dataset and the table contain only alphanumeric characters and dashes) are considered valid. You can adjust the character classes to match your own naming conventions (for example, to allow underscores).

When running the code, the first and third elements of the list are considered valid, while the second (which could change your whole query) is considered invalid.
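Putting it together, you could gate every query build on this check and still pass values (like the date) as real query parameters. This is just a sketch: the table name, helper function, and pattern are hypothetical, and you should adapt them to your own naming rules.

import re

from google.cloud import bigquery

# Allow-list pattern: <dataset>.<table>, alphanumerics and dashes only.
SOURCE_RE = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

def delete_from_date(client, source, cutoff_date):
    # Reject anything that doesn't look like a plain dataset.table reference.
    if not SOURCE_RE.fullmatch(source):
        raise ValueError(f"Invalid source: {source!r}")
    # The validated identifier is interpolated; the value stays a parameter.
    sql = f"DELETE FROM `{source}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date)]
    )
    client.query(sql, job_config=job_config).result()

# Example call (hypothetical table):
# delete_from_date(bigquery.Client(), "your-dataset.your-table", "2019-11-21")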
