Select different dataset when testing | Separate test from production
This question is partly about how to test external dependencies (i.e. integration tests) and partly about how to implement that in Python for SQL, with BigQuery specifically. So answers that only cover 'This is how you should do integration tests' are very welcome.
In my project I have two different datasets:
'project_1.production.table_1'
'project_1.development.table_1'
When running my tests I would like to call the development environment. But how do I separate it properly from my production code? I don't want to clutter my production code with test (set-up) code.
Production code looks like:
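from google.cloud import bigquery
from pandas import DataFrame

# Method of a class that holds a BigQuery client in self.client
def find_data(self, variable_x: str) -> DataFrame:
    query = '''
        SELECT *
        FROM `project_1.production.table_1`
        WHERE foo = @variable_x
    '''
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            # The parameter name must match the @variable_x placeholder in the query
            bigquery.ScalarQueryParameter(
                name='variable_x', type_="STRING", value=variable_x
            )
        ]
    )
    df = self.client.query(
        query=query, job_config=job_config).to_dataframe()
    return df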
The python-dotenv module can be used to differentiate production from development, as I do for some parts of my code. The problem is that BigQuery does not allow the dataset to be parameterized (to prevent SQL injection, I think). See the "running parameterized queries" docs.
From the docs:
Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
So passing the environment variable as the dataset name through a query parameter is not possible.
I could add an "if production == True" check and select the dataset accordingly. However, this puts test/debug code in my production code, which I would like to avoid as much as possible.
from os import getenv
from dotenv import load_dotenv

def find_data(self, variable_x: str) -> DataFrame:
    load_dotenv()
    # getenv returns a string (or None), so compare against text rather than a boolean
    production = getenv("PRODUCTION") == "True"
    if production:
        ...  # execute query on project_1.production.table_1
    else:
        ...  # execute query on project_1.development.table_1
    job_config = ...  # snipped
    df = ...  # snipped
    return df
Make a copy of the production code and set up the test code so that the development dataset is called.
This leads to duplicated code (one copy in production code and one in test code). The two copies will drift apart if the implementation of the function changes over time, so I think this solution is not 'Embracing Change'.
Perhaps this function does not need to be called at all in my test code. I could take a snippet of the result of this query and inject it as data into the tests that depend on this result. However, I would then need to adjust my architecture a bit.
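This third idea could be sketched roughly as follows. It is only an illustration under assumptions: count_matches is a hypothetical piece of downstream code, finder stands for whatever object exposes find_data, and a plain list of dicts stands in for the DataFrame so the example runs with the standard library alone.

```python
from unittest.mock import Mock


def count_matches(finder, variable_x):
    """Hypothetical downstream code that depends on the result of find_data."""
    rows = finder.find_data(variable_x)
    return len(rows)


# Canned snippet of a query result; a list of dicts stands in for the DataFrame.
CANNED_ROWS = [{"foo": "a", "bar": 1}, {"foo": "a", "bar": 2}]


def test_count_matches():
    # The fake finder never touches BigQuery; it just returns the canned rows.
    fake_finder = Mock()
    fake_finder.find_data.return_value = CANNED_ROWS

    assert count_matches(fake_finder, "a") == 2
    fake_finder.find_data.assert_called_once_with("a")
```

With this shape the test never selects a dataset at all, but it also no longer exercises the SQL itself, so it complements an integration test rather than replacing one.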
The above solutions don't satisfy me completely. I wonder if there is another way to solve this issue, or if one of the above solutions is acceptable?
It looks like string formatting (sometimes referred to as string interpolation) might be enough to get you where you want. You could replace the first part of your function with the following code:
query = '''
SELECT *
FROM `{table}`
WHERE foo = @variable_x
'''.format(table = getenv("DATA_TABLE"))
This works because the query is just a string, and you can do whatever you want with it before you pass it on to the BigQuery library. str.format allows us to replace values inside a string, which is exactly what we need (see this article for a more in-depth explanation of str.format).
Important security note: manipulating SQL queries as plain strings (as we are doing here) is in general a bad security practice, but since you control the environment variables of the application, it should be safe in this particular case.
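If you want a bit more protection than trusting the environment, one option is to check the table name before it is formatted into the query. The following is a minimal sketch, not a definitive implementation: the helper name build_query, the regular expression for a project.dataset.table name, and the fallback to the development table are all assumptions made for illustration, not part of the BigQuery library.

```python
import re
from os import getenv

# Assumed shape of a fully qualified table name: project.dataset.table
_TABLE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_]+\.[A-Za-z0-9_]+")

QUERY_TEMPLATE = '''
SELECT *
FROM `{table}`
WHERE foo = @variable_x
'''


def build_query(table=None):
    """Return the query text for the given table (hypothetical helper)."""
    if table is None:
        # Assumption: fall back to the development table when DATA_TABLE is unset.
        table = getenv("DATA_TABLE", "project_1.development.table_1")
    # Reject anything that does not look like a plain table identifier.
    if not _TABLE_NAME.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return QUERY_TEMPLATE.format(table=table)
```

The value for @variable_x still goes through a query parameter as before; only the table identifier is formatted into the string.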