Power BI Athena Incremental Refresh

I have been successfully using Power BI's incremental refresh daily with a MySQL data source. However, I can't get this configured with AWS Athena, because the latter seemingly interprets the values of the required parameters RangeStart and RangeEnd as strings. Since the data source is around 50 million rows, I'd rather avoid querying it from scratch every day.

In this video from Guy in a Cube, you can clearly see that the query sent by Power BI to Azure has a convert to datetime2 function. Something like this is presumably missing for Athena/Presto, which needs the type constructor TIMESTAMP in order to do datetime comparisons (https://stackoverflow.com/a/38041684/3675679), and of course incremental refresh must be based on datetime fields. I am using the datetime field adv_date for the incremental load.

Here is what the M query looks like in Power Query Editor:

= Table.SelectRows(#"Removed Columns1", each [adv_date] >= RangeStart and [adv_date] < RangeEnd) 

And here is the resulting error message in Athena:

Your query has the following errors:SYNTAX_ERROR: line 1:1: Incorrect number of parameters: expected 2 but found 0 

This is how Athena interprets the query:

    select "col1", "col2", "adv_date" 
    from "AwsDataCatalog"."test"."test_table" 
    where "adv_date" >= ? and "adv_date" < ?

I have contacted Power BI support without success. Does anyone have a workaround for this by any chance? Happy to provide more info if needed.

So I have an answer of sorts: I don't believe that it is currently possible to set up Athena as an incremental source in Power BI using a standard connection.

However, it is possible to do this via a dataflow, with the caveat that for our environment it was not particularly fast. It does work, though.

Someone from Microsoft directed me to use Odbc.Query rather than Odbc.DataSource. Here is an example from the URL he sent:

let
Source = Odbc.Query("dsn=Google BigQuery", "SELECT line_of_business, category_group FROM masterdata.item_d WHERE line_of_business in ('" & LOB & "')")
in
Source

I have tried this and it worked; maybe you can use it too.
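
Applied to the table from the question, the same idea might look like the sketch below. It assumes a DSN named "Simba Athena" (a placeholder; use whatever name your ODBC driver is registered under), that RangeStart and RangeEnd are the DateTime parameters required by incremental refresh, and that adv_date is a timestamp column. The bounds are written into the SQL text as TIMESTAMP literals, so Athena never sees the `?` placeholders from the error above.

```powerquery
let
    // Format the incremental-refresh parameters as text Athena can parse.
    // A fixed culture keeps the date and time separators predictable.
    StartText = DateTime.ToText(RangeStart, "yyyy-MM-dd HH:mm:ss", "en-US"),
    EndText = DateTime.ToText(RangeEnd, "yyyy-MM-dd HH:mm:ss", "en-US"),

    // Half-open interval [RangeStart, RangeEnd), matching the original filter step.
    Sql =
        "SELECT col1, col2, adv_date "
        & "FROM ""AwsDataCatalog"".""test"".""test_table"" "
        & "WHERE adv_date >= TIMESTAMP '" & StartText & "' "
        & "AND adv_date < TIMESTAMP '" & EndText & "'",

    // "dsn=Simba Athena" is an assumed DSN name, not taken from the question.
    Source = Odbc.Query("dsn=Simba Athena", Sql)
in
    Source
```

Because the date range is already part of the SQL, Athena does the filtering and only the matching rows come back to Power BI.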

Direct query also works for me, but I eventually just moved the filters to a view inside Athena. PBI can't be trusted to handle stuff like this, sadly.

Anyway, there is a (sort of) workaround for M queries, in case someone else needs it: I found out that if you add certain steps before the filter, Power BI will not attempt any query folding, and therefore won't mess up the SQL it sends to Athena. In my case, I added a duplicated column and renamed it. PBI will, of course, still load all the data, because of course it will, but it will discard whatever falls outside the filter once the query finishes fetching data. This way at least we can save space in the file, even if loading time stays the same.
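
As a rough sketch of that workaround, the query could be shaped like this. The DSN name and the navigation steps are assumptions based on what the ODBC connector typically generates, and the copied column name is purely illustrative. Whether these particular steps stop folding may depend on the connector version, so treat it as a starting point rather than a guarantee.

```powerquery
let
    // Assumed DSN name and navigation path; adjust to your own catalog, schema and table.
    Source = Odbc.DataSource("dsn=Simba Athena"),
    Catalog = Source{[Name = "AwsDataCatalog", Kind = "Database"]}[Data],
    Schema = Catalog{[Name = "test", Kind = "Schema"]}[Data],
    TestTable = Schema{[Name = "test_table", Kind = "Table"]}[Data],

    // Extra steps placed before the filter: duplicate a column, then rename the copy.
    Duplicated = Table.DuplicateColumn(TestTable, "adv_date", "adv_date - Copy"),
    Renamed = Table.RenameColumns(Duplicated, {{"adv_date - Copy", "adv_date_copy"}}),

    // The incremental-refresh filter, intended to run in Power BI instead of being folded into SQL.
    Filtered = Table.SelectRows(Renamed, each [adv_date] >= RangeStart and [adv_date] < RangeEnd)
in
    Filtered
```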

Sorry if I sound frustrated in this answer. The reason is that I am incredibly frustrated with Power BI.

I think you are trying to fix the Filtered Rows step, but you might be able to achieve incremental load by fixing Step 1, Source (running an actual direct query against Athena).

Pasting my answer on this from another question thread:

I think I have managed to achieve the "Incremental Load" in Power BI using Athena. This (still) does not allow you to view the native query, but you can still make Power BI manipulate the direct query to implement it.

To avoid a full scan of S3 data in Athena, you have to enable partitions in your dataset. Without going off topic: once you partition the S3 data via Athena, you can pinpoint the data by day/month/year without scanning your whole dataset.

Once you do that, you can achieve the incremental load by running direct queries as mentioned in this video (20:00 onwards) and achieve resource-efficient query execution.

The final query will look something like this:

Odbc.Query("dsn=Simba Athena",
    "SELECT * FROM tablename
    WHERE year >= " & DateTime.ToText(RangeStart, "yyyy") & "
    AND month >= " & DateTime.ToText(RangeStart, "MM") & "
    AND day >= " & DateTime.ToText(RangeStart, "dd") & "
    AND year <= " & DateTime.ToText(RangeEnd, "yyyy") & "
    AND month <= " & DateTime.ToText(RangeEnd, "MM") & "
    AND day <= " & DateTime.ToText(RangeEnd, "dd") & "
    ")
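
One caveat with the query above: it compares year, month and day independently, so a range that crosses a month boundary (say 15 January to 10 February) ends up asking for day >= 15 AND day <= 10 and matches nothing. A possible variation is sketched below. It assumes that year, month and day are integer partition columns, and that RangeStart and RangeEnd fall on midnight, so an exclusive upper bound at day granularity covers the range without overlapping the next one. It is also worth checking the "data scanned" figure in Athena to confirm that partitions are still pruned when the predicate uses arithmetic on the partition columns.

```powerquery
let
    // Collapse each bound into one sortable integer, e.g. 2021-01-31 becomes 20210131.
    StartKey = DateTime.ToText(RangeStart, "yyyyMMdd", "en-US"),
    EndKey = DateTime.ToText(RangeEnd, "yyyyMMdd", "en-US"),

    // Assumes year, month and day are integer partition columns on tablename.
    Sql =
        "SELECT * FROM tablename "
        & "WHERE year * 10000 + month * 100 + day >= " & StartKey & " "
        & "AND year * 10000 + month * 100 + day < " & EndKey,

    // Same DSN as in the query above.
    Source = Odbc.Query("dsn=Simba Athena", Sql)
in
    Source
```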

