Athena/Presto | Can't match ID row on self join

I'm trying to extract bigrams from a string column.

I've followed the approach here, but Athena/Presto gives me errors at the final steps.

Source code so far:

with word_list as (
    SELECT 
      transaction_id, 
      words, 
      n, 
      regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)') as f70,
      f70_remittance_info
    FROM exploration_transaction
    cross join unnest(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) with ordinality AS t (words, n)
    where cardinality((regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))) > 1
    and f70_remittance_info is not null
    limit 50 )
select wl1.f70, wl1.n, wl1.words, wl2.f70, wl2.n, wl2.words
from word_list wl1
join word_list wl2 
on wl1.transaction_id = wl2.transaction_id

The specific issue is on the very last line: when I self join on the transaction ids, the query always returns zero rows. It does work if I join only on wl1.n = wl2.n-1 (the position in the array), but that is useless if I can't constrain it to the same id.

Athena doesn't support Presto's ngrams function, so I'm left with this approach.

Any clues why this isn't working? Thanks!

This is speculation, but I note that your CTE uses limit with no order by. That means an arbitrary set of rows is returned.

Although some databases materialize CTEs, many do not. They run the code independently each time it is referenced. My guess is that the CTE is run once for wl1 and once for wl2, and the two arbitrary sets of 50 rows have no transaction ids in common.

One solution would be to add order by transaction_id in the subquery, so that both references return the same rows.
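As a rough sketch of that fix (untested against Athena, and still a guess at the cause), the CTE below orders by transaction_id and n before applying the limit. It assumes that transaction_id identifies a single row of exploration_transaction, so that (transaction_id, n) gives a total order and both references to word_list see the same 50 rows. The join condition wl1.n = wl2.n - 1 comes from the question; the aliases first_word and second_word are made up for illustration.

```sql
WITH word_list AS (
    SELECT
        transaction_id,
        words,
        n
    FROM exploration_transaction
    CROSS JOIN UNNEST(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)'))
        WITH ORDINALITY AS t (words, n)
    WHERE f70_remittance_info IS NOT NULL
      AND cardinality(regexp_extract_all(f70_remittance_info, '([a-zA-Z]+)')) > 1
    -- Assumption: (transaction_id, n) is unique, so this ordering makes
    -- LIMIT pick the same rows each time the CTE is evaluated.
    ORDER BY transaction_id, n
    LIMIT 50
)
SELECT
    wl1.transaction_id,
    wl1.n,
    wl1.words AS first_word,   -- hypothetical alias
    wl2.words AS second_word   -- hypothetical alias
FROM word_list wl1
JOIN word_list wl2
    ON wl1.transaction_id = wl2.transaction_id
   AND wl1.n = wl2.n - 1       -- pair each word with the one that follows it
ORDER BY wl1.transaction_id, wl1.n
```

If the limit is only there to keep the test run small, another option is to drop it from the CTE and put it on the outer query instead, which sidesteps the question of whether the CTE is evaluated once or twice.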
