How do I do an inner join on Presto/AWS Athena?

I am trying to execute a query that selects all rows falling within ranges defined by the start and end columns of another table. For example, in pseudo-code, if I had these (very small) tables:

ranges:
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)

positions:
    position = 100 random numbers
    annotation = 100 random strings

I would like to write a query that returns something like:

group_id  position  annotation
a         2         adfkjdas
a         3         sdlfkjasl;kdfj
b         9         sdlfkdj
c         5         wwlekrj
d         27        zxcvzx

In MariaDB/MySQL, a BETWEEN condition is evaluated row-wise against the ranges table, so this works:

SELECT
  ranges.group_id AS group_id,
  positions.position AS position,
  positions.annotation AS annotation
FROM
  (SELECT * FROM my_ranges) AS ranges, positions
WHERE
  positions.position BETWEEN ranges.start AND ranges.end

That is, the query acts as if the WHERE clause were a series of BETWEEN conditions joined by "OR", one for each row of the ranges table (e.g. BETWEEN 1 AND 5 OR BETWEEN 7 AND 23 OR BETWEEN 2 AND 7 OR BETWEEN 25 AND 29).
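To make the row-wise behaviour described above concrete, here is a minimal sketch that checks the equivalence. It uses Python's built-in sqlite3 module purely as a stand-in engine (an assumption for illustration; it is not Presto or Athena), the column names start_value and end_value to avoid reserved words, the range values from the SQL version of the tables, and one extra hypothetical row (40, 'outside') that should match nothing.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in engine for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ranges (group_id TEXT, start_value INTEGER, end_value INTEGER);
    CREATE TABLE positions (position INTEGER, annotation TEXT);
    INSERT INTO ranges VALUES ('a', 2, 5), ('b', 7, 23), ('c', 2, 7), ('d', 25, 29);
    INSERT INTO positions VALUES
        (2, 'adfkjdas'), (3, 'sdlfkjasl;kdfj'), (5, 'wwlekrj'),
        (9, 'sdlfkdj'), (27, 'zxcvzx'),
        (40, 'outside');  -- hypothetical row, not in the original data
""")

# Range join: each position is compared against every row of ranges.
joined = conn.execute("""
    SELECT DISTINCT p.position
    FROM ranges AS r, positions AS p
    WHERE p.position BETWEEN r.start_value AND r.end_value
    ORDER BY p.position
""").fetchall()

# The same filter written out as one BETWEEN per range, joined by OR.
ranges = conn.execute("SELECT start_value, end_value FROM ranges").fetchall()
or_clause = " OR ".join("position BETWEEN ? AND ?" for _ in ranges)
params = [bound for pair in ranges for bound in pair]
expanded = conn.execute(
    f"SELECT position FROM positions WHERE {or_clause} ORDER BY position",
    params,
).fetchall()

# True when both queries select exactly the same positions.
same_positions = joined == expanded
print(same_positions, joined)
```

One difference is worth keeping in mind: the join returns one row per matching (range, position) pair, so a position covered by two overlapping ranges appears twice with different group_id values, whereas the OR form only filters positions. The DISTINCT above removes that duplication for the comparison.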

It seems that the BETWEEN operator behaves differently in Presto, because the same query does not return any results there.

In reality, my ranges table has ~20,000 ranges that I'd like to query, so chaining them together by hand with OR conditions seems prohibitive...

Can anyone here suggest a way to modify this query (or my general approach!) so that it works with Presto?

(Added in response to a comment:) In actual SQL rather than pseudo-code, I'd like to use tables like this:
CREATE TABLE IF NOT EXISTS `ranges` (
  `group_id` char,
  `start` int(3),
  `end` int(3)
);

INSERT INTO `ranges` (`group_id`, `start`, `end`) VALUES
  ('a', '2', '5'),
  ('b', '7', '23'),
  ('c', '2', '7'),
  ('d', '25', '29');

CREATE TABLE IF NOT EXISTS `positions` (
  `position` int(3),
  `annotation` varchar(20)
);

INSERT INTO `positions` (`position`, `annotation`) VALUES
  ('2', 'adfkjdas'),
  ('3', 'sdlfkjasl;kdfj'),
  ('5', 'wwlekrj'),
  ('9', 'sdlfkdj'),
  ('27', 'zxcvzx');

And run queries like this:

SELECT
  group_id,
  position,
  annotation
FROM
  ranges, positions
WHERE
  positions.position BETWEEN ranges.start AND ranges.end

The following worked for me. I had to work around the fact that end is a reserved word:

CREATE EXTERNAL TABLE IF NOT EXISTS ranges ( 
  group_id string,
  start_value int,
  end_value int
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/ranges/';

CREATE EXTERNAL TABLE IF NOT EXISTS positions ( 
  position int,
  annotation string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://my-bucket/positions/';

SELECT
  group_id,
  position,
  annotation
FROM
  ranges, positions
WHERE
  positions.position BETWEEN ranges.start_value AND ranges.end_value;

The ranges and positions directories contained CSV files:

a,2,5
b,7,23
c,2,7
d,25,29

and

2,adfkjdas
3,sdlfkjaslkdfj
5,wwlekrj
9,sdlfkdj
27,zxcvzx
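The comma join with a WHERE clause used in the query above can also be spelled as an explicit INNER JOIN ... ON, which is the form the question's title asks about. The sketch below again runs on Python's built-in sqlite3 module as a stand-in (an assumption; it has not been run on Athena). It keeps the original column names start and end by double-quoting them. As far as I know, Presto also uses double quotes for identifiers in SELECT statements, so the same query text may work there, but treat that as something to verify rather than a confirmed result.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in engine for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "start" and "end" are double-quoted so the reserved word END is
    -- treated as a column name.
    CREATE TABLE ranges (group_id TEXT, "start" INTEGER, "end" INTEGER);
    CREATE TABLE positions (position INTEGER, annotation TEXT);
    INSERT INTO ranges VALUES ('a', 2, 5), ('b', 7, 23), ('c', 2, 7), ('d', 25, 29);
    INSERT INTO positions VALUES
        (2, 'adfkjdas'), (3, 'sdlfkjasl;kdfj'), (5, 'wwlekrj'),
        (9, 'sdlfkdj'), (27, 'zxcvzx');
""")

# Explicit inner join with the range condition in the ON clause.
rows = conn.execute("""
    SELECT r.group_id, p.position, p.annotation
    FROM positions AS p
    INNER JOIN ranges AS r
        ON p.position BETWEEN r."start" AND r."end"
    ORDER BY r.group_id, p.position
""").fetchall()

for row in rows:
    print(row)
```

Because ranges a (2 to 5) and c (2 to 7) overlap, positions 2, 3 and 5 are each returned once per matching group.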
