How to create pairs from the table

I have a Hive table wherein data looks like this -

[image: enter image description here]

Each customer has several accounts, and the objective is to build intra-customer pairs. Two accounts form a pair if they have the same year of birth or if the first 3 characters of their names are the same, e.g. Sam and Samuel.

The output looks like this - [image: enter image description here]

Ideally, a pair of an account with itself, like AA or XX, should not be created. Also, the pairs AC and CA are the same, so only one entry per such pair is needed. A pair can match on both the Name key and the Year of Birth key, but in that case only one entry is required as well (either one will do).

How should I approach this problem? Test data for checking -

create table customer_account(
customer INT NOT NULL,
accounts VARCHAR(100) NOT NULL,
name VARCHAR(40) NOT NULL,
yob INT -- year of birth as a plain year, matching the inserted values
);

INSERT INTO 
customer_account(customer,accounts,name,yob)
VALUES
(1,"A","John",2001),
(1,"X","Tom",1996),
(1,"C","Harry",2001),
(2,"D","Sam",1994),
(2,"F","Samuel",1995),
(3,"Z","Jake",1994),
(3,"G","Drake",1998),
(3,"H","Arnold",1993),
(3,"K","Yang",1990)
;

You should be able to use substrings in your join condition in Hive. The logic should be sound, though you may need to tune it a bit for your needs.
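As a quick illustration of the substring call, here is a minimal sketch of how SUBSTR behaves in Hive. The literal names are only examples taken from the question; the point is that the two-argument form returns everything from the given position onward, while the three-argument form returns a fixed-length prefix, which is what the "first 3 characters" rule needs.

```sql
-- Hive's SUBSTR is 1-based: SUBSTR(str, start[, length]).
SELECT
    SUBSTR('Samuel', 1, 3) AS first_three,  -- 'Sam'  (first 3 characters)
    SUBSTR('Sam', 1, 3)    AS short_name,   -- 'Sam'  (same prefix, so Sam and Samuel match)
    SUBSTR('Samuel', 3)    AS from_third;   -- 'muel' (position 3 to the end)
```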

What you're trying to do is a unary (or self) join. Below is an example of a type of query that can be passed. You're essentially joining with an OR condition and testing that condition with a case statement to get the "Pair_Key". I used an inner join assuming you want only instances where matches occur.

SELECT 
     t1.customer as Customer1,
     t2.customer as Customer2,
     t1.Accounts as Accounts1,
     t2.Accounts as Accounts2,
     CONCAT(t1.Accounts, t2.Accounts) as Pair_No,
     t1.Name as Name1,
     t2.Name as Name2,
     t1.YOB as YOB1,
     t2.YOB as YOB2,
     CASE
     WHEN t1.YOB = t2.YOB THEN 'YOB'
     -- SUBSTR(str, 1, 3) takes the first 3 characters; SUBSTR(str, 3) would start at position 3
     WHEN SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) THEN 'Name'
     else 'Issue'
     END as Pair_Key
FROM (SELECT * FROM customer_account) as t1
inner join (SELECT * FROM customer_account) as t2 --instance 2 of the same table
on (SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) OR t1.YOB = t2.YOB)

Without test data or more details of how far along you are, this is a start.

If the customer number needs to be the same, simply adjust the join condition to:

on (t1.Customer = t2.Customer) and (SUBSTR(t1.Name, 1, 3) = SUBSTR(t2.Name, 1, 3) OR t1.YOB = t2.YOB)

This does what you describe:

select t1.*, t2.name, t2.yob
from customer_account t1 join
     customer_account t2
     on t2.customer = t1.customer and
        (t2.yob = t1.yob or
         substr(t2.name, 1, 3) = substr(t1.name, 1, 3)
        ) and
        t2.accounts > t1.accounts; -- drops self-pairs and keeps only one of AC / CA

There is no need to fetch customer twice. If you want "identical" pairs as well, change the last condition to >=.
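For reference, here is a minimal sketch that applies the ideas from both answers to the customer_account test table from the question. It is an illustration under assumptions, not a definitive solution: it assumes yob holds a plain year, the output column names (accounts1, accounts2, pair_no, pair_key) are made up for this example, and labelling a pair 'YOB' when both keys match is an arbitrary choice, since the question allows either. The non-equality predicates sit in WHERE rather than ON, which gives the same result for an inner join and avoids relying on a Hive version that accepts them in the join condition.

```sql
-- Sketch only: assumes the customer_account test table from the question.
SELECT
    t1.customer,
    t1.accounts                      AS accounts1,
    t2.accounts                      AS accounts2,
    CONCAT(t1.accounts, t2.accounts) AS pair_no,
    CASE
        WHEN t1.yob = t2.yob THEN 'YOB'   -- YOB label wins when both keys match
        ELSE 'Name'                       -- otherwise the name prefix matched
    END                              AS pair_key
FROM customer_account t1
JOIN customer_account t2
  ON t2.customer = t1.customer            -- intra-customer pairs only
WHERE t1.accounts < t2.accounts           -- no AA, and AC is kept but CA is not
  AND (t1.yob = t2.yob
       OR SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3));
```

On the test data this should return two pairs: AC for customer 1 (John and Harry share 2001) and DF for customer 2 (Sam and Samuel share the prefix Sam). Customer 3 has no matching accounts.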
