I have a Hive table of customer accounts (test data is given below).
Each customer has several accounts, and the objective is to build intra-customer account pairs. Two accounts form a pair if they have the same year of birth or if the first 3 characters of their names are the same, e.g. Sam and Samuel.
A same-account pair such as AA or XX should not be created. The pairs AC and CA are the same, so only one of them is needed. A pair can match on both name and year of birth, but in that case only one entry is required as well (either key will do).
How should I approach this problem? Test data to check against:
create table customer_account(
customer INT NOT NULL,
accounts VARCHAR(100) NOT NULL,
name VARCHAR(40) NOT NULL,
yob INT
);
INSERT INTO
customer_account(customer,accounts,name,yob)
VALUES
(1,"A","John",2001),
(1,"X","Tom",1996),
(1,"C","Harry",2001),
(2,"D","Sam",1994),
(2,"F","Samuel",1995),
(3,"Z","Jake",1994),
(3,"G","Drake",1998),
(3,"H","Arnold",1993),
(3,"K","Yang",1990)
;
You should be able to use substrings in the join condition in Hive. The logic should be sound, though you may need to tune it a bit for your needs.
What you're trying to do is a unary (or self) join. Below is an example of the type of query you can use. You're essentially joining on an OR condition and then testing which part of that condition matched with a CASE expression to get the "Pair_Key". I used an inner join, assuming you want only the rows where a match occurs.
SELECT
    t1.customer AS Customer1,
    t2.customer AS Customer2,
    t1.accounts AS Accounts1,
    t2.accounts AS Accounts2,
    CONCAT(t1.accounts, t2.accounts) AS Pair_No,
    t1.name AS Name1,
    t2.name AS Name2,
    t1.yob AS YOB1,
    t2.yob AS YOB2,
    -- Report which key produced the match; YOB wins if both match
    CASE
        WHEN t1.yob = t2.yob THEN 'YOB'
        -- SUBSTR(str, 1, 3) returns the first 3 characters;
        -- SUBSTR(str, 3) would return everything from the 3rd character on
        WHEN SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3) THEN 'Name'
        ELSE 'Issue'
    END AS Pair_Key
FROM customer_account t1
INNER JOIN customer_account t2 -- instance 2 of the same table
    ON (SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3) OR t1.yob = t2.yob)
Without test data or more details of how far along you are, this is a start.
If the customer number needs to be the same, simply adjust the join condition to:
    ON (t1.customer = t2.customer) AND (SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3) OR t1.yob = t2.yob)
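Putting the pieces together for the test table in the question, here is a sketch of the full query. It assumes the table is named customer_account, that yob holds the year as an integer, and that account IDs can be compared as strings. The `t1.accounts < t2.accounts` condition is an addition that is not in the query above: it drops self-pairs such as AA and keeps only one of AC / CA.

```sql
-- Sketch only: one row per unordered account pair within a customer.
SELECT
    t1.customer,
    t1.accounts                      AS accounts1,
    t2.accounts                      AS accounts2,
    CONCAT(t1.accounts, t2.accounts) AS pair_no,
    CASE
        WHEN t1.yob = t2.yob THEN 'YOB'  -- YOB is reported when both keys match
        ELSE 'Name'                      -- otherwise the name prefix matched
    END                              AS pair_key
FROM customer_account t1
INNER JOIN customer_account t2
    ON  t1.customer = t2.customer        -- intra-customer pairs only
    AND t1.accounts < t2.accounts        -- no AA, and only one of AC / CA
    AND (   t1.yob = t2.yob
         OR SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3));
```

On the test data from the question this should give two rows: customer 1 with pair AC (key YOB) and customer 2 with pair DF (key Name). Customer 3 has no matching accounts.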
This does what you describe:
select t1.*, t2.accounts as accounts2, t2.name as name2, t2.yob as yob2
from customer_account t1 join
     customer_account t2
     on t2.customer = t1.customer and
        (t2.yob = t1.yob or
         substr(t2.name, 1, 3) = substr(t1.name, 1, 3)
        ) and
        t2.accounts > t1.accounts;
There is no need to fetch customer twice. If you want "identical" pairs, then change the last condition to >=.
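If your Hive version rejects the OR inside the ON clause (as far as I know, releases before 2.2.0 only accept equality conditions there), the same logic can be written with an equi-join on customer and the remaining conditions moved to WHERE. A sketch, assuming the customer_account table from the question:

```sql
-- Sketch only: same pairs as above, with only an equality condition in ON.
SELECT
    t1.customer,
    t1.accounts AS accounts1,
    t2.accounts AS accounts2
FROM customer_account t1
JOIN customer_account t2
    ON t1.customer = t2.customer         -- equality only in the join condition
WHERE t1.accounts < t2.accounts          -- skip self-pairs and mirrored pairs
  AND (   t1.yob = t2.yob
       OR SUBSTR(t1.name, 1, 3) = SUBSTR(t2.name, 1, 3));
```

Because this is an inner join, filtering in WHERE returns the same rows as putting the conditions in ON.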