I am trying to SUM amounts by category, but the data contain duplicate amounts per reference number and I only want to include one amount per reference. There are about 100K distinct reference numbers, with 4 different amounts across the board.
The data I am analyzing look like this:
reference | category | amount | status
5574682 | cat1 | 45 | active
5574682 | cat1 | 45 | inactive
5574684 | cat1 | 95 | active
5574869 | cat2 | 65 | active
5574869 | cat2 | 65 | inactive
5574870 | cat2 | 55 | active
5574870 | cat2 | 55 | inactive
5574891 | cat3 | 95 | active
5574892 | cat3 | 45 | active
5574892 | cat3 | 45 | inactive
The query below returns the correct rows, but not the summed total by category:
SELECT
a.reference,
c.category,
a.amount
FROM
table1_ref a
JOIN (
SELECT *
FROM
table_ref a
JOIN table_requests b ON a.transactionid = b.requestid
JOIN table_users c ON a.user_code = c.user_code
WHERE b.filename IN ('20190614','20190625','20190628')
) b ON a.reference = b.reference
JOIN table_users c ON a.user_code = c.user_code
WHERE
a.date BETWEEN '2019-08-01' AND '2019-08-31'
AND c.category IN ('cat1', 'cat2', 'cat3')
GROUP BY
a.reference,
c.category;
With the above code I get results looking like this:
reference | category | amount
5574682 | cat1 | 45
5574684 | cat1 | 95
5574869 | cat2 | 65
5574870 | cat2 | 55
5574891 | cat3 | 95
5574892 | cat3 | 45
My expected result is:
cat1 | 140
cat2 | 120
cat3 | 140
UPDATED:
If you need to get results like this:
reference | category | amount | status
----------|----------|--------|---------
5574682 | cat1 | 45 | active
5574682 | cat1 | 45 | inactive
5574684 | cat1 | 95 | active
5574869 | cat2 | 65 | inactive -- Lines below
5574869 | cat2 | 65 | inactive -- would be impossible to get
5574870 | cat2 | 55 | inactive -- with GROUP BY, because
5574870 | cat2 | 55 | inactive -- `reference`, `category` and `status`
5574891 | cat3 | 95 | inactive -- are the same among pairs
5574892 | cat3 | 45 | inactive -- so they would be represented as one row
5574892 | cat3 | 45 | inactive -- with total amount
Then you have to use the aggregate function SUM() and list the additional column in both the outer SELECT list and the GROUP BY clause, like this:
SELECT
a.reference,
c.category,
SUM(a.amount) as amount, -- CHANGED
SOMETABLE.status -- ADDED
FROM
table1_ref a
JOIN (
SELECT *
FROM
table_ref a
JOIN table_requests b ON a.transactionid = b.requestid
JOIN table_users c ON a.user_code = c.user_code
WHERE b.filename IN ('20190614','20190625','20190628')
) b ON a.reference = b.reference
JOIN table_users c ON a.user_code = c.user_code
WHERE
a.date BETWEEN '2019-08-01' AND '2019-08-31'
AND c.category IN ('cat1', 'cat2', 'cat3')
GROUP BY
a.reference,
c.category,
SOMETABLE.status; -- ADDED
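As a rough illustration of what grouping by `status` does, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a single flat table named `data` (hypothetical, standing in for the joined tables and for `SOMETABLE`) loaded with the rows from the table above. Rows that share `reference`, `category` and `status` collapse into one row carrying their summed amount.

```python
import sqlite3

# Rows copied from the table above: (reference, category, amount, status).
ROWS = [
    (5574682, "cat1", 45, "active"),
    (5574682, "cat1", 45, "inactive"),
    (5574684, "cat1", 95, "active"),
    (5574869, "cat2", 65, "inactive"),
    (5574869, "cat2", 65, "inactive"),
    (5574870, "cat2", 55, "inactive"),
    (5574870, "cat2", 55, "inactive"),
    (5574891, "cat3", 95, "inactive"),
    (5574892, "cat3", 45, "inactive"),
    (5574892, "cat3", 45, "inactive"),
]

def group_with_status(rows):
    # "data" is a hypothetical flat table, not one of the tables in the question.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (reference INTEGER, category TEXT, amount INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT reference, category, SUM(amount) AS amount, status
        FROM data
        GROUP BY reference, category, status
        ORDER BY reference, status
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

grouped = group_with_status(ROWS)
for row in grouped:
    print(row)
```

In this sketch the two identical `5574869 | cat2 | 65 | inactive` rows come back as a single row with amount 130, which is the collapsing behaviour the comments in the table describe.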
Since there are duplicates for each reference, you could use the MAX aggregate function to keep only one value per reference:
SELECT cat, SUM(amount) FROM
(SELECT MAX(`amount`) AS amount, `reference` AS ref, `category` AS cat
FROM data GROUP BY `reference`) AS T
GROUP BY cat
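This works because the inner query collapses each reference to a single row (the MAX of identical amounts is just that amount), and the outer query then sums those rows per category.
If the same reference number can appear under different categories, change the inner GROUP BY clause to: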
FROM data GROUP BY `reference`, `category`
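To check the approach against the sample data from the question, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes the joined result has been flattened into one table named `data` (as in the query above), and it uses the `GROUP BY reference, category` variant of the inner query. The function name `sum_by_category` is only for illustration.

```python
import sqlite3

# Sample rows from the question: (reference, category, amount, status).
ROWS = [
    (5574682, "cat1", 45, "active"),
    (5574682, "cat1", 45, "inactive"),
    (5574684, "cat1", 95, "active"),
    (5574869, "cat2", 65, "active"),
    (5574869, "cat2", 65, "inactive"),
    (5574870, "cat2", 55, "active"),
    (5574870, "cat2", 55, "inactive"),
    (5574891, "cat3", 95, "active"),
    (5574892, "cat3", 45, "active"),
    (5574892, "cat3", 45, "inactive"),
]

def sum_by_category(rows):
    # Load the rows into an in-memory table named "data" (assumed flat layout).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE data (reference INTEGER, category TEXT, amount INTEGER, status TEXT)"
    )
    conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)
    query = """
        SELECT cat, SUM(amount) AS total
        FROM (
            -- One row per (reference, category); MAX picks the single repeated amount.
            SELECT reference AS ref, category AS cat, MAX(amount) AS amount
            FROM data
            GROUP BY reference, category
        ) AS t
        GROUP BY cat
        ORDER BY cat
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result

totals = sum_by_category(ROWS)
for category, total in totals:
    print(category, total)
```

On the sample rows this yields 140 for cat1, 120 for cat2 and 140 for cat3, matching the expected result in the question. Note that it relies on every duplicate row for a reference carrying the same amount; if amounts could differ within a reference, MAX would silently pick the largest one.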