简体   繁体   中英

Select distinct records and at the same time first occurance of the record from MySQL

I am not familiar enough with MySQL execution plan so I need help to understand and find out how to operate on a subsets of data in MySQL if it is possible. I have two tables:

Table users:

+-----------------+-------------+------+-----+---------+----------------+
| Field           | Type        | Null | Key | Default | Extra          |
+-----------------+-------------+------+-----+---------+----------------+
| user_id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| msisdn          | bigint(20)  | NO   | UNI | NULL    |                |
| activation_date | datetime    | NO   |     | NULL    |                |
| msisdn_type     | varchar(32) | NO   |     | NULL    |                |
+-----------------+-------------+------+-----+---------+----------------+

Table log_archive:

+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| msisdn      | bigint(11)   | NO   | MUL | NULL    |       |
| msisdn_type | varchar(32)  | NO   |     | NULL    |       |
| date        | date         | NO   |     | NULL    |       |
| action      | varchar(32)  | NO   |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+ 

In table users msisdn is unique but in log_archive it is not.

Here you can find PHP script that will generate you test data for these two tables:

Test data generation script helper

I need to select:

1) All distinct records by msisdn from table log_archive;
2) By earliest date per msisdn for one specific action only;
3) For a specific date range from table log_archive;
4) And to join activation_date from users table with msisdn from both tables.

Let me give you example. Let's say that this is sample data from log_archive table:

+--------------+------------+---------------------+----------------+
|    msisdn    |    date    |   activation_date   |     action     |
|--------------+------------+---------------------+----------------+
| 977129764170 | 2016-02-11 | 2014-10-07 00:00:00 | all_services   |
| 977129764170 | 2015-09-05 | 2014-10-07 00:00:00 | app_start      |
| 977129764170 | 2015-05-08 | 2014-10-07 00:00:00 | widget         |
| 986629508626 | 2015-07-12 | 2016-02-05 00:00:00 | app_start      |
| 986629508626 | 2015-03-02 | 2016-02-05 00:00:00 | number_connect |
| 986629508626 | 2015-05-08 | 2016-02-05 00:00:00 | widget         |
| 986629508626 | 2015-01-08 | 2016-02-05 00:00:00 | app_start      |
| 933563888440 | 2016-02-20 | 2014-10-06 00:00:00 | all_services   |
| 933563888440 | 2015-03-12 | 2014-10-06 00:00:00 | app_start      |
| 933563888440 | 2015-04-26 | 2014-10-06 00:00:00 | number_connect |
| 933563888440 | 2015-10-17 | 2014-10-06 00:00:00 | all_services   |
| 943730853721 | 2015-06-19 | 2015-05-01 00:00:00 | widget         |
| 943730853721 | 2015-12-08 | 2015-05-01 00:00:00 | app_start      |
| 943730853721 | 2016-02-09 | 2015-05-01 00:00:00 | app_start      |
+--------------+------------+---------------------+----------------+

So distinct msisdns here are 977129764170, 986629508626, 933563888440, 943730853721;

Earliest date for distinct msisdn value where action column equals 'app_start' is:

977129764170 is 2015-09-05
986629508626 is 2015-01-08
933563888440 is 2015-03-12
943730853721 is 2015-06-19

I need to make such SQL that will give me this output:

+--------------+------------+---------------------+----------------+
|    msisdn    |    date    |   activation_date   |     action     |
|--------------+------------+---------------------+----------------+
| 977129764170 | 2015-09-05 | 2014-10-07 00:00:00 | app_start      |
| 986629508626 | 2015-01-08 | 2016-02-05 00:00:00 | app_start      |
| 933563888440 | 2015-03-12 | 2014-10-06 00:00:00 | app_start      |
| 943730853721 | 2015-12-08 | 2015-05-01 00:00:00 | app_start      |
+--------------+------------+---------------------+----------------+

So I need to select all distinct msisdns for earliest date where app_start action occurs and join activation_date from users table by that distinct msisd. And to only look for specific date range from date column.

I tried it with this sql with no results:

SELECT DISTINCT(log_archive.msisdn) as msisdn, DATE(log_archive.date) AS actionDate, users.activation_date

FROM log_archive 

INNER JOIN users on log_archive.msisdn = users.msisdn

WHERE log_archive.action = 'app_start' && log_archive.date BETWEEN '2015-01-08' AND '2016-03-15'

ORDER BY actionDate ASC;

I get the same msisdn more than once even though I used DISTINCT.

Do I need to use subquery?

You'll want a GROUP BY to get MIN(date) for each msisdn;

SELECT msisdn, MIN(date) date, MIN(action) action 
FROM log_archive 
WHERE action='app_start' 
  AND date BETWEEN '2015-01-08' AND '2016-03-15' 
GROUP BY msisdn

We also add a MIN(action) since we should aggregate each field that isn't grouped by and since action is the same for all selected rows, MIN works well.

Once you have that, adding the join is pretty straight forward;

SELECT a.msisdn, MIN(a.date) date, u.activation_date, MIN(a.action) action 
FROM log_archive a
JOIN users u
  ON u.msisdn = a.msisdn
WHERE a.action='app_start' 
  AND a.date BETWEEN '2015-01-08' AND '2016-03-15' 
GROUP BY a.msisdn

SDISTINCT look on all returned columns, hence distinct row of returned data. So if you want only distinct rows from log_archive use it in a subquery before joining. Like:

(SELECT DISTINCT * FROM log_archive) AS distinct_Log INNER JOIN...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM