I'm planning a db-driven website that matches users based on how they answer questions. I'm thinking the best approach is to run the match calculations in the SELECT query, but I have no idea how to write the query.
Let's say I have a table called user_answer that looks like this:
+--------+-------------+--------+------------------+--------+
| userid | question_id | answer | preferred_answer | weight |
+--------+-------------+--------+------------------+--------+
|      1 |          20 |      3 |                  |      0 |
|      1 |          24 |      3 | 2, 3             |      1 |
|      1 |          36 |      2 | 2                |     10 |
|      1 |          37 |      3 | 1, 2, 3          |     50 |
|      1 |          40 |      3 | 3                |    250 |
|      2 |          20 |      3 | 3                |     10 |
|      2 |          24 |      3 | 2                |      1 |
|      2 |          25 |      2 |                  |      0 |
|      2 |          26 |      2 |                  |      0 |
|      2 |          40 |      3 | 2                |    250 |
+--------+-------------+--------+------------------+--------+
I want to select and order by match_percentage - match_percentage should be calculated this way:
I don't know if this is possible. I'm expecting the DB to grow to be very large, so loading them all and doing the calculations in PHP may not be the best choice - but correct me if I'm wrong.
Is it possible to make all these calculations in a query?
Yes, I believe all the specified calculations can be performed in a query.
Assuming that (userid, question_id) is UNIQUE, we start by finding the other userid values with "matching" questions, i.e. questions that userid 1 has also answered. We could do that with a query like this:
SELECT u.answer
     , u.preferred_answer
     , u.weight
     , m.userid           AS m_userid
     , m.question_id      AS m_question_id
     , m.answer           AS m_answer
     , m.preferred_answer AS m_preferred_answer
     , m.weight           AS m_weight
  FROM user_answer u
  JOIN user_answer m
    ON m.question_id = u.question_id
   AND m.userid <> u.userid
   AND u.userid = 1
 ORDER
    BY m.userid
     , m.question_id
Once we have that working, we can work on getting the total weights and the calculations from those.
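To make the self-join easier to picture, here is a rough Python sketch of what it pairs up, using the sample rows from the question. The names ROWS and shared_rows are made up for illustration, and the preferred_answer lists are written without spaces; it only mimics the join, it is not how MySQL executes it.

```python
# (userid, question_id, answer, preferred_answer, weight), copied from the sample table
ROWS = [
    (1, 20, "3", "", 0),
    (1, 24, "3", "2,3", 1),
    (1, 36, "2", "2", 10),
    (1, 37, "3", "1,2,3", 50),
    (1, 40, "3", "3", 250),
    (2, 20, "3", "3", 10),
    (2, 24, "3", "2", 1),
    (2, 25, "2", "", 0),
    (2, 26, "2", "", 0),
    (2, 40, "3", "2", 250),
]


def shared_rows(rows, current_user):
    """Pair each row of another user with current_user's row for the same question."""
    # Relies on (userid, question_id) being unique, as the query does.
    mine = {r[1]: r for r in rows if r[0] == current_user}
    pairs = [(mine[r[1]], r) for r in rows
             if r[0] != current_user and r[1] in mine]
    # Same ordering as the query: other userid, then question_id.
    pairs.sort(key=lambda pair: (pair[1][0], pair[1][1]))
    return pairs


for u_row, m_row in shared_rows(ROWS, 1):
    print(u_row, m_row)
```

For userid 1 this pairs up questions 20, 24 and 40 with userid 2; questions 36 and 37 drop out because userid 2 never answered them.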
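Assuming the preferred_answer column is VARCHAR type and contains a comma-separated list of elements with no spaces, e.g. '2' or '2,3,5', you could use the MySQL FIND_IN_SET function to return the index position of a particular element within the list. It returns 0 if a "match" is not found. (The sample data in the question shows a space after each comma, e.g. '2, 3'; those spaces would have to be removed, because FIND_IN_SET would compare against ' 3' rather than '3'.)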
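As a rough illustration of that behaviour, here is a small Python stand-in for FIND_IN_SET. The function find_in_set is hypothetical and only approximates MySQL (it ignores NULL handling and collation); it is meant to show why the "no spaces" assumption matters.

```python
def find_in_set(needle: str, csv_list: str) -> int:
    """Approximate MySQL FIND_IN_SET: 1-based position of needle, or 0 if absent."""
    if csv_list == "":
        return 0  # an empty list never matches
    for position, item in enumerate(csv_list.split(","), start=1):
        if item == needle:
            return position
    return 0


print(find_in_set("3", "2,3,5"))  # 2
print(find_in_set("3", "2, 3"))   # 0, because the element is " 3", not "3"
print(find_in_set("3", ""))       # 0
```

Since any non-zero position counts as true, the position itself does not matter for the matching query; only zero versus non-zero does.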
I believe this query meets the specification.
SELECT m.userid AS m_userid
     , SUM(u.weight) AS total_weight1
     , SUM(IF(FIND_IN_SET(m.answer,u.preferred_answer),u.weight,0)) AS match1_weight
     , SUM(m.weight) AS total_weight2
     , SUM(IF(FIND_IN_SET(u.answer,m.preferred_answer),m.weight,0)) AS match2_weight
     , SQRT(
         ( SUM(IF(FIND_IN_SET(m.answer,u.preferred_answer),u.weight,0)) / SUM(u.weight) )
       * ( SUM(IF(FIND_IN_SET(u.answer,m.preferred_answer),m.weight,0)) / SUM(m.weight) )
       ) AS match_percentage
  FROM user_answer u
  JOIN user_answer m
    ON m.question_id = u.question_id
   AND m.userid <> u.userid
   AND u.userid = 1
 GROUP
    BY m.userid
 ORDER
    BY match_percentage DESC
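Since the query is desk checked only, a plain Python re-implementation of the same arithmetic can serve as a cross-check on the sample data. This is a sketch under assumptions: ROWS and match_scores are illustrative names, the preferred_answer lists are written without spaces, and a zero total weight is mapped to None on the assumption that MySQL's default behaviour of returning NULL for division by zero applies.

```python
import math
from collections import defaultdict

# (userid, question_id, answer, preferred_answer, weight), copied from the sample table
ROWS = [
    (1, 20, "3", "", 0),
    (1, 24, "3", "2,3", 1),
    (1, 36, "2", "2", 10),
    (1, 37, "3", "1,2,3", 50),
    (1, 40, "3", "3", 250),
    (2, 20, "3", "3", 10),
    (2, 24, "3", "2", 1),
    (2, 25, "2", "", 0),
    (2, 26, "2", "", 0),
    (2, 40, "3", "2", 250),
]


def match_scores(rows, current_user):
    """Return {other_userid: match_percentage} using the same sums as the query."""
    mine = {q: (a, p, w) for (u, q, a, p, w) in rows if u == current_user}
    # Per other user: [total_weight1, match1_weight, total_weight2, match2_weight]
    sums = defaultdict(lambda: [0, 0, 0, 0])
    for (u, q, a, p, w) in rows:
        if u == current_user or q not in mine:
            continue  # only questions both users answered take part
        my_answer, my_pref, my_weight = mine[q]
        s = sums[u]
        s[0] += my_weight
        if a in my_pref.split(","):   # their answer is one I prefer
            s[1] += my_weight
        s[2] += w
        if my_answer in p.split(","):  # my answer is one they prefer
            s[3] += w
    scores = {}
    for other, (t1, m1, t2, m2) in sums.items():
        if t1 == 0 or t2 == 0:
            scores[other] = None  # assumed NULL in MySQL
        else:
            scores[other] = math.sqrt((m1 / t1) * (m2 / t2))
    return scores


print(match_scores(ROWS, 1))
```

With the sample rows, userid 1 versus userid 2 gives total_weight1 = 251, match1_weight = 251, total_weight2 = 261 and match2_weight = 10, so the result is sqrt(10/261), about 0.196. Note the value is a fraction between 0 and 1; multiply by 100 if an actual percentage is wanted.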
NOTE:
These queries are desk checked only. I didn't set up a SQL Fiddle to test.
Item 4 appears to be the total of the current user's weight, but only including matching answers. If there are no matching answers, we're going to return 0. The same goes for item 6, just the inverse.
If there are no matching questions between userid 1 and some other userid, then no row will be returned for the other userid.
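For a large set, this could potentially crank for a while. Suitable covering indexes should improve performance, for example one leading on userid and one leading on question_id, each also including the answer, preferred_answer and weight columns.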
For improved query performance, you may want to consider "caching" the result of this query in a separate table. The contents of the "cache" table would only need to be refreshed when a row in the original table is inserted, updated, or deleted. And the previously calculated results might still be "good enough" for normal access.
If you stored the results, you'd also want to return u.userid as a column in the SELECT list and include it in the GROUP BY.
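The caching idea can be sketched in a few lines of Python. Everything here is hypothetical: MatchCache stands in for the "cache" table, slow_match_query stands in for running the query above, and the returned score is a placeholder rather than a real result. The sketch simply discards all stored results whenever any answer row changes, which is the most conservative refresh rule.

```python
class MatchCache:
    """In-memory stand-in for a "cache" table keyed by the current userid."""

    def __init__(self, compute):
        self._compute = compute  # callable: userid -> {other_userid: score}
        self._results = {}

    def get(self, userid):
        # Run the expensive calculation only when no stored result exists.
        if userid not in self._results:
            self._results[userid] = self._compute(userid)
        return self._results[userid]

    def answers_changed(self):
        # Call after any INSERT, UPDATE or DELETE on user_answer.
        self._results.clear()


calls = []


def slow_match_query(userid):
    calls.append(userid)  # record each time the "query" actually runs
    return {2: 0.5}       # placeholder score, not a real result


cache = MatchCache(slow_match_query)
cache.get(1)             # runs the query
cache.get(1)             # served from the cache
cache.answers_changed()  # some answer row changed
cache.get(1)             # runs the query again
print(len(calls))        # 2
```

If slightly stale results are "good enough", answers_changed could be called on a schedule instead of on every write.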