Count how many times a word is being used per day

Question

I have a MySQL table named "content" containing (among others) the fields "_date" and "text", for example:

_date      text
---------------------------------------------------------
2011-02-18 I'm afraid my car won't start tomorrow
2011-02-18 I hope I'm going to pass my exams
2011-02-18 Exams coming up - I'm not afraid :P
2011-02-19 Not a single f was given this day
2011-02-20 I still hope I passed, but I'm afraid I didn't
2011-02-20 On my way to school :)

I'm looking for a query to count the number of times the words "hope" and "afraid" are being used per day. In other words, the output would have to be something like:

_date      word   count
-----------------------
2011-02-18 hope   1
2011-02-18 afraid 2
2011-02-19 hope   0
2011-02-19 afraid 0
2011-02-20 hope   1
2011-02-20 afraid 1

Is there an easy way to do this, or should I just write a different query per term? I now have this, but I don't know what to put instead of "?":

SELECT COUNT(?) FROM content WHERE text LIKE '%hope' GROUP BY _date

Can somebody help me with the correct query for this?

Answer 1

I think the easiest and most readable way is to write one query per word and combine them with UNION:

SELECT
    _date, 'hope' AS word,
    -- add 1 for every row of the day that contains the word
    SUM(CASE WHEN `text` LIKE '%hope%' THEN 1 ELSE 0 END) AS n
FROM content
GROUP BY _date
UNION
SELECT
    _date, 'afraid' AS word,
    SUM(CASE WHEN `text` LIKE '%afraid%' THEN 1 ELSE 0 END) AS n
FROM content
GROUP BY _date

This approach does not perform well. If performance matters, you should first group by day in a subquery; the LIKE condition with a leading wildcard is also a performance killer, because it cannot use an index. It is an acceptable solution if you only run the query occasionally in batch mode. Describe your performance requirements if you need a more accurate answer.

Edited to match the OP's latest requirement.

Answer 2

Your query is almost correct:

SELECT _date, 'hope' AS word, COUNT(*) as count
FROM content WHERE text LIKE '%hope%' GROUP BY _date

Use %hope% to match the word anywhere in the text (not only at the end of the string). COUNT(*) should do what you want.

To get multiple words from a single query, combine one such query per word with UNION ALL.

Another approach is to create a sequence of words on the fly and use it as the second table in a join:

SELECT _date, words.word, COUNT(*) as count
FROM (
   SELECT 'hope' AS word
   UNION
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word

Note that this counts at most one occurrence of each word per row, so »I hope there is still hope« will only give you 1, not 2.
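If every occurrence should count, one common trick is to compare the length of the text before and after removing the word. The sketch below tries that idea with Python's built-in sqlite3 module on the question's sample rows plus the "still hope" sentence. SQLite stands in for MySQL here, and the helper name count_occurrences_per_day and the extra 2011-02-21 row are made up for illustration. The LENGTH/REPLACE expression should carry over to MySQL largely unchanged; REPLACE is case-sensitive, which is why the text is lowercased first.

```python
import sqlite3

ROWS = [
    ("2011-02-18", "I'm afraid my car won't start tomorrow"),
    ("2011-02-18", "I hope I'm going to pass my exams"),
    ("2011-02-18", "Exams coming up - I'm not afraid :P"),
    ("2011-02-19", "Not a single f was given this day"),
    ("2011-02-20", "I still hope I passed, but I'm afraid I didn't"),
    ("2011-02-20", "On my way to school :)"),
    ("2011-02-21", "I hope there is still hope"),  # extra row, not in the question
]

# Occurrences in one row = characters removed by REPLACE / length of the word.
QUERY = """
SELECT _date,
       SUM((LENGTH(LOWER(text)) - LENGTH(REPLACE(LOWER(text), ?, ''))) / LENGTH(?)) AS n
FROM content
GROUP BY _date
ORDER BY _date
"""


def count_occurrences_per_day(word):
    """Return (date, total occurrences of `word`) for every date in ROWS."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
    conn.executemany("INSERT INTO content VALUES (?, ?)", ROWS)
    rows = conn.execute(QUERY, (word, word)).fetchall()
    conn.close()
    return rows
```

Calling count_occurrences_per_day("hope") reports 2 for the extra row's date. Like the LIKE version, this is plain substring matching, so "hopeful" would also be counted as "hope".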

To get 0 when there are no matches, left join the previous result onto every combination of date and word:

SELECT dates._date, words.word, COALESCE(result.count, 0) AS count
FROM (SELECT DISTINCT _date FROM content) AS dates
CROSS JOIN (
   SELECT 'hope' AS word
   UNION
   SELECT 'afraid' AS word
) AS words
LEFT JOIN (
   SELECT _date, words.word, COUNT(*) AS count
   FROM (
      SELECT 'hope' AS word
      UNION
      SELECT 'afraid' AS word
   ) AS words
   CROSS JOIN content
   WHERE text LIKE CONCAT('%', words.word, '%')
   GROUP BY _date, words.word
) AS result
ON dates._date = result._date AND words.word = result.word
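A more compact way to get a row for every date and word, zeros included, is to build the date/word grid first and left join the matching rows onto it. The following is a minimal sketch of that variant, run with Python's built-in sqlite3 module against the sample rows from the question. SQLite stands in for MySQL, the function name zero_filled_counts is hypothetical, and the `||` concatenation would be CONCAT(...) in MySQL.

```python
import sqlite3

ROWS = [
    ("2011-02-18", "I'm afraid my car won't start tomorrow"),
    ("2011-02-18", "I hope I'm going to pass my exams"),
    ("2011-02-18", "Exams coming up - I'm not afraid :P"),
    ("2011-02-19", "Not a single f was given this day"),
    ("2011-02-20", "I still hope I passed, but I'm afraid I didn't"),
    ("2011-02-20", "On my way to school :)"),
]

# Every date x every word, then LEFT JOIN the rows that contain the word.
# COUNT(c.text) skips the NULLs of unmatched pairs, so those pairs yield 0.
QUERY = """
SELECT d._date, w.word, COUNT(c.text) AS n
FROM (SELECT DISTINCT _date FROM content) AS d
CROSS JOIN (SELECT 'hope' AS word UNION ALL SELECT 'afraid') AS w
LEFT JOIN content AS c
       ON c._date = d._date
      AND c.text LIKE '%' || w.word || '%'   -- MySQL: CONCAT('%', w.word, '%')
GROUP BY d._date, w.word
ORDER BY d._date, w.word DESC
"""


def zero_filled_counts():
    """Return (date, word, count) for every date/word pair, including zeros."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE content (_date TEXT, text TEXT)")
    conn.executemany("INSERT INTO content VALUES (?, ?)", ROWS)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

The design choice here is to count a column of the left-joined table rather than COUNT(*): COUNT(*) would report 1 for a date/word pair with no match, because the outer join still produces one row for it.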

Answer 3

Assuming you want to count all words and find the most used ones (rather than counting a few specific words), you might want to try something like the following stored procedure (string splitting compliments of this blog post):

DROP PROCEDURE IF EXISTS wordsUsed;
DELIMITER //
CREATE PROCEDURE wordsUsed ()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS wordTmp;
    CREATE TEMPORARY TABLE wordTmp (word VARCHAR(255));

    SET @wordCt  = 0;
    SET @tokenCt = 1;

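    contentLoop: LOOP
        -- Insert the @tokenCt-th space-separated token of every row that has one.
        -- The WHERE clause tests the first @tokenCt - 1 tokens, so the last word
        -- of each row is included as well.
        SET @stmt = 'INSERT INTO wordTmp SELECT REPLACE(SUBSTRING(SUBSTRING_INDEX(`text`, " ", ?),
                                LENGTH(SUBSTRING_INDEX(`text`, " ", ? - 1)) + 1),
                                " ", "") word
                     FROM content
                     WHERE LENGTH(SUBSTRING_INDEX(`text`, " ", ? - 1)) != LENGTH(`text`)';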
        PREPARE cmd FROM @stmt;
        EXECUTE cmd USING @tokenCt, @tokenCt, @tokenCt;
        SELECT ROW_COUNT() INTO @wordCt;
        DEALLOCATE PREPARE cmd;
        IF (@wordCt = 0) THEN
            LEAVE contentLoop;
        ELSE
            SET @tokenCt = @tokenCt + 1;
        END IF;
    END LOOP;

    SELECT word, count(*) usageCount FROM wordTmp GROUP BY word ORDER BY usageCount DESC;
END //
DELIMITER ;

CALL wordsUsed();

You might want to write another query (or procedure) or add some nested "REPLACE" statements to further remove punctuation from the resulting temp table of words, but this should be a good start.
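As an alternative to nested REPLACE calls, the tokenizing and punctuation stripping can also be done in application code after fetching the rows. The following is only a sketch of that swapped-in approach, not a port of the stored procedure: the function name word_usage and the regular expression are illustrative choices, and the texts are passed in as a plain list instead of being read from MySQL.

```python
import re
from collections import Counter

# A word is a run of letters/digits, optionally joined by in-word apostrophes,
# so "I'm" and "won't" stay whole while ":P", "-" and "!" are dropped or trimmed.
WORD_RE = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)*")


def word_usage(texts):
    """Count lowercase words across all texts, ignoring punctuation."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts
```

For example, word_usage(texts).most_common(10) would list the ten most used words, where texts is whatever list of strings was selected from the "text" column.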
