
Count how many times a word is being used per day

I have a MySQL table named "content" containing (among others) the fields "_date" and "text", for example:

_date      text
---------------------------------------------------------
2011-02-18 I'm afraid my car won't start tomorrow
2011-02-18 I hope I'm going to pass my exams
2011-02-18 Exams coming up - I'm not afraid :P
2011-02-19 Not a single f was given this day
2011-02-20 I still hope I passed, but I'm afraid I didn't
2011-02-20 On my way to school :)

I'm looking for a query to count the number of times the words "hope" and "afraid" are being used per day. In other words, the output would have to be something like:

_date      word   count
-----------------------
2011-02-18 hope   1
2011-02-18 afraid 2
2011-02-19 hope   0
2011-02-19 afraid 0
2011-02-20 hope   1
2011-02-20 afraid 1

Is there an easy way to do this, or should I just write a different query per term? I now have this, but I don't know what to put instead of "?":

SELECT COUNT(?) FROM content WHERE text LIKE '%hope' GROUP BY _date

Can somebody help me with the correct query for this?

I think the easiest and most readable way is to write one subquery per word and combine them with UNION:

 Select 
    _date, 'hope' as word, 
    sum( case when `text` like '%hope%' then 1 else 0 end) as n
 from content
 group by _date
 UNION
 Select 
    _date, 'afraid' as word, 
    sum( case when `text` like '%afraid%' then 1 else 0 end) as n
 from content
 group by _date

This approach does not have the best performance. If performance matters, you should group by day in a subquery; also, the LIKE condition with a leading wildcard is a performance killer because it cannot use an index. This solution is fine if you only run the query occasionally in batch mode. Describe your performance requirements for a more accurate solution.
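One possible reading of the advice above is to scan the table once instead of once per word. The following is only a sketch under that assumption: it returns one row per day with one column per word (the column names hope_count and afraid_count are made up for illustration), so the layout differs from the one asked for in the question.

```sql
-- Single pass over content: every row is tested against both words,
-- and the per-day totals are collected with conditional sums.
SELECT
    _date,
    SUM(CASE WHEN `text` LIKE '%hope%'   THEN 1 ELSE 0 END) AS hope_count,
    SUM(CASE WHEN `text` LIKE '%afraid%' THEN 1 ELSE 0 END) AS afraid_count
FROM content
GROUP BY _date
ORDER BY _date;
```

The leading-wildcard LIKE still forces a full scan, so this only halves the work for two words; it does not remove the underlying cost.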

EDITED TO MATCH THE OP'S LATEST REQUIREMENT

Your query is almost correct:

SELECT _date, 'hope' AS word, COUNT(*) as count
FROM content WHERE text LIKE '%hope%' GROUP BY _date

Use %hope% to match the word anywhere (not only at the end of the string). COUNT(*) should do what you want.

To get multiple words from a single query, use UNION ALL
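As a sketch of what that could look like for the two words from the question (same table and columns as above):

```sql
-- One SELECT per word, stitched together with UNION ALL.
-- UNION ALL skips the duplicate-elimination step that plain UNION performs;
-- the rows cannot collide here because the word column differs.
SELECT _date, 'hope' AS word, COUNT(*) AS count
FROM content
WHERE text LIKE '%hope%'
GROUP BY _date
UNION ALL
SELECT _date, 'afraid' AS word, COUNT(*) AS count
FROM content
WHERE text LIKE '%afraid%'
GROUP BY _date
ORDER BY _date, word;
```

Days on which a word does not occur produce no row for that word in this form.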


Another approach is to create a sequence of words on the fly and use it as the second table in a join:

SELECT _date, words.word, COUNT(*) as count
FROM (
   SELECT 'hope' AS word
   UNION
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word

Note that this only counts one occurrence of each word per row, so »I hope there is still hope« gives you 1, not 2. Because LIKE matches substrings, words such as »hopeless« are counted as well.
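If repeated occurrences within one row should count, a common workaround is to compare the length of the text before and after removing the word. This is a minimal sketch, assuming the search words are given in lowercase and that lowercasing the text is acceptable (REPLACE is case-sensitive, while LIKE usually is not under the default collation):

```sql
-- Occurrences per row = (characters removed by REPLACE) / (length of the word).
-- Example: 'i hope there is still hope' loses 8 characters for 'hope', 8 DIV 4 = 2.
SELECT _date, words.word,
       SUM((CHAR_LENGTH(LOWER(text))
            - CHAR_LENGTH(REPLACE(LOWER(text), words.word, '')))
           DIV CHAR_LENGTH(words.word)) AS count
FROM (
   SELECT 'hope' AS word
   UNION ALL
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word;
```

This is still substring matching, so a text containing »hopeless« contributes to the count for »hope«.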


To get 0 when there are no matches, join the previous result with the distinct dates again (joining with content directly would repeat each result once per row of that day):

SELECT dates._date, COALESCE(result.word, 'no match') AS word, COALESCE(result.count, 0) AS count
FROM (SELECT DISTINCT _date FROM content) AS dates
LEFT JOIN (
SELECT _date, words.word, COUNT(*) as count
FROM (
   SELECT 'hope' AS word
   UNION
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word ) AS result
ON dates._date = result._date
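The query above labels a day without hits as 'no match' rather than listing each word with a count of 0, which is what the question's expected output shows. A possible variant, sketched here with the derived-table aliases dates and words as illustrative names, builds every date/word combination first and then counts the matching rows:

```sql
-- Every (date, word) pair exists thanks to the CROSS JOIN; the LEFT JOIN
-- attaches only rows of that day whose text contains the word.
-- COUNT(content._date) ignores the NULLs of unmatched pairs, so they yield 0.
SELECT dates._date, words.word, COUNT(content._date) AS count
FROM (SELECT DISTINCT _date FROM content) AS dates
CROSS JOIN (
   SELECT 'hope' AS word
   UNION ALL
   SELECT 'afraid' AS word
) AS words
LEFT JOIN content
       ON content._date = dates._date
      AND content.text LIKE CONCAT('%', words.word, '%')
GROUP BY dates._date, words.word
ORDER BY dates._date, words.word;
```

A date only appears if content has at least one row for it; days with no rows at all would need a separate calendar table.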

Assuming you want to count all words and find the most used ones (rather than counting a few specific words), you might want to try something like the following stored procedure (string splitting courtesy of this blog post):

DROP PROCEDURE IF EXISTS wordsUsed;
DELIMITER //
CREATE PROCEDURE wordsUsed ()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS wordTmp;
    CREATE TEMPORARY TABLE wordTmp (word VARCHAR(255));

    SET @wordCt  = 0;
    SET @tokenCt = 1;

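    contentLoop: LOOP
        -- Extract token number @tokenCt from every row. The WHERE clause skips rows
        -- that have fewer tokens; it tests position ? - 1 so that the last word of
        -- each row is still inserted.
        SET @stmt = 'INSERT INTO wordTmp SELECT REPLACE(SUBSTRING(SUBSTRING_INDEX(`text`, " ", ?),
                                LENGTH(SUBSTRING_INDEX(`text`, " ", ? -1)) + 1),
                                " ", "") word
                     FROM content
                     WHERE LENGTH(SUBSTRING_INDEX(`text`, " ", ? - 1)) != LENGTH(`text`)';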
        PREPARE cmd FROM @stmt;
        EXECUTE cmd USING @tokenCt, @tokenCt, @tokenCt;
        SELECT ROW_COUNT() INTO @wordCt;
        DEALLOCATE PREPARE cmd;
        IF (@wordCt = 0) THEN
            LEAVE contentLoop;
        ELSE
            SET @tokenCt = @tokenCt + 1;
        END IF;
    END LOOP;

    SELECT word, count(*) usageCount FROM wordTmp GROUP BY word ORDER BY usageCount DESC;
END //
DELIMITER ;

CALL wordsUsed();

You might want to write another query (or procedure) or add some nested "REPLACE" statements to further remove punctuation from the resulting temp table of words, but this should be a good start.
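A rough sketch of the nested REPLACE idea, assuming it is run in the same session right after CALL wordsUsed() so that the temporary table wordTmp still exists. The list of stripped characters is an arbitrary choice for illustration, and the alias cleaned is made up:

```sql
-- Strip a few punctuation characters, lowercase the result, and re-count.
-- Tokens that become empty after cleaning are dropped.
SELECT cleaned AS word, COUNT(*) AS usageCount
FROM (
    SELECT LOWER(
             REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
               word, '.', ''), ',', ''), '!', ''), '?', ''), ':', '')
           ) AS cleaned
    FROM wordTmp
) AS w
WHERE cleaned <> ''
GROUP BY cleaned
ORDER BY usageCount DESC;
```

Leftovers such as the ")" from a smiley are not handled here; extending the REPLACE chain or filtering on a pattern would be needed for that.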


 