
Count how many times a word is being used per day

I have a MySQL table named "content" containing (among others) the fields "_date" and "text", for example:

_date      text
---------------------------------------------------------
2011-02-18 I'm afraid my car won't start tomorrow
2011-02-18 I hope I'm going to pass my exams
2011-02-18 Exams coming up - I'm not afraid :P
2011-02-19 Not a single f was given this day
2011-02-20 I still hope I passed, but I'm afraid I didn't
2011-02-20 On my way to school :)

I'm looking for a query to count the number of times the words "hope" and "afraid" are used per day. In other words, the output would have to be something like:

_date      word   count
-----------------------
2011-02-18 hope   1
2011-02-18 afraid 2
2011-02-19 hope   0
2011-02-19 afraid 0
2011-02-20 hope   1
2011-02-20 afraid 1

Is there an easy way to do this, or should I just write a different query per term? I now have this, but I don't know what to put instead of "?":

SELECT COUNT(?) FROM content WHERE text LIKE '%hope' GROUP BY _date

Can somebody help me with the correct query for this?

I think the easiest and most readable way is to write one subquery per word:

 Select 
    _date, 'hope' as word, 
    sum( case when `text` like '%hope%' then 1 else 0 end) as n
 from content
 group by _date
 UNION
 Select 
    _date, 'afraid' as word, 
    sum( case when `text` like '%afraid%' then 1 else 0 end) as n
 from content
 group by _date

This approach does not have the best performance. If performance matters, you should group by day in a subquery; the LIKE condition with a leading wildcard is also a performance killer, since it cannot use an index. This is an acceptable solution if you only run the query in batch mode from time to time. Explain your performance requirements for a more accurate solution.
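If scanning the table once matters more than the exact output shape, one possible variation is to compute both counts in a single pass with conditional aggregation. This is only a sketch: it returns one row per day with one column per word (the column names `hope` and `afraid` are my own choice), and it relies on MySQL evaluating a comparison such as `LIKE` to 0 or 1.

```sql
-- One scan of content instead of one scan per word.
-- SUM() adds up the 0/1 result of each LIKE comparison.
SELECT _date,
       SUM(`text` LIKE '%hope%')   AS hope,
       SUM(`text` LIKE '%afraid%') AS afraid
FROM content
GROUP BY _date
ORDER BY _date;
```

Days with no match still show up here, with 0 in both columns, because no row is filtered out by a WHERE clause.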

EDITED TO MATCH THE OP'S LATEST REQUIREMENT

Your query is almost correct:

SELECT _date, 'hope' AS word, COUNT(*) as count
FROM content WHERE text LIKE '%hope%' GROUP BY _date

Use %hope% to match the word anywhere (not only at the end of the string). COUNT(*) should do what you want.

To get multiple words from a single query, use UNION ALL.
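As a sketch of the UNION ALL suggestion, the query for "hope" can be repeated for "afraid" and the two results concatenated. The final ORDER BY is my addition and applies to the combined result.

```sql
-- One SELECT per word; UNION ALL keeps every row without deduplicating.
SELECT _date, 'hope' AS word, COUNT(*) AS `count`
FROM content
WHERE `text` LIKE '%hope%'
GROUP BY _date
UNION ALL
SELECT _date, 'afraid' AS word, COUNT(*) AS `count`
FROM content
WHERE `text` LIKE '%afraid%'
GROUP BY _date
ORDER BY _date, word;
```

Because each branch filters with WHERE, a day without any match for a word produces no row for that word rather than a row with 0.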


Another approach is to create a sequence of words on the fly and use it as the second table in a join:

SELECT _date, words.word, COUNT(*) as count
FROM (
   SELECT 'hope' AS word
   UNION
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
WHERE text LIKE CONCAT('%', words.word, '%')
GROUP BY _date, words.word

Note that this counts at most one occurrence of each word per sentence. So »I hope there is still hope« will only give you 1, not 2.
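If repeated occurrences within one sentence should be counted as well, a common trick is to compare the length of the text before and after removing the word. The following is a sketch under a few assumptions: the alias `occurrences` is mine, LOWER() is used because REPLACE() is case-sensitive, and it still counts substrings, so "hopeful" would count as a hit for "hope".

```sql
-- (length before - length after removing the word) / word length
-- gives the number of times the word appears in one row.
SELECT _date, words.word,
       SUM(
           (CHAR_LENGTH(`text`)
            - CHAR_LENGTH(REPLACE(LOWER(`text`), words.word, '')))
           DIV CHAR_LENGTH(words.word)
       ) AS occurrences
FROM (
   SELECT 'hope' AS word
   UNION ALL
   SELECT 'afraid' AS word
) AS words
CROSS JOIN content
GROUP BY _date, words.word
ORDER BY _date, words.word;
```

With this version »I hope there is still hope« contributes 2 to the count for "hope".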


To get 0 when there are no matches, join the previous result with the dates again:

-- DISTINCT is needed because content has several rows per date;
-- without it every (date, word) result row is repeated once per content row.
SELECT DISTINCT content._date,
       COALESCE(result.word, 'no match') AS word,
       COALESCE(result.count, 0) AS count
FROM content
LEFT JOIN (
    SELECT _date, words.word, COUNT(*) AS count
    FROM (
       SELECT 'hope' AS word
       UNION
       SELECT 'afraid' AS word
    ) AS words
    CROSS JOIN content
    WHERE text LIKE CONCAT('%', words.word, '%')
    GROUP BY _date, words.word
) AS result
ON content._date = result._date
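If the goal is exactly the output from the question (one row per day and word, with 0 where nothing matched), another option is to build every (date, word) pair first and then LEFT JOIN the matching rows. This is a sketch, not a tuned query; the aliases `d`, `w` and `c` are mine.

```sql
-- d: every distinct day, w: every word, c: the rows that actually match.
-- COUNT(c._date) ignores the NULLs produced by unmatched pairs, giving 0.
SELECT d._date, w.word, COUNT(c._date) AS `count`
FROM (SELECT DISTINCT _date FROM content) AS d
CROSS JOIN (
   SELECT 'hope' AS word
   UNION ALL
   SELECT 'afraid' AS word
) AS w
LEFT JOIN content AS c
       ON c._date = d._date
      AND c.`text` LIKE CONCAT('%', w.word, '%')
GROUP BY d._date, w.word
ORDER BY d._date, w.word;
```

On the sample data this should return rows for 2011-02-19 with a count of 0 for both words, instead of a single 'no match' row.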

Assuming you want to count all words and find the most used ones (rather than counting a few specific words), you might want to try something like the following stored procedure (string splitting courtesy of this blog post):

DROP PROCEDURE IF EXISTS wordsUsed;
DELIMITER //
CREATE PROCEDURE wordsUsed ()
BEGIN
    DROP TEMPORARY TABLE IF EXISTS wordTmp;
    CREATE TEMPORARY TABLE wordTmp (word VARCHAR(255));

    SET @wordCt  = 0;
    SET @tokenCt = 1;

    contentLoop: LOOP
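        -- Extract the @tokenCt-th space-separated token of every row.
        -- The WHERE clause stops once the previous tokens already cover the whole text,
        -- so the last word of each row is included as well.
        SET @stmt = 'INSERT INTO wordTmp SELECT REPLACE(SUBSTRING(SUBSTRING_INDEX(`text`, " ", ?),
                                CHAR_LENGTH(SUBSTRING_INDEX(`text`, " ", ? - 1)) + 1),
                                " ", "") word
                     FROM content
                     WHERE CHAR_LENGTH(SUBSTRING_INDEX(`text`, " ", ? - 1)) != CHAR_LENGTH(`text`)';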
        PREPARE cmd FROM @stmt;
        EXECUTE cmd USING @tokenCt, @tokenCt, @tokenCt;
        SELECT ROW_COUNT() INTO @wordCt;
        DEALLOCATE PREPARE cmd;
        IF (@wordCt = 0) THEN
            LEAVE contentLoop;
        ELSE
            SET @tokenCt = @tokenCt + 1;
        END IF;
    END LOOP;

    SELECT word, count(*) usageCount FROM wordTmp GROUP BY word ORDER BY usageCount DESC;
END //
DELIMITER ;

CALL wordsUsed();

You might want to write another query (or procedure), or add some nested "REPLACE" calls, to further remove punctuation from the resulting temp table of words, but this should be a good start.
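As a rough illustration of the nested REPLACE idea, the query below strips a few punctuation characters and lowercases each word before counting. The list of characters and the alias `cleaned` are only examples; extend them to whatever appears in your data. It assumes it runs in the same session right after CALL wordsUsed(), while the temporary table wordTmp still exists.

```sql
-- Strip a handful of punctuation marks, normalise case, then count.
SELECT LOWER(
           REPLACE(REPLACE(REPLACE(REPLACE(word, ',', ''), '.', ''), '!', ''), '?', '')
       ) AS cleaned,
       COUNT(*) AS usageCount
FROM wordTmp
GROUP BY cleaned
ORDER BY usageCount DESC;
```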
