How to count occurrences of words from one table in the comments of another table

Question

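I'm trying to accomplish a task in Google BigQuery that may require logic I'm not sure SQL can handle natively.

I have 2 tables:

The first table has a single column, where each row is one lowercase word
The second table is a database of comments (including who made the comment, the comment itself, a timestamp, and so on)

I want to order the comments in the second table by the number of words from the first table that occur in each comment.

Here is a basic example of what I want to do, in Python, using letters instead of words... but you get the idea: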

words = ['a','b','c','d','e']

comments = ['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']

wordcount = {}

for comment in comments:
    for word in words:
        if word in comment:
            if comment in wordcount:
                wordcount[comment] += 1
            else:
                wordcount[comment] = 1

print(sorted(wordcount.items(), key = lambda k: k[1], reverse=True))

Output:

[('look another sentence, which is also a comment', 3), ('this is another comment', 3), ('this is the first sentence', 2), ('nope', 1)]

So far, the best approach I have seen for doing this in SQL is something like the following:

SELECT
    COUNT(*)
FROM
    table
WHERE
    comment_col like '%word1%'
    OR comment_col like '%word2%'
    OR ...

But there are over 2000 words... it just doesn't feel right. Any tips?

Answer 1

Below is for BigQuery Standard SQL

#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
-- ORDER BY cnt DESC
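
To see what this join computes without a BigQuery project, here is a minimal sketch that runs the same shape of query against an in-memory SQLite database from Python. This is an assumption-laden stand-in, not BigQuery itself: SQLite's `instr()` is used in place of `STRPOS`, the `count_matches` helper is hypothetical, and the sample rows are the ones from the question.

```python
import sqlite3

# Sample data taken from the question; table and column names follow the answer.
WORDS = ['a', 'b', 'c', 'd', 'e']
COMMENTS = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope', 'no', 'run',
]

def count_matches(words, comments):
    """Return (comment, cnt) rows, highest count first, using a substring join."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE words (word TEXT)')
    conn.execute('CREATE TABLE comments (comment TEXT)')
    conn.executemany('INSERT INTO words VALUES (?)', [(w,) for w in words])
    conn.executemany('INSERT INTO comments VALUES (?)', [(c,) for c in comments])
    # instr() plays the role of STRPOS here: it returns 0 when word is absent.
    rows = conn.execute(
        'SELECT comment, COUNT(word) AS cnt '
        'FROM comments JOIN words ON instr(comment, word) > 0 '
        'GROUP BY comment ORDER BY cnt DESC, comment'
    ).fetchall()
    conn.close()
    return rows

result = count_matches(WORDS, COMMENTS)
for comment, cnt in result:
    print(cnt, comment)
```

Because this is an inner join, comments that contain none of the words ('no' and 'run' in the sample) do not appear in the result at all.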

As an option, you can use a regexp if you prefer:

#standardSQL
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, word)
GROUP BY comment
-- ORDER BY cnt DESC
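
One thing to keep in mind with the regexp variant is that each word is interpreted as a pattern rather than as literal text, so an entry containing a character such as `.` can match more than intended. The sketch below illustrates the difference using Python's `re` module as a stand-in for the regex engine; the helper names and the sample word `'1.5'` are hypothetical and not part of the original answer.

```python
import re

def regex_contains(comment, word):
    """Treat word as a regular expression, as a regex-based join condition would."""
    return re.search(word, comment) is not None

def literal_contains(comment, word):
    """Treat word as literal text by escaping regex metacharacters first."""
    return re.search(re.escape(word), comment) is not None

comment = 'upgraded to build 125 today'
word = '1.5'  # hypothetical word-list entry containing a metacharacter

print(regex_contains(comment, word))    # '.' matches any character, so '125' matches
print(literal_contains(comment, word))  # only the exact text '1.5' would match
```

If the word list contains only plain lowercase words, as described in the question, the two approaches should behave the same.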

You can test / play with the above using the dummy example from your question:

#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','b','c','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON STRPOS(comment, word) > 0 
GROUP BY comment
ORDER BY cnt DESC

Update:

Any quick suggestions for matching complete words only?

#standardSQL
WITH words AS (
  SELECT word
  FROM UNNEST(['a','no','is','d','e']) word
),
comments AS (
  SELECT comment 
  FROM UNNEST(['this is the first sentence', 'this is another comment', 'look another sentence, which is also a comment', 'nope', 'no', 'run']) comment
)
SELECT comment, COUNT(word) AS cnt
FROM comments
JOIN words
ON REGEXP_CONTAINS(comment, CONCAT(r'\b', word, r'\b')) 
GROUP BY comment
ORDER BY cnt DESC
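
The word-boundary idea can be checked locally with a short Python sketch. It mirrors the `\b ... \b` pattern from the query above using Python's `re` module, which is an assumption made for illustration and not the BigQuery engine; `whole_word_counts` is a hypothetical helper, and `re.escape` is added so that each word is matched literally.

```python
import re

WORDS = ['a', 'no', 'is', 'd', 'e']
COMMENTS = [
    'this is the first sentence',
    'this is another comment',
    'look another sentence, which is also a comment',
    'nope', 'no', 'run',
]

def whole_word_counts(words, comments):
    """Map each comment to the number of listed words it contains as whole words."""
    counts = {}
    for comment in comments:
        n = sum(
            1 for word in words
            if re.search(r'\b' + re.escape(word) + r'\b', comment)
        )
        if n:  # mirror the inner join: comments with no match are dropped
            counts[comment] = n
    return counts

counts = whole_word_counts(WORDS, COMMENTS)
for comment, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    print(n, comment)
```

With boundaries in place, 'no' no longer matches inside 'nope', and 'is' no longer matches inside 'this'.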

Answer 2

If I understand correctly, I think you need a query like this:

select comment, count(*) cnt
from comments
join words
  on comment like '% ' + word + ' %'   --this checks for `... word ..`; a word between spaces
  or comment like word + ' %'          --this checks for `word ..`; a word at the start of comment
  or comment like '% ' + word          --this checks for `.. word`; a word at the end of comment
  or comment = word                    --this checks for `word`; whole comment is the word
group by comment
order by count(*) desc
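
The four `like` / `=` conditions above amount to "the word is bounded by spaces or by the ends of the comment". A rough Python equivalent, given only as an illustrative sketch with a hypothetical helper name, pads both strings with a space so that a single containment test covers all four cases:

```python
def space_delimited_match(comment, word):
    """True when word appears in comment bounded by spaces or the string ends."""
    # Padding both sides with a space folds the four LIKE cases into one test.
    return f' {word} ' in f' {comment} '

comment = 'look another sentence, which is also a comment'
print(space_delimited_match(comment, 'comment'))   # word at the end of the comment
print(space_delimited_match(comment, 'look'))      # word at the start of the comment
print(space_delimited_match(comment, 'sentence'))  # followed by a comma, not a space
print(space_delimited_match('no', 'no'))           # whole comment is the word
```

The third call shows the main limitation of a space-delimited check: a word directly followed by punctuation is not counted, which is where a word-boundary regex behaves differently.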

See the SQL Server Fiddle Demo for an example.

Solution 1
2 accepted 2017-10-22 14:54:39

Solution 2
1 2017-10-22 09:47:20
