
Count how many times distinct words appear in a column Oracle 12c SQL

How can you count how many times each distinct word in a column appears?

Below is an example and the expected output:

+--------+------------------------------+
| PERIOD |            STRING            |
+--------+------------------------------+
| 1      | this is some text            |
| 2      | more text                    |
| 3      | this could be some more text |
+--------+------------------------------+

+-------+-------+
| WORD  | COUNT |
+-------+-------+
| this  | 2     |
| is    | 1     |
| some  | 2     |
| text  | 3     |
| more  | 2     |
| could | 1     |
| be    | 1     |
+-------+-------+

Thanks,

You can use a hierarchical query such as:

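WITH t2 AS
(
 -- level n picks the n-th whitespace-delimited word of each row
 SELECT REGEXP_SUBSTR(LOWER(string),'[^[:space:]]+',1,level) AS word
   FROM t
 -- one level per word: runs of whitespace + 1
 -- (the class must be written [[:space:]]; a bare [:space:] matches the letters : s p a c e)
CONNECT BY level <= REGEXP_COUNT(string,'[[:space:]]+') + 1
    AND PRIOR SYS_GUID() IS NOT NULL  -- makes each generated row unique so no CONNECT BY loop is reported
    AND PRIOR period = period         -- keeps each row's hierarchy separate
)
SELECT word, COUNT(*) AS count
  FROM t2
 WHERE word IS NOT NULL
 GROUP BY word;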

Demo

P.S. The LOWER() function is applied so that the counts are case-insensitive.
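To illustrate the point about case, here is a minimal sketch that uses two hypothetical inline rows (not the question's table) and the alias cnt, which I picked for the example. Without LOWER() the two spellings would land in separate groups; with it they are counted together.

```sql
-- Two spellings of the same word, supplied inline for illustration only
SELECT LOWER(word) AS word,
       COUNT(*)    AS cnt
FROM  (
        SELECT 'This' AS word FROM DUAL UNION ALL
        SELECT 'this'         FROM DUAL
      )
GROUP BY LOWER(word);
-- returns a single row: this | 2
```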

The trick is splitting the string into words. One method uses a recursive CTE:

with words(word, string, n) as (
      select regexp_substr(string, '[^ ]+', 1, 1) as word, string, 1 as n
      from t
      union all
      select regexp_substr(string, '[^ ]+', 1, n + 1), string, n + 1
      from words
      where regexp_substr(string, '[^ ]+', 1, n + 1) is not null
     )
select word, count(*)
from words
group by word;

Here is a db<>fiddle.
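The recursive CTE above relies on the fourth argument of REGEXP_SUBSTR, the occurrence number, to walk through the words one at a time and to stop when it returns NULL. As a small sketch of that behaviour, using a literal from the question rather than the table:

```sql
-- The 4th argument selects the n-th match of '[^ ]+' (a run of non-space characters)
SELECT REGEXP_SUBSTR('this is some text', '[^ ]+', 1, 1) AS first_word, -- this
       REGEXP_SUBSTR('this is some text', '[^ ]+', 1, 3) AS third_word, -- some
       REGEXP_SUBSTR('this is some text', '[^ ]+', 1, 5) AS past_end    -- NULL, only 4 words
FROM   DUAL;
```

The NULL for the fifth occurrence is what the WHERE clause in the recursive member tests in order to end the recursion for that string.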

You can do it without (slow) regular expressions using simple string functions:

WITH word_bounds ( string, start_pos, end_pos ) AS (
  SELECT string,
         1,
         INSTR( string, ' ', 1 )
  FROM   table_name
UNION ALL
  SELECT string,
         end_pos + 1,
         INSTR( string, ' ', end_pos + 1 )
  FROM   word_bounds
  WHERE  end_pos > 0
),
words ( word ) AS (
SELECT CASE end_pos
       WHEN 0
       THEN SUBSTR( string, start_pos )
       ELSE SUBSTR( string, start_pos, end_pos - start_pos )
       END
FROM   word_bounds
)
SELECT word,
       COUNT(*) AS frequency
FROM   words
GROUP BY
       word
ORDER BY
       frequency desc, word;

Which, for the sample data:

CREATE TABLE table_name ( PERIOD, STRING ) AS
SELECT 1, 'this is some text' FROM DUAL UNION ALL
SELECT 2, 'more text' FROM DUAL UNION ALL
SELECT 3, 'this could be some more text' FROM DUAL;

Outputs:

WORD  | FREQUENCY
:---- | --------:
text  |         3
more  |         2
some  |         2
this  |         2
be    |         1
could |         1
is    |         1
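To see how the word_bounds CTE in the query above locates each word, the following sketch applies the same INSTR and SUBSTR calls by hand to one sample string. The column aliases are only illustrative names.

```sql
-- 'more text': the only space is at position 5
SELECT INSTR( 'more text', ' ', 1 )   AS first_space, -- 5
       INSTR( 'more text', ' ', 6 )   AS next_space,  -- 0, no further space
       SUBSTR( 'more text', 1, 5 - 1 ) AS first_word, -- more
       SUBSTR( 'more text', 6 )        AS last_word   -- text
FROM   DUAL;
```

An end_pos of 0 means there are no more spaces, which is why the words CTE takes the rest of the string in that case and why the recursion stops there.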

There is a discussion on the performance of different ways of splitting delimited strings here.

db<>fiddle here


 