简体   繁体   中英

How to count number of words in delimited string in Oracle SQL

For example i have the table called 'Table1'. and column called 'country'. I want to count the value of word in string.below is my data for column 'country':

country:
"japan singapore japan chinese chinese chinese"

expected output: in above data we can see the japan appear two time, singapore once and chinese 3 times.i want to count value of word where japan is count as one, singapore as one and chinese as one. hence the ouput will be 3. please help me

ValueOfWord: 3

Firstly, it is a bad design to store multiple values in a single column as delimited string. You should consider normalizing the data as a permanent solution.

With the denormalized data, you could do it in a single SQL using REGEXP_SUBSTR :

SELECT COUNT(DISTINCT(regexp_substr(country, '[^ ]+', 1, LEVEL))) as "COUNT"
FROM table_name
  CONNECT BY LEVEL <= regexp_count(country, ' ')+1 
/

Demo:

SQL> WITH sample_data AS
  2    ( SELECT 'japan singapore japan chinese chinese chinese' str FROM dual
  3    )
  4  -- end of sample_data mocking real table
  5  SELECT COUNT(DISTINCT(regexp_substr(str, '[^ ]+', 1, LEVEL))) as "COUNT"
  6  FROM sample_data
  7    CONNECT BY LEVEL <= regexp_count(str, ' ')+1
  8  /

     COUNT
----------
         3

See Split single comma delimited string into rows in Oracle to understand how the query works.


UPDATE

For multiple delimited string rows you need to take care of the number of rows formed by the CONNECT BY clause.

See Split comma delimited strings in a table in Oracle for more ways of doing the same task.

Setup

Let's say you have a table with 3 rows like this:

SQL> CREATE TABLE t(country VARCHAR2(200));

Table created.

SQL> INSERT INTO t VALUES('japan singapore japan chinese chinese chinese');

1 row created.

SQL> INSERT INTO t VALUES('singapore indian malaysia');

1 row created.

SQL> INSERT INTO t VALUES('french french french');

1 row created.

SQL> COMMIT;

Commit complete.

SQL> SELECT * FROM t;

COUNTRY
---------------------------------------------------------------------------
japan singapore japan chinese chinese chinese
singapore indian malaysia
french french french
  • Using REGEXP_SUBSTR and REGEXP_COUNT :

We expect the output as 6 since there are 6 unique strings.

SQL> SELECT COUNT(DISTINCT(regexp_substr(t.country, '[^ ]+', 1, lines.column_value))) count
  2    FROM t,
  3      TABLE (CAST (MULTISET
  4      (SELECT LEVEL FROM dual
  5              CONNECT BY LEVEL <= regexp_count(t.country, ' ')+1
  6      ) AS sys.odciNumberList ) ) lines
  7    ORDER BY lines.column_value
  8  /

     COUNT
----------
         6

There are many other methods to achieve the desired output. Let's see how:

  • Using XMLTABLE
SQL> SELECT COUNT(DISTINCT(country)) COUNT
  2  FROM
  3    (SELECT trim(COLUMN_VALUE) country
  4    FROM t,
  5      xmltable(('"'
  6      || REPLACE(country, ' ', '","')
  7      || '"'))
  8    )
  9  /

     COUNT
----------
         6
  • Using MODEL clause
SQL> WITH
  2       model_param AS
  3       (
  4        SELECT country AS orig_str ,
  5               ' '
  6               || country
  7               || ' '                                 AS mod_str ,
  8               1                                      AS start_pos ,
  9              Length(country)                           AS end_pos ,
 10              (LENGTH(country) -
 11              LENGTH(REPLACE(country, ' '))) + 1        AS element_count ,
 12              0                                      AS element_no ,
 13              ROWNUM                                 AS rn
 14        FROM   t )
 15        SELECT COUNT(DISTINCT(Substr(mod_str, start_pos, end_pos-start_pos))) count
 16        FROM (
 17              SELECT *
 18              FROM   model_param
 19              MODEL PARTITION BY (rn, orig_str, mod_str)
 20              DIMENSION BY (element_no)
 21              MEASURES (start_pos, end_pos, element_count)
 22              RULES ITERATE (2000)
 23              UNTIL (ITERATION_NUMBER+1 = element_count[0])
 24            ( start_pos[ITERATION_NUMBER+1] =
 25                      instr(cv(mod_str), ' ', 1, cv(element_no)) + 1,
 26              end_pos[ITERATION_NUMBER+1] =
 27                      instr(cv(mod_str), ' ', 1, cv(element_no) + 1) )
 28            )
 29         WHERE    element_no != 0
 30    ORDER BY      mod_str , element_no
 31   /

     COUNT
----------
         6

Did you store that kind of string in a single entry?

If not, try

SELECT COUNT(*)
    FROM (SELECT DISTINCT T.country FROM Table1 T)

If yes, I would write an external program to parse the string and return the result you want.

Like using java.

Create a String set.

I would use JDBC to retrieve the record, and use split to split strings in tokens using ' 'delimiter. For every token, if it is not in the set, add it to the set.

When parse finishes, get the length of the set, which is the value you want.

Break the string based on the space delimiter

SELECT COUNT(DISTINCT regexp_substr(col, '[^ ]+', 1, LEVEL))
  FROM T
CONNECT BY LEVEL <= regexp_count(col, ' ')+1

For counting DISTINCT words

    SELECT col,
COUNT(DISTINCT regexp_substr(col, '[^ ]+', 1, LEVEL))
  FROM T
CONNECT BY LEVEL <= regexp_count(col, ' ')+1
GROUP BY col

FIDDLE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM