
SQL Regex substr pattern match

Okay, so I've asked the programmers I know, and no one could come up with a way to do this. Please help if you can!

I'm doing a pattern match for hospital procedures, and in this example it would be matching 3 of 4 words from one concept to another. Basically, I want “x, z, y” to match “x, a, y, z” (keeping in mind that I have already removed all non-alphanumeric characters so I can do this). Below is a long-hand example; I need a way to make it dynamic based on word count instead of writing it out for EVERY ITERATION. The example:

'Spinal Fusion' = 'Fusion of the Spine' 
'Mammogram-bilateral' = 'bilateral mammogram scan' 
'Echocardiogram (ECG)' = 'ECG'

I wrote up how it COULD work, but some of these have several dozen iterations, so it would need to be some kind of CASE WHEN statement. If anyone knows how to make this dynamic, I'd be eternally grateful.

    WHEN regexp_count(x.y,'(\w+)+') = 4 and regexp_count(a.b,'(\w+)+') = 3 -- when the word counts are 4 and 3
    AND (
        -- words 1, 2 and 3 of x.y each match some word of a.b
        (
            (      regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
        )
    or
        -- words 2, 3 and 4 of x.y each match some word of a.b
        (
            (      regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
        )
    or
        -- words 1, 3 and 4 of x.y each match some word of a.b
        (
            (      regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
        )
    or
        -- words 1, 2 and 4 of x.y each match some word of a.b
        (
            (      regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
        and (
                   regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
                or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
        )
    )
    THEN x.y = a.b

Try Vertica's Text Indexing package.

Documentation: https://www.vertica.com/docs/9.3.x/HTML/Content/Authoring/AdministratorsGuide/Tables/TextSearch/TextSearchConceptual.htm?tocpath=Administrator%27s%20Guide%7CUsing%20Text%20Search%7C_____0

The approach below builds an auxiliary table that you can then join to the base table to find the matching strings:

DROP TABLE IF EXISTS textbase CASCADE;
CREATE TABLE textbase(
  id INT NOT NULL PRIMARY KEY
, txt VARCHAR(32)
) UNSEGMENTED ALL NODES;

INSERT INTO textbase
          SELECT 0,'Spinal Fusion'
UNION ALL SELECT 1,'Fusion of the Spine' 
UNION ALL SELECT 2,'Mammogram - bilateral'
UNION ALL SELECT 3,'bilateral mammogram scan' 
UNION ALL SELECT 4,'Echocardiogram (ECG)'
UNION ALL SELECT 5,'ECG'
;
COMMIT;

-- Work with the Vertica standard Text Index package

-- either write your own stemmer, which removes articles and prepositions
-- and typical suffixes, or do the below - adding a pre-stemmed column.
ALTER TABLE textbase ADD prestemmed VARCHAR(32) DEFAULT 
 REGEXP_REPLACE(
   REGEXP_REPLACE(
     REGEXP_REPLACE(
       txt
     -- remove articles
     , ' the\b'
     , ''
     , 1
     , 1
     ,'i'
     )
   -- remove prepositions
   , ' of\b'
   , ''
   , 1
   , 1
   ,'i'
   )
 -- remove "al" and "e" suffixes
 , 'e\b|al\b'
 , ''
 , 1
 , 1
 ,'i'
);

-- Create your text index
CREATE TEXT INDEX textindex ON textbase(id,prestemmed) 
TOKENIZER v_txtindex.BasicLogTokenizer (LONG VARCHAR)
STEMMER v_txtindex.Stemmer(LONG VARCHAR)
;

-- The text index table joins to the INTEGER primary key of the base table using "doc_id"
-- and has one row per token / keyword
SELECT * FROM textbase JOIN textindex ON id=doc_id ORDER BY doc_id;

-- out  id |           txt            |       prestemmed       |     token      | doc_id 
-- out ----+--------------------------+------------------------+----------------+--------
-- out   0 | Spinal Fusion            | Spin Fusion            | spin           |      0
-- out   0 | Spinal Fusion            | Spin Fusion            | fusion         |      0
-- out   1 | Fusion of the Spine      | Fusion Spin            | spin           |      1
-- out   1 | Fusion of the Spine      | Fusion Spin            | fusion         |      1
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | mammogram      |      2
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | bilat          |      2
-- out   3 | bilateral mammogram scan | bilater mammogram scan | scan           |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | mammogram      |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | bilat          |      3
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | echocardiogram |      4
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | ecg            |      4
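
To make the two steps above easier to follow outside Vertica, here is a minimal Python sketch (not Vertica SQL) of the pre-stemming expressions and of splitting the result into lowercase tokens, one row per token. The helper names `prestem` and `index_rows` are hypothetical, and de-duplicating tokens per document is a choice made in this sketch. It does not reproduce `v_txtindex.Stemmer`, so a token such as `bilater` stays as it is here instead of becoming `bilat`.

```python
import re

def prestem(txt):
    """Mimic the three nested REGEXP_REPLACE calls: first match only, case-insensitive."""
    txt = re.sub(r" the\b", "", txt, count=1, flags=re.IGNORECASE)    # remove article
    txt = re.sub(r" of\b", "", txt, count=1, flags=re.IGNORECASE)     # remove preposition
    txt = re.sub(r"e\b|al\b", "", txt, count=1, flags=re.IGNORECASE)  # remove "e" / "al" suffix
    return txt

def index_rows(docs):
    """Return (doc_id, token) pairs, one per distinct token of each document."""
    rows = []
    for doc_id, txt in docs.items():
        # lowercase, split on non-word characters, drop duplicates
        for token in sorted(set(re.findall(r"\w+", prestem(txt).lower()))):
            rows.append((doc_id, token))
    return rows

docs = {0: "Spinal Fusion", 1: "Fusion of the Spine"}
rows = index_rows(docs)
```

Both sample strings end up with the same two tokens, `fusion` and `spin`, which is what makes the later token comparison possible.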

With the text index above in place, you can apply a 3-of-4 keyword match by comparing the number of matching tokens with each document's total token count. The result is an in-line table that you can again join with the base table:

WITH -- count number of tokens per doc_id ...
wcount AS (
   SELECT 
     doc_id
   , count(*) AS wcount
   FROM textindex
   GROUP BY 1
) 
, 
-- count how many tokens two different documents (different "doc_id") share ...
-- and keep the pair if the shared tokens are at least 75% of the first document's tokens
matchcount AS (
   SELECT 
     a.doc_id AS a_doc_id
   , b.doc_id AS b_doc_id
   , count(*) AS matchcount
   FROM textindex a
   JOIN textindex b USING (token)
   WHERE a.doc_id <> b.doc_id
   GROUP BY 
     1
   , 2
   -- ">=" so that exactly 3 matching tokens out of 4 qualifies
   HAVING count(*) >= (SELECT wcount * .75 FROM wcount WHERE doc_id = a.doc_id)
)
SELECT
  QUOTE_LITERAL(a.txt) ||' is probably equal to '||QUOTE_LITERAL(b.txt) AS assumption
FROM matchcount
JOIN textbase a ON a.id=a_doc_id
JOIN textbase b ON b.id=b_doc_id
;
-- out                                 assumption
-- out -------------------------------------------------------------------------
-- out  'Spinal Fusion' is probably equal to 'Fusion of the Spine'
-- out  'Fusion of the Spine' is probably equal to 'Spinal Fusion'
-- out  'Mammogram - bilateral' is probably equal to 'bilateral mammogram scan'
-- out  'ECG' is probably equal to 'Echocardiogram (ECG)'
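
The threshold logic of the query can also be checked outside the database. The following Python sketch (not Vertica SQL) pairs every document with every other one, counts the tokens they share, and keeps a pair when the shared count reaches a given share of the first document's tokens. The function name `probable_matches` and the `min_share` parameter are hypothetical, and the inclusive comparison (so that exactly 3 of 4 tokens qualifies) is a choice made in this sketch. The token sets are copied from the index output above; the single token for document 5 is assumed, since that row is not shown there.

```python
from itertools import permutations

def probable_matches(tokens_by_doc, min_share=0.75):
    """Return ordered (a, b) pairs where enough of a's tokens also occur in b."""
    pairs = []
    # permutations(..., 2) gives every pair with a != b, in both directions
    for a, b in permutations(sorted(tokens_by_doc), 2):
        shared = len(tokens_by_doc[a] & tokens_by_doc[b])  # number of matching tokens
        if shared and shared >= min_share * len(tokens_by_doc[a]):
            pairs.append((a, b))
    return pairs

tokens_by_doc = {
    0: {"spin", "fusion"},
    1: {"spin", "fusion"},
    2: {"mammogram", "bilat"},
    3: {"scan", "mammogram", "bilat"},
    4: {"echocardiogram", "ecg"},
    5: {"ecg"},  # assumed: this row is missing from the index output shown
}
matches = probable_matches(tokens_by_doc)
```

For the six sample rows this yields the pairs (0, 1), (1, 0), (2, 3) and (5, 4), the same four pairs the query reports. A pair is directional because the threshold is measured against the first document's token count only.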
