Okay, so I've asked several programmers I know, and no one could come up with a way to do this. Please help if you can!
I'm doing a pattern match for hospital procedures, and in this example I want to match 3 of 4 words from one concept to another. Basically, I want “x, z, y” to match “x, a, y, z” (keeping in mind that I have already removed all non-alphanumeric characters so I can do this). Below is a long-hand example; I need a way to make it dynamic based on word count instead of writing it out for EVERY ITERATION. Examples of names that should match:
'Spinal Fusion' = 'Fusion of the Spine'
'Mammogram-bilateral' = 'bilateral mammogram scan'
'Echocardiogram (ECG)' = 'ECG'
I wrote up how it COULD work, but some of these would need several dozen iterations, so it would have to be a kind of CASE WHEN statement. If anyone knows how to make this dynamic, I'd be eternally grateful.
WHEN regexp_count(x.y,'(\w+)+') =4 and regexp_count(a.b,'(\w+)+') =3 -- when the word counts are 4 and 3
AND (
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
or
( regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
or
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,3) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
or
( regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,1) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,2) = regexp_substr (a.b,'\w+\b',1,3))
and(
regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,1)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,2)
or regexp_substr (x.y,'\w+\b',1,4) = regexp_substr (a.b,'\w+\b',1,3))
)
THEN x.y = a.b
Try Vertica's Text Indexing package.
One approach is to build an auxiliary table that you can then join back to the base table to find the matching strings:
DROP TABLE IF EXISTS textbase CASCADE;
CREATE TABLE textbase(
id INT NOT NULL PRIMARY KEY
, txt VARCHAR(32)
) UNSEGMENTED ALL NODES;
INSERT INTO textbase
SELECT 0,'Spinal Fusion'
UNION ALL SELECT 1,'Fusion of the Spine'
UNION ALL SELECT 2,'Mammogram - bilateral'
UNION ALL SELECT 3,'bilateral mammogram scan'
UNION ALL SELECT 4,'Echocardiogram (ECG)'
UNION ALL SELECT 5,'ECG'
;
COMMIT;
-- Work with the Vertica standard Text Index package
-- either write your own stemmer, which removes articles and prepositions
-- and typical suffixes, or do the below - adding a pre-stemmed column.
ALTER TABLE textbase ADD prestemmed VARCHAR(32) DEFAULT
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
txt
-- remove articles
, ' the\b'
, ''
, 1
, 1
,'i'
)
-- remove prepositions
, ' of\b'
, ''
, 1
, 1
,'i'
)
-- remove "al" and "e" suffixes
, 'e\b|al\b'
, ''
, 1
, 1
,'i'
);
-- Create your text index
CREATE TEXT INDEX textindex ON textbase(id,prestemmed)
TOKENIZER v_txtindex.BasicLogTokenizer (LONG VARCHAR)
STEMMER v_txtindex.Stemmer(LONG VARCHAR)
;
-- The text index table joins to the INTEGER primary key of the base table using "doc_id"
-- and has one row per token / keyword
SELECT * FROM textbase JOIN textindex ON id=doc_id ORDER BY doc_id;
-- out  id | txt                      | prestemmed             | token          | doc_id
-- out ----+--------------------------+------------------------+----------------+--------
-- out   0 | Spinal Fusion            | Spin Fusion            | spin           |      0
-- out   0 | Spinal Fusion            | Spin Fusion            | fusion         |      0
-- out   1 | Fusion of the Spine      | Fusion Spin            | spin           |      1
-- out   1 | Fusion of the Spine      | Fusion Spin            | fusion         |      1
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | mammogram      |      2
-- out   2 | Mammogram - bilateral    | Mammogram - bilater    | bilat          |      2
-- out   3 | bilateral mammogram scan | bilater mammogram scan | scan           |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | mammogram      |      3
-- out   3 | bilateral mammogram scan | bilater mammogram scan | bilat          |      3
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | echocardiogram |      4
-- out   4 | Echocardiogram (ECG)     | Echocardiogram (ECG)   | ecg            |      4
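To see what the pre-stemming column does outside of Vertica, the three nested REGEXP_REPLACE calls can be mirrored in a few lines of Python. This is only a rough sketch: the function names `prestem` and `tokens` are made up for illustration, and `tokens` is a plain lowercase word split that does not reproduce Vertica's tokenizer or stemmer (which, for instance, reduces "bilater" further to "bilat").

```python
import re

def prestem(txt):
    """Mirror the nested REGEXP_REPLACE calls: each pass removes only the
    first occurrence of its pattern, case-insensitively."""
    txt = re.sub(r" the\b", "", txt, count=1, flags=re.IGNORECASE)    # articles
    txt = re.sub(r" of\b", "", txt, count=1, flags=re.IGNORECASE)     # prepositions
    txt = re.sub(r"e\b|al\b", "", txt, count=1, flags=re.IGNORECASE)  # "e"/"al" suffix
    return txt

def tokens(txt):
    # Hypothetical stand-in for the text index: lowercase runs of word
    # characters from the pre-stemmed string. NOT Vertica's stemmer.
    return re.findall(r"\w+", prestem(txt).lower())

for name in ["Spinal Fusion", "Fusion of the Spine", "bilateral mammogram scan"]:
    print(repr(name), "->", repr(prestem(name)), tokens(name))
```

Because every pass removes only the first occurrence, a longer name with two articles or two "al" suffixes would keep the second one, so a production stemmer would likely need to be more thorough.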
With the text index in place, you can apply a 3-of-4 keyword match by comparing each document's total token count with the number of tokens it shares with another document. This produces an in-line table that you can again join with the base table:
WITH -- count number of tokens per doc_id ...
wcount AS (
SELECT
doc_id
, count(*) AS wcount
FROM textindex
GROUP BY 1
)
,
-- count how many tokens match between each pair of distinct "doc_id"s ...
-- and keep the pairs where at least 75% of the first document's tokens match
-- (">=" rather than ">", so that exactly 3 of 4 matching tokens qualifies)
matchcount AS (
SELECT
a.doc_id AS a_doc_id
, b.doc_id AS b_doc_id
, count(*) AS matchcount
FROM textindex a
JOIN textindex b USING (token)
WHERE a.doc_id <> b.doc_id
GROUP BY
1
, 2
HAVING count(*) >= (SELECT wcount * .75 FROM wcount WHERE doc_id = a.doc_id)
)
)
SELECT
QUOTE_LITERAL(a.txt) ||' is probably equal to '||QUOTE_LITERAL(b.txt) AS assumption
FROM matchcount
JOIN textbase a ON a.id=a_doc_id
JOIN textbase b ON b.id=b_doc_id
;
-- out assumption
-- out -------------------------------------------------------------------------
-- out 'Spinal Fusion' is probably equal to 'Fusion of the Spine'
-- out 'Fusion of the Spine' is probably equal to 'Spinal Fusion'
-- out 'Mammogram - bilateral' is probably equal to 'bilateral mammogram scan'
-- out 'ECG' is probably equal to 'Echocardiogram (ECG)'
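The counting rule behind that query can also be sketched independently of Vertica. The Python below is an illustration under assumptions, not the answer's implementation: `probable_matches` is a hypothetical helper, the token lists are copied from the index output (the single token for document 5, 'ECG', is assumed because its row is not shown), and the comparison uses `>=` so that exactly 3 of 4 shared tokens counts as a match, which a strict `>` at 0.75 would reject.

```python
from itertools import permutations

def probable_matches(docs, threshold=0.75):
    """docs maps doc_id -> iterable of tokens. Returns ordered pairs (a, b)
    where the share of a's distinct tokens also found in b is >= threshold."""
    token_sets = {doc_id: set(toks) for doc_id, toks in docs.items()}
    pairs = []
    for a, b in permutations(token_sets, 2):  # ordered pairs, a != b
        shared = len(token_sets[a] & token_sets[b])
        if shared and shared >= threshold * len(token_sets[a]):
            pairs.append((a, b))
    return pairs

docs = {
    0: ["spin", "fusion"],
    1: ["spin", "fusion"],
    2: ["mammogram", "bilat"],
    3: ["scan", "mammogram", "bilat"],
    4: ["echocardiogram", "ecg"],
    5: ["ecg"],  # assumed: this row is not shown in the index output
}
print(probable_matches(docs))
```

The rule is deliberately asymmetric: the threshold is applied to the first document's token count, so a short name can match a longer one without the reverse being true.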