
Teradata SQL select string of multiple capital letters in a text field

Any help would be much appreciated in figuring out how to identify acronyms within a text field that has mixed upper- and lower-case letters.
For example, we might have "we used the BBQ sauce on the Chicken". I need my query to SELECT "BBQ" and nothing else in the cell. There could be multiple capitalized strings per row. The output should include the uppercase string.

Any ideas are much appreciated!!

This is going to be kind of ugly. I tried to use REGEXP_SPLIT_TO_TABLE to pull out just the all-caps words, but couldn't make it work.

I would do it by first using strtok_split_to_table, so each word ends up in its own row.

First, some dummy data:

create volatile table vt 
(id integer,
col1 varchar(20))
on commit preserve rows;

insert into vt
values (1,'foo BAR');

insert into vt
values (2,'fooBAR');

insert into vt
values(3,'blah FOO FOO blah');

We can use strtok_split_to_table on this:

select
t.*
from table
(strtok_split_to_table(vt.id ,vt.col1,' ')
returns
(tok_key integer 
,tok_num INTEGER
,tok_value VARCHAR(30)
)) AS t

That will split each value into separate rows, using a space as the delimiter.

Finally, we can compare each of those values to the same value in upper case:

select
vt.id,
vt.col1,
tok_key,
tok_num,
tok_value,
case when upper(t.tok_value) = t.tok_value (CASESPECIFIC) then tok_value else '0' end as test_output
from
(
select
t.*
from table
(strtok_split_to_table(vt.id ,vt.col1,' ')
returns
(tok_key integer 
,tok_num INTEGER
,tok_value VARCHAR(30)
)) AS t
) t
inner join vt
    on t.tok_key = vt.id
order by id,tok_num

Taking our lovely sample data, you'll get:

+----+-------------------+---------+---------+-----------+-------------+
| id |       col1        | tok_key | tok_num | tok_value | TEST_OUTPUT |
+----+-------------------+---------+---------+-----------+-------------+
|  1 | foo BAR           |       1 |       1 | foo       | 0           |
|  1 | foo BAR           |       1 |       2 | BAR       | BAR         |
|  2 | fooBAR            |       2 |       1 | fooBAR    | 0           |
|  3 | blah FOO FOO blah |       3 |       1 | blah      | 0           |
|  3 | blah FOO FOO blah |       3 |       2 | FOO       | FOO         |
|  3 | blah FOO FOO blah |       3 |       3 | FOO       | FOO         |
|  3 | blah FOO FOO blah |       3 |       4 | blah      | 0           |
+----+-------------------+---------+---------+-----------+-------------+
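
If you only want the acronym rows rather than a flag column, the same case-specific comparison can move into the WHERE clause. This is a minimal sketch, assuming the vt table above. The two extra conditions (at least two characters, and at least one letter) are assumptions about what should count as an acronym, not something the question specifies.

```sql
-- Sketch: keep only the tokens that are entirely upper case.
SELECT
    t.tok_key   AS id,
    t.tok_num,
    t.tok_value AS acronym
FROM TABLE
    (strtok_split_to_table(vt.id, vt.col1, ' ')
     RETURNS
        (tok_key   INTEGER
        ,tok_num   INTEGER
        ,tok_value VARCHAR(30)
        )
    ) AS t
WHERE upper(t.tok_value) = t.tok_value (CASESPECIFIC)
  -- assumption: drop tokens with no letters at all, such as '123'
  AND lower(t.tok_value) <> t.tok_value (CASESPECIFIC)
  -- assumption: a single capital letter is not an acronym
  AND character_length(t.tok_value) >= 2
ORDER BY 1, 2;
```

With the sample data this should leave BAR for id 1 and the two FOO tokens for id 3. The fooBAR row drops out because the token is not all upper case.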

Defining acronyms as all-uppercase words of 2 to 5 characters, using the regex '\b[A-Z]{2,5}\b':

WITH cte AS
 ( -- using @Andrew's Volatile Table 
   SELECT * 
   FROM vt
   -- only rows containing acronyms
   WHERE RegExp_Similar(col1, '.*\b[A-Z]{2,5}\b.*') = 1 
 )
SELECT
   outkey,
   tokenNum,
   CAST(RegExp_Substr(Token, '[A-Z]*') AS VARCHAR(5)) AS acronym -- 1st uppercase word 
   --,token
FROM TABLE
    ( RegExp_Split_To_Table
        ( cte.id,
          cte.col1,

             -- split before an acronym, might include additional characters after
             -- [^A-Z]*? = any number of non uppercase letters (removed)
             -- (?= ) = positive lookahead, i.e. check, but don't remove
          '[^A-Z]*?(?=\b[A-Z]{2,5}\b)',

          '' -- defaults to case sensitive
        ) RETURNS
            ( outKey INT,
              TokenNum INT,
              Token VARCHAR(30000) -- adjust to match the size of your input column 
            )
    ) AS t
WHERE acronym <> ''
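
If each row is only expected to hold a small, known number of acronyms, a simpler variant is to pull them out by occurrence with RegExp_Substr instead of splitting into rows. This is a sketch under that assumption, reusing the vt table and the same pattern. The column names first_acronym and second_acronym are illustrative only.

```sql
-- Sketch: one column per expected acronym, picked by occurrence number.
-- Arguments: source, pattern, start position, occurrence, match argument
-- ('c' requests case-sensitive matching).
SELECT
    id,
    col1,
    RegExp_Substr(col1, '\b[A-Z]{2,5}\b', 1, 1, 'c') AS first_acronym,
    RegExp_Substr(col1, '\b[A-Z]{2,5}\b', 1, 2, 'c') AS second_acronym
FROM vt
ORDER BY id;
```

An occurrence that does not exist should come back as NULL, so rows such as 'fooBAR' would show no acronym at all. The trade-off is that the number of columns is fixed up front, whereas the split-to-table approach handles any number of acronyms per row.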

I am not 100% sure what you are trying to do, but I think you have several options. For example (the snippets below use SQL Server T-SQL syntax):

Option 1) Check whether a known acronym (like BBQ) exists in the string (basic syntax):

SELECT CHARINDEX ('BBQ',@string)

In this case you would need a table of all the known acronyms you want to check for. You then test each of them against your string and return the ones that match.

DECLARE @string VARCHAR(100)
SET @string = 'we used the BBQ sauce on the Chicken'

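-- lookup table of known acronyms
CREATE TABLE [acrs] ([acronym] VARCHAR(10));
INSERT INTO [acrs] ([acronym]) VALUES ('BBQ'), ('IBM'), ('AMD'), ('ETC');

-- return every known acronym that occurs in the string
SELECT [acronym] FROM [acrs] WHERE CHARINDEX([acronym], @string) > 0;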

This should return: 'BBQ'

Option 2) Load all the upper-case characters into a temp table or similar for further logic and processing. I think you could use something like this:

DECLARE @string VARCHAR(100)
SET @string = 'we used the BBQ sauce on the Chicken'

-- make table of all Upper case letters and process individually
;WITH cte_loop(position, acrn)
 AS (
        SELECT 1, SUBSTRING(@string, 1, 1)
        UNION ALL
        SELECT position + 1, SUBSTRING(@string, position + 1, 1)
        FROM cte_loop
        WHERE position < LEN(@string) 
 )
SELECT position, acrn, ascii(acrn) AS [ascii]
FROM cte_loop
WHERE ascii(acrn) > 64 AND ascii(acrn) < 91 -- see the ASCII table for all codes

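This returns one row per upper-case letter, with its position in the string and its ASCII code. For the sample string those are B, B and Q (from BBQ) and C (from Chicken).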
