SQL query to extract words from a text field

I am writing a SQL script that has to output n words (n is user input) from a text field (varchar(max)), starting at a word (search term) specified in a different table. For example, if my text string is "The quick brown fox jumps over the lazy dog" and I want the 3 words that follow the word 'brown', my output should be "brown fox jumps over".

This operation needs to run against two different tables. The first table contains the text field from multiple documents. The second table consists of a number of search terms (the word 'brown' in the scenario above), starting from which we will generate the output. (Please let me know if I am unclear about this part.)

I have written some code, but I have gone wrong somewhere with the syntax or with my understanding of SQL. Below is a snippet of the code that is generating the error.

        SELECT @Loopcounter = 0,
        @Termcount = Count([termName])
        FROM #temp_table -- temp_table contains the list of search terms

        WHILE ( @Loopcounter < @Termcount )
        BEGIN 

        SELECT @SearchTerm = [termName] FROM #temp_table ORDER BY RowNum 
        OFFSET @Loopcounter ROWS
        FETCH NEXT 1 ROWS ONLY  -- to iterate all the search terms in temp_table on all documents

        SET @Spacecounter = 0;
        SET @Lastword = 0;

        SELECT 
            [DocID],
            [ExtractedText],
            (
            SELECT @Lastword = CHARINDEX( @SearchTerm, ExtractedText ) --position of search term in text field of a document
            FROM [EDDSDBO].[Document] t1 --contains the list of all documents and their text fields
            WHERE d.ArtifactID = t1.ArtifactID  -- to match the document id outside of the loop

            WHILE ( @Spacecounter <= @Proxnum ) --@Proxnum is the number of words required by the user
            -- this loop will find spaces after the search term and will give the position of @proxnum space after the search term
            BEGIN

            SELECT @Lastword = CHARINDEX( ' ', ExtractedText, @Lastword ) --to find the first space after the search term
            FROM [EDDSDBO].[Document] t2
            WHERE d.ArtifactID = t2.ArtifactID

            SET @Lastword = @Lastword + 1
            SET @Spacecounter = @Spacecounter + 1

            END

            SELECT SUBSTRING ( ExtractedText, CHARINDEX( @SearchTerm, ExtractedText ), @Lastword - CHARINDEX( @SearchTerm, ExtractedText ) )
            FROM [EDDSDBO].[Document] t3 --to extract the words from starting of search term till the numbers of words required
            WHERE d.ArtifactID = t3.ArtifactID
            )
            AS [After Hit]
        FROM [EDDSDBO].[Document] d
        WHERE CONTAINS ( ExtractedText, @SearchTerm) --to only search the document that contains that search term

        SET @Loopcounter = @Loopcounter + 1

        END

I know that there is a lot of script there with not much context, but if anyone can help me out with this, please post in your answers. I assume that I went wrong in calling the loop inside the select statement, but I did not see an alternative for that.

Let me know if you need more context in order to understand the requirements of this SQL script. THANKS!

This was not easy, and it took me longer than I expected. I believe SQL Server has a full-text search feature that you can install or enable. Nevertheless, here's an approach that should meet most of your needs. You'll have to tweak it here and there, but it works for the sample data I provide below.

Setup:

You never know your future requirements, so you might as well build your system to allow flexibility. So here's a search table that not only has search terms, but also has a column for how many words after the term you want to extract:

declare @searches table (
    termName nvarchar(50), 
    wordsAfter int, 
    rowNum int identity(1,1)
);

insert @searches values 
    ('brown', 3),
    ('green', 2);

And then here's a documents table that mimics what I believe your EDDSDBO.Document table looks like:

declare @documents table (
    docId int identity(1,1), 
    contents nvarchar(max)
);

insert @documents values 
    ('The quick brown fox jumps over the lazy dog'),
    ('The slow green turtle crawls under the brown and yellow giraffe');

Solution:

Okay, first you want to split your document contents into individual words:

declare @splittedWords table (
    docId int,
    wordNum int,
    word nvarchar(50)
);

with

    splitWords as (

        select      docId, 
                    contents, 
                    start = charindex(' ', contents) + 1,
                    conLen = len(contents),
                    wordNum = 1
        from        @documents

        union all
        select      docId,
                    ap.contents,
                    start = charindex(' ', ap.contents) + 1,
                    conLen = len(ap.contents),
                    wordNum = wordNum + 1
        from        splitWords
        cross apply (select contents = 
                        substring(contents, start, conLen - 1)
                    ) ap
        where       start > 1

    )

    insert      @splittedWords
    select      docId,
                wordNum,
                word = iif(
                    wordNum = max(wordNum) over(partition by docId), 
                    contents, 
                    substring(contents, 0, start - 1)
                )
    from        splitWords;
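
A caveat worth checking before running this split on real documents: a recursive CTE in SQL Server stops after 100 recursion levels by default, and this split recurses once per word, so a document with more than about 100 words would fail with a recursion-limit error. A minimal sketch of the workaround, using a hypothetical 150-word string @doc that is not part of the sample data above, is to add OPTION (MAXRECURSION 0) to the statement that reads from the CTE:

```sql
-- Hypothetical sample: 149 copies of 'word ' plus 'end' = 150 words,
-- which is more than the default recursion limit of 100.
declare @doc nvarchar(max) = replicate(cast(N'word ' as nvarchar(max)), 149) + N'end';

with splitWords as (

    -- anchor: the whole string is the remainder to split, word number 1
    select      contents = @doc,
                start    = charindex(' ', @doc) + 1,
                wordNum  = 1

    union all

    -- recursive step: drop the first word from the remainder
    select      ap.contents,
                start    = charindex(' ', ap.contents) + 1,
                wordNum  = wordNum + 1
    from        splitWords
    cross apply (select contents = substring(contents, start, len(contents))) ap
    where       start > 1

)

select      wordCount = max(wordNum)   -- 150 for the sample string
from        splitWords
option      (maxrecursion 0);          -- 0 removes the 100-level limit
```

In the split above, the hint would go at the very end of the insert ... select statement, after "from splitWords". Removing the limit entirely is a judgment call; a finite value such as 32767 (the maximum) keeps a safety net against runaway recursion.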

Now, for each search term, you want to get the position of the word in the contents, and the words that come after:

declare @filteredSplits table (
    search nvarchar(50),
    docId int,
    wordNum int,
    word nvarchar(50)
);

insert      @filteredSplits 
select      search = finds.word, 
            w.docId, 
            w.wordNum, 
            w.word 
from        @searches s 
join        @splittedWords finds on s.termName = finds.word
join        @splittedWords w
                on finds.docId = w.docId
                and w.wordNum between finds.wordNum and finds.wordNum + s.wordsAfter;

And finally, concatenate:

select      fs.search,
            fs.docId,
            extract = stuff((
                select      ' ' + sub.word
                from        @filteredSplits sub
                where       sub.docId = fs.docId
                and         sub.search = fs.search
                order by    sub.wordNum
                for xml     path('')
            ), 1, 1, '')
from        @filteredSplits fs
group by    fs.search, fs.docId
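
On SQL Server 2017 or later, STRING_AGG with WITHIN GROUP may be a simpler alternative to the STUFF ... FOR XML PATH idiom, and it avoids the XML entity-encoding that FOR XML PATH('') applies to characters such as & and <. The sketch below assumes a few hand-entered rows in @filteredSplits rather than the output of the earlier steps:

```sql
-- Hand-entered rows standing in for the output of the filtering step.
declare @filteredSplits table (
    search nvarchar(50),
    docId int,
    wordNum int,
    word nvarchar(50)
);

insert @filteredSplits values
    ('brown', 1, 3, 'brown'),
    ('brown', 1, 4, 'fox'),
    ('brown', 1, 5, 'jumps'),
    ('brown', 1, 6, 'over');

-- One row per (search, docId); words are joined in wordNum order.
select      fs.search,
            fs.docId,
            [extract] = string_agg(fs.word, ' ') within group (order by fs.wordNum)
from        @filteredSplits fs
group by    fs.search, fs.docId;
-- returns: brown | 1 | brown fox jumps over
```

The grouping is the same as in the FOR XML PATH version, so a term that occurs twice in one document would still be merged into a single row.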

Results:

+-------------------------------------------+
| search | docId | extract                  |
+-------------------------------------------+
| brown  |   1   | brown fox jumps over     |
| brown  |   2   | brown and yellow giraffe | 
| green  |   2   | green turtle crawls      | 
+-------------------------------------------+

This is easy using a word-level n-gram function. The code to create the function I'll use to solve your problem is at the end of this post. First, a quick wngrams2012 demo. This code splits your string into 4-grams (the number of words requested plus the search term itself):

Query:

DECLARE 
  @string VARCHAR(MAX) = 'The quick brown fox jumps over the lazy dog',
  @search VARCHAR(100) = 'Brown',
  @words  INT          = 3;

SELECT
  ng.ItemNumber,
  ng.ItemIndex,
  ng.ItemLength,
  ng.Item
FROM dbo.wngrams2012(@string, @words+1) AS ng;

Results:

ItemNumber  ItemIndex   ItemLength  Item
----------- ----------- ----------- ----------------------
1           1           20          The quick brown fox 
2           5           22          quick brown fox jumps 
3           11          21          brown fox jumps over 
4           17          19          fox jumps over the 
5           21          20          jumps over the lazy 
6           27          17          over the lazy dog

Now for your specific problem:

DECLARE 
  @string VARCHAR(MAX) = 'The quick brown fox jumps over the lazy dog',
  @search VARCHAR(100) = 'Brown',
  @words  INT          = 3;

SELECT TOP (1)
  ItemLength = ng.ItemLength, 
  Item       = ng.Item
FROM        (VALUES(LEN(@string), CHARINDEX(@search,@string))) AS s(Ln,Si)
CROSS APPLY (VALUES(s.Ln-s.Si+1))                              AS nsl(Ln)
CROSS APPLY (VALUES(SUBSTRING(@string,s.Si,nsl.Ln)))           AS ns(txt)
CROSS APPLY dbo.wngrams2012(ns.txt, @words+1)                  AS ng
WHERE       s.Si > 0
ORDER BY    ng.ItemNumber;

Results:

ItemLength   Item
------------ ----------------------
21           brown fox jumps over 

A couple of other examples. "Quick" and 1 returns:

ItemLength   Item
------------ --------------
12           quick brown 

"fox" and 4 returns:

ItemLength   Item
------------ -------------------------
24           fox jumps over the lazy 
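
One thing to keep in mind with the queries above: CHARINDEX(@search,@string) finds the first occurrence of the search text anywhere in the string, including inside a longer word. If whole-word matching matters, one possible variation (a sketch, not part of the original approach) is to pad both the search term and the string with spaces. It assumes words are separated by single spaces and ignores punctuation:

```sql
DECLARE
  @string VARCHAR(MAX) = 'The brownish fox saw a brown dog',
  @search VARCHAR(100) = 'brown';

SELECT
  -- 5: matches the start of 'brownish'
  PlainMatch     = CHARINDEX(@search, @string),
  -- 24: matches only the standalone word 'brown'
  WholeWordMatch = CHARINDEX(' ' + @search + ' ', ' ' + @string + ' ');
```

Because the string is padded with exactly one leading space, the position of the padded match equals the starting position of the word in the original string, so it can be used in place of s.Si without further adjustment.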

UPDATE: Against a table

I forgot to include this. Here are the search terms and the strings to search, in two separate tables:

DECLARE @sometable  TABLE(someid INT IDENTITY, someword VARCHAR(100));
DECLARE @sometable2 TABLE(someid INT IDENTITY, someword VARCHAR(MAX));
INSERT  @sometable(someword)  VALUES ('brown'),('fox'),('quick'),('zoo');
INSERT  @sometable2(someword) VALUES ('The quick brown fox jumps over the lazy dog'),
                            ('The brown lazy dog went to the zoo for a quick visit')
DECLARE @words INT = 4;

SELECT 
  SearchId     = t.someid,
  StringId     = t2.someid,
  Searchstring = t.someword,
  Item         = f.Item
FROM        @sometable  AS t
CROSS JOIN  @sometable2 AS t2
CROSS APPLY -- OUTER APPLY
(
  SELECT TOP (1) ng.Item
  FROM        (VALUES(LEN(t2.someword), CHARINDEX(t.someword,t2.someword))) AS s(Ln,Si)
  CROSS APPLY (VALUES(s.Ln-s.Si+1))                                     AS nsl(Ln)
  CROSS APPLY (VALUES(SUBSTRING(t2.someword,s.Si,nsl.Ln)))                  AS ns(txt)
  CROSS APPLY dbo.wngrams2012(ns.txt, @words+1)                         AS ng
  WHERE       s.Si > 0
  ORDER BY    ng.ItemNumber
) AS f;

Returns:

SearchId  StringId  Searchstring   Item
--------- --------- -------------- ------------------------------
1         1         brown          brown fox jumps over the 
2         1         fox            fox jumps over the lazy 
3         1         quick          quick brown fox jumps over 
1         2         brown          brown lazy dog went to 
4         2         zoo            zoo for a quick visit

Note that switching the CROSS APPLY to OUTER APPLY (see the comment in the query) will cause the query to also return rows when the search term is not found in the string; Item is NULL for those rows.
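
To see the difference between the two APPLY variants without creating the functions first, here is a small self-contained sketch. The table variables @terms and @docs and the column StartPos are hypothetical names chosen for illustration, and a plain CHARINDEX stands in for the n-gram query:

```sql
-- Hypothetical sample data.
DECLARE @terms TABLE(termId INT IDENTITY, term VARCHAR(100));
DECLARE @docs  TABLE(docId  INT IDENTITY, doc  VARCHAR(MAX));
INSERT  @terms(term) VALUES ('brown'),('zoo');
INSERT  @docs(doc)   VALUES ('The quick brown fox jumps over the lazy dog');

-- CROSS APPLY: the inner query returns no row for 'zoo', so only 'brown' is returned.
SELECT t.term, f.StartPos
FROM        @terms AS t
CROSS JOIN  @docs  AS d
CROSS APPLY (SELECT StartPos = CHARINDEX(t.term, d.doc)
             WHERE  CHARINDEX(t.term, d.doc) > 0) AS f;

-- OUTER APPLY: both terms are returned; StartPos is NULL for 'zoo'.
SELECT t.term, f.StartPos
FROM        @terms AS t
CROSS JOIN  @docs  AS d
OUTER APPLY (SELECT StartPos = CHARINDEX(t.term, d.doc)
             WHERE  CHARINDEX(t.term, d.doc) > 0) AS f;
```

OUTER APPLY is useful when you need a report of every term/document pair, including the misses; CROSS APPLY keeps the result limited to actual hits.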

Purely set-based, fully parallelizable (multithreadable), no loops/cursors/slow iteration.

The functions:

CREATE FUNCTION dbo.NGrams2B
(
  @string varchar(max), 
  @N      int
)
/****************************************************************************************
Purpose:
 A character-level N-Grams function that outputs a stream of tokens based on an input
 string (@string) up to 2^31-1 bytes (2 GB). For more 
 information about N-Grams see: http://en.wikipedia.org/wiki/N-gram. 

Compatibility: 
 SQL Server 2008+, Azure SQL Database

Syntax:
--===== Autonomous
 SELECT position, token FROM dbo.NGrams2B(@string,@N);

--===== Against a table using APPLY
 SELECT s.SomeID, ng.position, ng.token
 FROM dbo.SomeTable s
 CROSS APPLY dbo.NGrams2B(s.SomeValue,@N) ng;

Parameters:
 @string = varchar(max); the input string to split into tokens 
 @N      = int; the size of each token returned

Returns:
 Position = bigint; the position of the token in the input string
 token    = varchar(max); a @N-sized character-level N-Gram token

Developer Notes:
 1. Based on NGrams8k but modified to accept varchar(max)

 2. NGrams2B is not case sensitive

 3. Many functions that use NGrams2B will see a huge performance gain when the optimizer
    creates a parallel execution plan. One way to get a parallel query plan (if the 
    optimizer does not choose one) is to use make_parallel by Adam Machanic which can be 
    found here:
 sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

 4. Performs about 2-3 times slower than NGrams8k. Only use when you are sure that 
    NGrams8k will not suffice. 

 5. When @N is less than 1 or greater than the datalength of the input string then no 
    tokens (rows) are returned. If either @string or @N are NULL no rows are returned.
    This is a debatable topic but the thinking behind this decision is that: because you
    can't split 'xxx' into 4-grams, you can't split a NULL value into unigrams and you 
    can't turn anything into NULL-grams, no rows should be returned.

    For people who would prefer that a NULL input forces the function to return a single
    NULL output you could add this code to the end of the function:

    UNION ALL 
    SELECT 1, NULL
    WHERE NOT(@N > 0 AND @N <= DATALENGTH(@string)) OR (@N IS NULL OR @string IS NULL)

 6. NGrams2B can also be used as a tally table with the position column being your "N" 
    row. To do so use REPLICATE to create an imaginary string, then use NGrams2B to split
    it into unigrams then only return the position column. NGrams2B will get you up to 
    2^31-1 numbers. There will be no performance penalty for sorting by position in 
    ascending order but there is for sorting in descending order. To get the numbers in
    descending order without forcing a sort in the query plan use the following formula:
    N = <highest number>-position+1. 

 Pseudo Tally Table Examples:
    --===== (1) Get the numbers 1 to 100000 in ascending order:
    SELECT N = position FROM dbo.NGrams2B(REPLICATE(CAST(0 AS varchar(max)),100000),1);

    --===== (2) Get the numbers 1 to 100000 in descending order:
    DECLARE @maxN bigint = 100000;
    SELECT N = @maxN-position+1
    FROM dbo.NGrams2B(REPLICATE(CAST(0 AS varchar(max)),@maxN),1)
    ORDER BY position;

 7. NGrams2B is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

Usage Examples:
--===== Turn the string, 'abcd' into unigrams, bigrams and trigrams
 SELECT position, token FROM dbo.NGrams2B('abcd',1); -- unigrams (@N=1)
 SELECT position, token FROM dbo.NGrams2B('abcd',2); -- bigrams  (@N=2)
 SELECT position, token FROM dbo.NGrams2B('abcd',3); -- trigrams (@N=3)

---------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20150909 - Initial Development - Alan Burstein 
 Rev 01 - 20151029 - Added ISNULL logic to the TOP clause for both parameters: @string 
                     and @N. This will prevent a NULL string or NULL @N from causing an 
                     "improper value" to be passed to the TOP clause. - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH L1(N) AS 
(
  SELECT N 
  FROM (VALUES 
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),
   (0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(N)
), --216 values
iTally(N) AS 
(
  SELECT 
    TOP (
          ABS(CONVERT(BIGINT,
          (DATALENGTH(ISNULL(CAST(@string AS varchar(max)),'')) - (ISNULL(@N,1)-1)),0))
        )
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
  FROM L1 a CROSS JOIN L1 b CROSS JOIN L1 c CROSS JOIN L1 d
  --2,176,782,336 rows: enough to handle varchar(max) -> 2^31-1 bytes
)
SELECT
  position = N,
  token    = SUBSTRING(@string,N,@N)
FROM iTally
WHERE @N > 0 AND @N <= DATALENGTH(CAST(@string AS varchar(max))); 
GO

CREATE FUNCTION dbo.wngrams2012(@string varchar(max), @N bigint)
/*****************************************************************************************
Purpose:
 wngrams2012 accepts a varchar(max) input string (@string) and splits it into a contiguous 
  sequence of @N-sized, word-level tokens.

 Per Wikipedia (http://en.wikipedia.org/wiki/N-gram) an "n-gram" is defined as: 
 "a contiguous sequence of n items from a given sequence of text or speech. The items can
  be phonemes, syllables, letters, words or base pairs according to the application. "
------------------------------------------------------------------------------------------
Compatibility:
 SQL Server 2012+, Azure SQL Database
 2012+ because the function uses LEAD

Parameters:
 @string = varchar(max); input string to split into n-sized items
 @N      = bigint; number of items per row

Returns:
 itemNumber = bigint; the item's ordinal position inside the input string
 itemIndex  = int; the item's location inside the input string
 itemLength = the length of the token, including any trailing space
 item       = The @N-sized word-level token


Determinism:
  wngrams2012  is deterministic

    SELECT ROUTINE_NAME, IS_DETERMINISTIC 
    FROM information_schema.routines where ROUTINE_NAME = 'wngrams2012';

------------------------------------------------------------------------------------------
Syntax:
--===== Autonomous
 SELECT 
   ng.ItemNumber,
   ng.Item
 FROM dbo.wngrams2012(@string,@N) ng;

--===== Against another table using APPLY
 SELECT 
   t.someID,
   ng.ItemNumber,
   ng.Item
 FROM dbo.SomeTable t
 CROSS APPLY dbo.wngrams2012(t.SomeValue,@N) ng;
-----------------------------------------------------------------------------------------
Usage Examples:

--===== Example #1: Word-level Unigrams:
  SELECT
    ng.itemNumber,
    ng.itemIndex,
    ng.item
  FROM dbo.wngrams2012('One two three four words', 1) ng;

 --Results:
  ItemNumber  ItemIndex  Item
  1           1          One
  2           5          two
  3           9          three
  4           15         four
  5           20         words

--===== Example #2: Word-level Bi-grams:
  SELECT
    ng.itemNumber,
    ng.itemIndex,
    ng.item
  FROM dbo.wngrams2012('One two three four words', 2) ng;

 --Results:
  ItemNumber  ItemIndex  Item
  1           1          One two
  2           5          two three
  3           9          three four
  4           15         four words

--===== Example #3: Only the first two Word-level Bi-grams:
  -- Key: TOP(2) does NOT guarantee the correct result without an order by, which will
  -- degrade performance; see programmer note #5 below for details about sorting.

  SELECT 
    ng.ItemNumber, ng.ItemIndex, ng.ItemLength, ng.Item 
  FROM  dbo.wngrams2012('One two three four words',2) AS ng
  WHERE ng.ItemNumber < 3;

 --Results:
  ItemNumber  ItemIndex  ItemLength  Item
  ----------  ---------  ----------- ---------------------------------------------------
  1           1          8           One two 
  2           5          10          two three 
-----------------------------------------------------------------------------------------
Programmer Notes:
 1. This function requires NGrams2B (defined above), the varchar(max) version of
    NGrams8K, which is described here:
    http://www.sqlservercentral.com/articles/Tally+Table/142316/

 2. This function could not have been developed without what I learned reading "Reaping 
    the benefits of the Window functions in T-SQL"  by Eirikur Eiriksson
    The code looks different but, under the covers, WNGrams2012 
   is simply a slightly altered rendition of DelimitedSplit8K_LEAD. 

 3. Requires SQL Server 2012 or later

 4. wngrams2012 uses spaces (char(32)) as the delimiter; the text must be pre-formatted
    to address line breaks, carriage returns, multiple spaces, etc.

 5. Result set order does not matter and therefore no ORDER BY clause is required. The 
    *observed* default sort order is ItemNumber which means position is also sequential.
    That said, *any* ORDER BY clause will cause a sort in the execution plan. If you need
    to sort by position (ASC) or itemNumber (ASC), follow these steps to avoid a sort:

      A. In the function DDL, replace COALESCE/NULLIF for N1.N with the N. e.g. Replace
         "COALESCE(NULLIF(N1.N,0),1)" with "N" (no quotes)

      B. Add an ORDER BY position (which is logically identical to ORDER BY itemnumber).

      C. This will cause the position of the 1st token to be 0 instead of 1 when position
         is included in the final result-set. To correct this, simply use this formula:
         "COALESCE(NULLIF(position,0),1)" for "position". Note this example:

         SELECT
           ng.itemNumber,
           itemIndex = COALESCE(NULLIF(ng.itemIndex,0),1),
           ng.item
         FROM dbo.wngrams2012('One two three four words',2) ng
         ORDER BY ng.itemIndex;

-----------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20171116 - Initial creation - Alan Burstein
 Rev 01 - 20200206 - Misc updates - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
delim(RN,N) AS -- locate all of the spaces in the string
(
  SELECT 0,0 UNION ALL
  SELECT ROW_NUMBER() OVER (ORDER BY ng.position),   ng.position
  FROM dbo.ngrams2b(@string,1) ng
  WHERE ng.token = ' '
),
tokens(itemNumber,itemIndex,item,itemLength,itemCount) AS -- Create tokens (e.g. split string)
(
  SELECT 
    N1.RN+1,
    N1.N+1, -- change to N then ORDER BY position to avoid a sort
    SUBSTRING(v1.s, N1.N+1, LEAD(N1.N,@N,v2.l) OVER (ORDER BY N1.N)-N1.N),
    LEAD(N1.N,@N,v2.l) OVER (ORDER BY N1.N)-N1.N,
    v2.l-v2.sp-(@N-2) 
     -- count number of spaces in the string then apply the N-GRAM rows-(@N-1) formula
     -- Note: using (@N-2) to compensate for the extra row in the delim cte.
  FROM delim N1
  CROSS JOIN  (VALUES (@string)) v1(s)
  CROSS APPLY (VALUES (LEN(v1.s), LEN(REPLACE(v1.s,' ','')))) v2(l,sp)
)
SELECT 
  ItemNumber = ROW_NUMBER() OVER (ORDER BY (t.itemIndex)),
  ItemIndex  = t.itemIndex, --ISNULL(NULLIF(t.itemIndex,0),1),
  ItemLength = t.itemLength,
  Item       = t.item
FROM tokens t
WHERE @N > 0 AND t.itemNumber <= t.itemCount; -- startup predicate  
GO
