简体   繁体   中英

SQL Server 2012 : extract Regex groups

I have text in my database in Markdown format. I'd like to extract links and count the number of matching links I have. I can get a listing of text blocks that contain links using a query similar to this:

SELECT post_text
FROM posts p
WHERE p.body like '%\[%](http%)%' ESCAPE '\'

How do I go to the next step though, and just extract the link portion of the text (the part that is in the parenthesis)? If I can get this, I can count the number of times this specific link is in my dataset.

Some sample data:

"Visit [Google](http://google.com)"    -> Should return "http://google.com"
"Get an [iPhone](http://www.apple.com) (I like it better than Android)"   -> Should return "http://www.apple.com"
"[Example](http://example.com)"    -> Should return "http://example.com"
"This is a message"    -> Nothing to return on this one, no link
"I like cookies (chocolate chip)"  -> Nothing to return on this one, no link
"[Frank] says 'Hello'" -> Nothing to return on this one, no link

I am using SQL Server 2012 (if there are differences between versions in this regard).

Assuming the actual data is no more complex than the stated examples, this should work without resorting to RegEx:

DECLARE @posts TABLE
(
   post_id INT NOT NULL IDENTITY(1, 1),
   post_text NVARCHAR(4000) NOT NULL,
   body NVARCHAR(2048) NULL
);
INSERT INTO @posts (post_text, body) VALUES (N'first',
                                           N'Visit [Google](http://google.com)');
INSERT INTO @posts (post_text, body) VALUES (N'second',
                                           N'Get an [iPhone](http://www.apple.com)');
INSERT INTO @posts (post_text, body) VALUES (N'third',
                                           N'[Example](http://example.com)');
INSERT INTO @posts (post_text, body) VALUES (N'fourth',
                                           N'This is a message');
INSERT INTO @posts (post_text, body) VALUES (N'fifth',
                                           N'I like cookies (chocolate chip)');
INSERT INTO @posts (post_text, body) VALUES (N'sixth',
                                           N'[Frankie] says ''Relax''');
INSERT INTO @posts (post_text, body) VALUES (N'seventh',
                                           NULL);


SELECT p.post_text,
       SUBSTRING(
                  p.body,
                  CHARINDEX(N'](', p.body) + 2,
                  CHARINDEX(N')', p.body) - (CHARINDEX(N'](', p.body) + 2)
                ) AS [URL]
FROM   @posts p
WHERE  p.body like '%\[%](http%)%' ESCAPE '\';

Output:

post_text  URL
first      http://google.com
second     http://www.apple.com
third      http://example.com

PS:
If you really want to use Regular Expressions, they can only be done via SQLCLR. You can write your own or download pre-done libraries. I wrote one such library, SQL# , that has a Free version that includes the RegEx functions. But those should only be used if a T-SQL solution cannot be found, which so far is not the case here.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM