Full Text Search of URL field sql server

Objective: Return all URLs beginning with "https://mywebsite.domain.com/as/product/4/"

Given:

  • Applied full text search on URL field.
  • SQL Server edition: 2014.
  • 20+ Million rows

URL

https://mywebsite.domain.com/as/product/1/production
https://mywebsite.domain.com/as/product/2/items
https://mywebsite.domain.com/as/product/1/affordability
https://mywebsite.domain.com/as/product/3/summary
https://mywebsite.domain.com/as/product/4/schedule
https://mywebsite.domain.com/as/product/4/resources/summary

Query 1:

WHERE CONTAINS (URL, 'https://mywebsite.domain.com/as/product/4')

Result:

All records returned

Query 2 (added "*" after reading an MSDN article):

WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4*"')

Result:

No records returned

Any assistance would be greatly appreciated.

You can wrap the CONTAINS query in a subquery and apply LIKE to its result, so that only URLs starting with the prefix are returned:

SELECT * 
FROM (
SELECT * 
FROM myTable WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4/"')
) AS S1 
WHERE S1.URL LIKE 'https://mywebsite.domain.com/as/product/4/%' 

This way, the slow LIKE predicate is meant to run only against the smaller set of rows that CONTAINS has already matched.

EDIT1: (if WHERE CONTAINS (URL, '"https://mywebsite.domain.com/as/product/4/"') is not filtering Values)

After a lot of searching, the problem turns out to be the / character. The forward slash is not in the noise words file, but I suspect it is treated as a delimiter (word breaker) and is therefore not searchable.
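
One way to check how the word breaker actually splits the URL is sys.dm_fts_parser, which returns the terms full-text search produces for a given string. This is only a sketch and rests on assumptions that are not in the question: the English word breaker (LCID 1033), the system stoplist (stoplist id 0), and accent-insensitive parsing (last argument 0). Running it requires sysadmin membership.

```sql
-- Show the terms the English word breaker produces for the search phrase.
-- Assumptions: LCID 1033 (English), stoplist_id 0 (system stoplist),
-- accent_sensitivity 0 (insensitive).
SELECT occurrence, special_term, display_term
FROM sys.dm_fts_parser(
    N'"https://mywebsite.domain.com/as/product/4/"',
    1033,
    0,
    0)
ORDER BY occurrence;
```

If the output lists several short terms instead of one term for the whole URL, the phrase is being split at the punctuation, which would be consistent with the behaviour described above.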

EDIT2:

I found one suggested solution: / is treated as a word breaker for English, and you can change that in the registry.

  • Navigate to the registry keys HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceRoot>\MSSearch\Language\eng and HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft SQL Server\<InstanceRoot>\MSSearch\Language\enu
  • Clear the value of WBreakerClass.

After that, SQL Server treats https://mywebsite.domain.com/as/product/4 as a single word.

Note: both paths above assume that you are using English as the word breaker language.

Read more about word breakers in the MSDN topic on the subject.

Use the Like operator:

WHERE URL LIKE 'https://mywebsite.domain.com/as/product/4%'

The % is a wildcard that matches any sequence of characters, so this returns every row whose URL starts with the text in front of the %.
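
One caveat with LIKE prefixes: besides %, the characters _ and [ are also wildcards in T-SQL, and _ is common in URLs. The sketch below shows one way to escape them before appending the trailing %. The table name myTable is borrowed from the other answers, and the prefix containing my_product is a hypothetical value chosen only to show the escaping.

```sql
-- Hypothetical prefix that contains an underscore.
DECLARE @prefix varchar(500) = 'https://mywebsite.domain.com/as/my_product/4/';

-- Escape the escape character first, then the LIKE wildcards, then add the
-- trailing % that turns the literal prefix into a "starts with" pattern.
DECLARE @pattern varchar(1000) =
    REPLACE(REPLACE(REPLACE(REPLACE(@prefix,
        '\', '\\'),
        '%', '\%'),
        '_', '\_'),
        '[', '\[') + '%';

SELECT URL
FROM myTable
WHERE URL LIKE @pattern ESCAPE '\';
```

Without the escaping, my_product would also match URLs such as myXproduct, because _ stands for any single character.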

As long as you always search from the start of the string, the following ensures the optimizer can use an index. I assume URL is VARCHAR.

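Declare @p varchar(500) ='https://mywebsite.domain.com/as/product/4'

-- Find the character that sorts highest among the 256 single-byte characters
Declare @maxChar char(1);
select @maxChar = max(ch)
from (
    select top(256) ch = char(row_number() over(order by (select null)) - 1)
    from sys.all_objects) t;
select @maxChar; -- inspect the value

-- ..
-- Range predicate: URLs that start with @p and are longer than @p
WHERE URL > @p AND URL < @p + @maxChar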

When comparing strings, SQL Server pads the shorter one with trailing spaces. See https://support.microsoft.com/en-us/kb/316626 . According to http://www.ietf.org/rfc/rfc1738.txt , all characters allowed in a URL are greater than a space. So the search parameter, 'https://mywebsite.domain.com/as/product/4' for example, will be less than any URL that starts with this parameter and is longer than it.

For similar problems I usually pick between two solutions, depending on what matters most: performance, resources, concurrency and so on.

The LIKE operator can be your best friend, even with very big tables.

Indexing
First of all, you need to index your url column. With 20+ million records this is not a trivial task, and the index could cost you 1.5 - 2.0 GB of disk space, but your query will then return in milliseconds.

With an index on the searched column, LIKE FixedPattern+% is performed with an index seek, and you cannot improve on that any further.

First solution:

CREATE NONCLUSTERED INDEX [IX_URL] ON [url_table] ([url]);

DECLARE @Domain VARCHAR(100) = 'https://mywebsite.domain.com/'
DECLARE @Path VARCHAR(100) = 'as/product/'
DECLARE @Product VARCHAR(20) = '4'
DECLARE @LikeAll VARCHAR(100) = @Domain + @Path + @Product + '/%'

SELECT url
FROM url_table
WHERE url LIKE @LikeAll
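
To see whether the query above is really answered from the index, one option is to compare the logical reads with and without IX_URL in place. The sketch below assumes the url_table and IX_URL names from the snippet above and uses a literal pattern so that it can be run on its own.

```sql
-- Report the pages read per table for each statement in this session.
SET STATISTICS IO ON;

SELECT url
FROM url_table
WHERE url LIKE 'https://mywebsite.domain.com/as/product/4/%';

SET STATISTICS IO OFF;
```

The Messages output shows the logical reads for url_table. A seek on IX_URL should touch far fewer pages than a scan of all 20+ million rows, and the actual execution plan shows which operator was used.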

Second solution
The second option is a bit tricky but very effective.
You said the protocol and domain of the url are fixed and you need to search for something after them.
The following is a technique that you can fine-tune to match your needs.
The idea is to add a virtual (computed) column to your url table and then to index it.
This greatly reduces the index size and improves query performance, at the cost of a small computing overhead on insert/update.

-- 'https://mywebsite.domain.com/' is 29 characters, so the path starts at position 30
ALTER TABLE url_table ADD _path AS (SUBSTRING(url, 30, 4000));
CREATE NONCLUSTERED INDEX [IX_PATH] ON [url_table] ([_path]);

DECLARE @Domain VARCHAR(100) = 'https://mywebsite.domain.com/'
DECLARE @Path VARCHAR(100) = 'as/product/'
DECLARE @Product VARCHAR(20) = '4'
DECLARE @LikeMid VARCHAR(100) = @Path + @Product + '/%' 

select @Domain + _path -- pay attention!!
FROM url_table
WHERE _path LIKE @LikeMid

Note that we select @Domain + _path instead of url, to avoid table access and work only on index data.

If you need other columns from url_table, your best option is

declare @l table (id int primary key)
insert  into @l
select id 
from url_table 
where _path like @LikeMid

select url
from url_table
where id in (select id from @l)

very fast

Third solution
This is a variant of the second one.
In your example data the path contains /product/ followed by a number, which I assume is the product number. You could consider the following:

ALTER TABLE url_table ADD _product AS (cast(substring(url,nullif(CHARINDEX('/product/',url,29)+9,9), CHARINDEX('/',url,nullif(CHARINDEX('/product/',url,29)+9,9))-nullif(CHARINDEX('/product/',url,29)+9,9)) as bigint));
CREATE NONCLUSTERED INDEX [IX_PRODUCT] ON [url_table] ([_product]);

select id, url
from url_table 
where _product = 4

This produces a computed column holding the product number as an integer (bigint). The index will be only about 500 MB, and queries on integers will be very fast.
The overhead of selecting all columns from url_table is also very small, so you can SELECT * with almost no performance penalty.
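
Before altering a 20+ million row table, the extraction expression can be tried on a single literal. The sketch below applies the same CHARINDEX/SUBSTRING logic as the computed column to one of the sample URLs from the question. The variable @url and the alias pos are illustrative names only.

```sql
DECLARE @url varchar(500) =
    'https://mywebsite.domain.com/as/product/4/resources/summary';

-- pos is the position of the first character after '/product/'.
-- NULLIF turns "not found" (0 + 9 = 9) into NULL, as in the computed column.
SELECT CAST(SUBSTRING(@url,
                      p.pos,
                      CHARINDEX('/', @url, p.pos) - p.pos) AS bigint) AS product
FROM (SELECT NULLIF(CHARINDEX('/product/', @url, 29) + 9, 9) AS pos) AS p;
-- For this sample URL the result is 4.
```

Note that the expression expects a / after the product number. A URL that ends directly after the number gives a negative length and would make SUBSTRING fail, so it is worth checking the real data for such rows first.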

PS: You can then drop your full-text index and save space and resources.

SELECT * FROM myTable WHERE URL LIKE 'https://mywebsite.domain.com/as/product/4%'
