I have just over a million rows of URLs in one column. The column name is [url]
and the table name is redirects.
I'm running SQL Server 2014.
I need a way to extract the subdomain for each URL into a new column in a temp table.
Ideally, at the same time, I would select the distinct parameter names from the query string into another column and the parameter values into a third column.
My main concern is performance: I don't want to lock up the server while looping through a million rows.
I would be happy to run 3 queries to get the results if that makes more sense.
Examples of the column data:
https://www.google.com/ads/ga-audiences?v=1&aip=1&t=sr&_r=4&tid=UA-9999999-1&cid=9999107657.199999834&jid=472999996&_v=j66&z=1963999907
https://track.kspring.com/livin-like-a-star#pid=370&cid=6546&sid=front
So I end up with 3 columns in a temp table:
URL               | Param | Qstring
------------------+-------+-------------
www.google.com    | v     | 1
www.google.com    | aip   | 1
www.google.com    | t     | dc
www.google.com    | tid   | UA-1666666-1
www.google.com    | jid   | 472999996
track.kspring.com | pid   | 370
track.kspring.com | cid   | 6546
track.kspring.com | sid   | front
I've been looking at some examples to extract the domain name from a string but I don't have much experience with regex or string manipulation.
This is the kind of processing at which .NET CLR functions excel. Just use the Uri class
and parse away, from a CLR table-valued function (so that you can output more than one column in a single call).
Grab a copy of NGrams8K and you can do this:
-- sample data
declare @table table ([url] varchar(8000));
insert @table values
('https://www.google.com/ads/ga-audiences?v=1&aip=1&t=sr&_r=4&tid=UA-9999999-1&cid=9999107657.199999834&jid=472999996&_v=j66&z=1963999907'),
('https://track.kspring.com/livin-like-a-star#pid=370&cid=6546&sid=front');
declare @delimiter varchar(20) = '%[#?&]%'; -- customizable pattern: characters that start or separate parameters ('&' is needed to get one row per parameter, as in the results shown)
-- solution
select
  [url]   = substring([url], a1.startPos, a2.aLen-a1.startPos),
  [param] = substring(item, 1, charindex('=', split.item)-1),
  qString = substring(item, charindex('=', split.item)+1, 8000)
from @table t
cross apply (values (charindex('//',[url])+2)) a1(startPos)        -- first character of the host
cross apply (values (charindex('/',[url],a1.startPos))) a2(aLen)   -- first '/' after the host
cross apply
(
  select split.item
  from (values (len(substring([url], a2.aLen,8000)), 1)) as l(s,d)
  cross apply
  ( select -(l.d) union all
    select ng.position
    from dbo.NGrams8k(substring([url], a2.aLen,8000), l.d) as ng
    where token LIKE @delimiter
  ) as d(p)
  cross apply (values(replace(substring(substring([url], a2.aLen,8000), d.p+l.d,
      isnull(nullif(patindex('%'+@delimiter+'%',
      substring(substring([url], a2.aLen,8000), d.p+l.d, l.s)),0)-1, l.s+l.d)),
      '&',''))) split(item)
  where split.item like '%=%'   -- keep only name=value pairs
) split(item);
Results
url                 param   qString
------------------- ------- ---------------------------------
www.google.com      v       1
www.google.com      aip     1
www.google.com      t       sr
www.google.com      _r      4
www.google.com      tid     UA-9999999-1
www.google.com      cid     9999107657.199999834
www.google.com      jid     472999996
www.google.com      _v      j66
www.google.com      z       1963999907
track.kspring.com   pid     370
track.kspring.com   cid     6546
track.kspring.com   sid     front
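Since the question asks for the output in a temp table, the same charindex arithmetic can feed a SELECT DISTINCT ... INTO. The following is only a minimal sketch of the host column, with no dependency on NGrams8K. The temp table name #hosts, the column name [host], and the third sample URL (example.com, with no path) are made up for illustration. The isnull/nullif guard for URLs that have no '/' after the host is an addition of this sketch, not part of the query above, and it still assumes every URL contains '//'.

```sql
-- sample data: the two URLs from the question plus a hypothetical one with no path
declare @table table ([url] varchar(8000));
insert @table values
('https://www.google.com/ads/ga-audiences?v=1&aip=1&t=sr&_r=4&tid=UA-9999999-1&cid=9999107657.199999834&jid=472999996&_v=j66&z=1963999907'),
('https://track.kspring.com/livin-like-a-star#pid=370&cid=6546&sid=front'),
('https://example.com');

-- #hosts is a hypothetical name; drop it first so the batch can be re-run
if object_id('tempdb..#hosts') is not null drop table #hosts;

select distinct
  [host] = substring(t.[url], a1.startPos, a2.endPos - a1.startPos)
into #hosts
from @table t
-- position of the first character after '//'
cross apply (values (charindex('//', t.[url]) + 2)) a1(startPos)
-- position of the first '/' after the host; when there is none, use one past the end of the string
cross apply (values (isnull(nullif(charindex('/', t.[url], a1.startPos), 0),
                            len(t.[url]) + 1))) a2(endPos);

select [host] from #hosts order by [host];
```

Pointing the FROM clause at redirects instead of @table would run the same statement over the real data as a single set-based pass rather than a row-by-row loop. How it behaves at a million rows is untested here.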