
SQL Server: extract domain and params from 1 million rows into temp table

I have just over a million rows of URLs in one column. The column name is [url] and the table name is redirects.

I'm running SQL Server 2014.

I need a way to extract the subdomain (host name) of each URL into a new column in a temp table.

Ideally, at the same time, I would also select the distinct parameter names from the query string into a second column and the parameter values into a third column.

My main concern is performance: I don't want to lock up the server while working through a million rows.

I would be happy to run 3 queries to get the results if that makes more sense.

Examples of the column data:

https://www.google.com/ads/ga-audiences?v=1&aip=1&t=sr&_r=4&tid=UA-9999999-1&cid=9999107657.199999834&jid=472999996&_v=j66&z=1963999907

https://track.kspring.com/livin-like-a-star#pid=370&cid=6546&sid=front

So I end up with 3 columns in a temp table:

URL               | Param | Qstring
------------------+-------+-------------
www.google.com    | v     | 1
www.google.com    | aip   | 1
www.google.com    | t     | dc
www.google.com    | tid   | UA-1666666-1
www.google.com    | jid   | 472999996
track.kspring.com | pid   | 370
track.kspring.com | cid   | 6546
track.kspring.com | sid   | front

I've been looking at some examples that extract the domain name from a string, but I don't have much experience with regex or string manipulation.

This is the kind of processing at which .NET CLR functions excel. Parse each URL with the Uri class inside a CLR table-valued function, so that a single call can return more than one column.

Grab a copy of NGrams8K (the query below expects it to be installed as dbo.NGrams8k) and you can do this:

-- sample data
declare @table table ([url] varchar(8000));
insert @table values 
('https://www.google.com/ads/ga-audiences?v=1&aip=1&t=sr&_r=4&tid=UA-9999999-1&cid=9999107657.199999834&jid=472999996&_v=j66&z=1963999907'),
('https://track.kspring.com/livin-like-a-star#pid=370&cid=6546&sid=front');

-- customizable parameter for parsing parameter values:
-- '&' splits plain query strings; ';' also covers HTML-encoded '&amp;' separators
declare @delimiter varchar(20)  = '%[#?&;]%';

-- solution
select
  [url]   = substring([url], a1.startPos, a2.aLen-a1.startPos),
  [param] = substring(item, 1, charindex('=', split.item)-1),
  qString = substring(item, charindex('=', split.item)+1, 8000)
from @table t
cross apply (values (charindex('//',[url])+2)) a1(startPos)
cross apply (values (charindex('/',[url],a1.startPos)))  a2(aLen)
cross apply
(
  select split.item
  from (values (len(substring([url], a2.aLen,8000)), 1)) as l(s,d)
  cross apply
  ( select -(l.d) union all
    select ng.position
    from dbo.NGrams8k(substring([url], a2.aLen,8000), l.d) as ng
    where token LIKE @delimiter
  ) as d(p)
  cross apply (values(replace(substring(substring([url], a2.aLen,8000), d.p+l.d,
           isnull(nullif(patindex('%'+@delimiter+'%', 
           substring(substring([url], a2.aLen,8000), d.p+l.d, l.s)),0)-1, l.s+l.d)),
         '&amp',''))) split(item)
  where split.item like '%=%'
) split(item);

Results

url                 param   qString
------------------- ------- ---------------------------------
www.google.com      v       1
www.google.com      aip     1
www.google.com      t       sr
www.google.com      _r      4
www.google.com      tid     UA-9999999-1
www.google.com      cid     9999107657.199999834
www.google.com      jid     472999996
www.google.com      _v      j66
www.google.com      z       1963999907
track.kspring.com   pid     370
track.kspring.com   cid     6546
track.kspring.com   sid     front
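
If only the host name is needed (the "3 queries" option from the question), it can be pulled out with built-in string functions alone and written straight into a temp table. The following is a minimal sketch, not a tuned solution: the table variable @redirects and the temp table #hosts are hypothetical stand-ins for the real redirects table, and it assumes every URL contains a scheme followed by '//'. Unlike the query above, it also tolerates URLs that have no path after the host.

```sql
-- Hypothetical sample data standing in for the redirects table
declare @redirects table ([url] varchar(8000));
insert @redirects values
('https://www.google.com/ads/ga-audiences?v=1&aip=1'),
('https://track.kspring.com/livin-like-a-star#pid=370'),
('https://example.com');  -- no path after the host

if object_id('tempdb..#hosts') is not null drop table #hosts;

-- Assumption: each URL contains '//' after the scheme
select distinct
  [host] = substring(r.[url], s.startPos, e.endPos - s.startPos)
into #hosts
from @redirects r
-- first character of the host
cross apply (values (charindex('//', r.[url]) + 2)) s(startPos)
-- position of the first '/', '?' or '#' after the host,
-- or one past the end of the string when none is present
cross apply (values (isnull(
    nullif(patindex('%[/?#]%', substring(r.[url], s.startPos, 8000)), 0)
      + s.startPos - 1,
    len(r.[url]) + 1))) e(endPos);

select [host] from #hosts;
```

For the three sample rows this leaves www.google.com, track.kspring.com and example.com in #hosts. Because it is a single set-based statement with no loop, it may be a reasonable first step before splitting out the parameters.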
