
Extracting parts of a string in SQL

I have URL data in a table. I would like to create a view that shows the top-level domain (TLD), the second-level domain (SLD), and the subdomain of each URL. How can I extract these in ANSI SQL? The database I am using supports only ANSI SQL and lacks convenience functions such as REVERSE.

Here are the parts I want to extract:

  TLD       = the top-level domain (.com, .org, .info, .us)
  SLD       = the second-level domain (twitter, yahoo, facebook, google), the part just before the TLD
  SUBDOMAIN = the subdomain (www, search.google, search.espn), the leading part of the URL; this is the tricky one

Here is the logic I am using, but I am unable to get the subdomain properly. I would like to reverse the string and take the remainder after extracting the TLD and SLD, but Vertica doesn't support a REVERSE function.

Here is the query and sample data (note: SPLIT_PART splits the string at the character specified):

select COALESCE(SPLIT_PART(URL, '.', 3), SPLIT_PART(URL, '.', 2)) as tld,
       SPLIT_PART(URL, '.', 2) as sld,
       SPLIT_PART(URL, '.', 1) as subdomain
from URL_table

The table has two columns, date and URL. Here are the example URLs:

search.mywebsearch.com   (TLD = com, SLD = mywebsearch, subdomain = search)
search.earthlink.net     
topix.com
main.welcomescreen.intrepid.com
ad.yieldmanager.com
google.com
news.google.com

This is a really hard thing to do right, especially if your data is noisy, as is the case with big data.

Can you ever get http:// as a prefix? What about sites like www.sub.dom.com? Is everything after the .TLD scrubbed out already?

For these reasons, we were wary about trying to implement the splitting in SQL. Instead, we used Vertica's user-defined transform function (UDTF) feature and wrote a splitter in C++. We would rather not have had to, but we just don't trust SQL to be robust enough.
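
For input that is already clean, such as the bare hostnames in the question, a SQL-only version may still be workable by counting the dots instead of reversing the string. The following is a rough sketch, not a tested solution. It assumes the URL column holds only a hostname (no http:// prefix, port, or path), that the TLD is always the last dot-separated label, and that LENGTH, REPLACE, and SUBSTR are available alongside SPLIT_PART. The alias n_dots is introduced here purely for illustration.

```sql
-- Sketch only: assumes URL is a bare hostname and the TLD is a single label.
SELECT URL,
       -- The last label is the TLD.
       SPLIT_PART(URL, '.', n_dots + 1) AS tld,
       -- The label just before the TLD is the SLD (NULL if there is no dot).
       CASE WHEN n_dots >= 1
            THEN SPLIT_PART(URL, '.', n_dots)
       END AS sld,
       -- Everything before ".sld.tld" is the subdomain (NULL if there is none).
       CASE WHEN n_dots >= 2
            THEN SUBSTR(URL, 1,
                        LENGTH(URL)
                        - LENGTH(SPLIT_PART(URL, '.', n_dots))
                        - LENGTH(SPLIT_PART(URL, '.', n_dots + 1))
                        - 2)  -- the two dots around the SLD
       END AS subdomain
FROM (
    -- Count the dots by comparing lengths with and without them.
    SELECT URL,
           LENGTH(URL) - LENGTH(REPLACE(URL, '.', '')) AS n_dots
    FROM URL_table
) AS t;
```

Under those assumptions, main.welcomescreen.intrepid.com would split into subdomain main.welcomescreen, SLD intrepid, and TLD com, while google.com would have a NULL subdomain. Hostnames under multi-label suffixes such as co.uk would be misclassified, which is one example of the noise that makes a dedicated splitter attractive.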
