Extracting parts of a string in SQL

I have URL data in a table. I would like to create a view that shows the top-level domain (TLD), the second-level domain (SLD), and the subdomain of each URL. How can I extract these in ANSI SQL? The database I am using supports only ANSI SQL and lacks convenience functions such as REVERSE.

Here are the parts I want to extract:

  TLD       = the top-level domain (.com, .org, .info, .us)
  SLD       = the second-level domain (twitter, yahoo, facebook, google); second part of the URL
  SUBDOMAIN = the subdomain (www, search.google, search.espn); first part of the URL, and the tricky one

Here is the logic I am using, but I am unable to get the subdomain correctly. I would like to reverse the string and take the remainder after extracting the TLD and SLD, but Vertica doesn't support a REVERSE function.

Here is the query and sample data (note: SPLIT_PART splits the string at the specified character and returns the requested part):

select COALESCE(SPLIT_PART(URL, '.', 3), SPLIT_PART(URL, '.', 2)) as tld,
       SPLIT_PART(URL, '.', 2) as sld,
       SPLIT_PART(URL, '.', 1) as subdomain
from URL_table

The table has two columns, date and URL. Here are the example URLs:

search.mywebsearch.com   (TLD = com, SLD = mywebsearch, subdomain = search)
search.earthlink.net     
topix.com
main.welcomescreen.intrepid.com
ad.yieldmanager.com
google.com
news.google.com

This is really hard to do right, especially if your data is noisy, as is the case with big data.

Can a URL ever have http:// as a prefix? What about sites like www.sub.dom.com? Has everything after the TLD been scrubbed out already?

For these reasons, we were wary of implementing the splitting in SQL. Instead, we used Vertica's UDTF feature and wrote a splitter in C++. We would rather not have had to, but we just don't trust SQL to be robust enough for this.
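
To make the idea concrete, here is a minimal sketch of the kind of splitting logic an external function could perform. It is not our C++ splitter and does not use the Vertica UDTF API; it is plain Python for brevity, and the names split_host and HostParts are made up for illustration. It assumes the TLD is always the last dot-separated label, so multi-label suffixes such as co.uk would be split incorrectly.

```python
from typing import NamedTuple


class HostParts(NamedTuple):
    """Hypothetical result type: the three pieces asked for in the question."""
    subdomain: str
    sld: str
    tld: str


def split_host(url: str) -> HostParts:
    """Split a hostname into (subdomain, sld, tld), working from the right.

    Assumption: the TLD is the last label and the SLD the one before it.
    """
    host = url.strip().lower()
    # Drop an optional scheme such as "http://".
    if "://" in host:
        host = host.split("://", 1)[1]
    # Drop any path, query string, fragment, or port after the hostname.
    for sep in ("/", "?", "#", ":"):
        host = host.split(sep, 1)[0]
    # Ignore empty labels caused by stray or doubled dots.
    labels = [label for label in host.split(".") if label]
    if len(labels) < 2:
        # Fewer than two labels: nothing reliable to report.
        return HostParts("", "", "")
    # Everything left of the SLD is treated as the subdomain.
    return HostParts(".".join(labels[:-2]), labels[-2], labels[-1])


if __name__ == "__main__":
    for sample in ("search.mywebsearch.com", "topix.com",
                   "main.welcomescreen.intrepid.com"):
        print(sample, split_host(sample))
```

Working from the right-hand end is what removes the need for REVERSE: the last two labels are the TLD and SLD, and whatever remains on the left is the subdomain, however many labels it has.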
