
How to traverse a table with 35 million rows - MySQL

I have a very large MySQL table that stores domains and subdomains. Its CREATE TABLE statement is shown below:

CREATE TABLE `domain` (
  `domain_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `domain_name` varchar(255) COLLATE utf8_turkish_ci NOT NULL,
  PRIMARY KEY (`domain_id`),
  UNIQUE KEY `dn` (`domain_name`),
  KEY `ind_domain_name` (`domain_name`)
) ENGINE=InnoDB AUTO_INCREMENT=78364364 DEFAULT CHARSET=utf8 COLLATE=utf8_turkish_ci;

It stores values like:

domain_id | domain_name
1         | a.example.com
2         | b.example.com
3         | example.com
4         | facebook.com
5         | a.facebook.com
6         | google.com

I want to find the subdomains of each top domain and then match each subdomain with its parent domain.

For example, a.example.com and b.example.com are subdomains of example.com, so in a new column named parent_domain_id I will store example.com's domain_id. (If a domain is a top domain, its parent_domain_id will be 0.)
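To make the parent rule concrete, here is a minimal PHP sketch that lists the names that could be the parent of a given domain, nearest first. The function name `candidateParents` is hypothetical, and the sketch assumes the last label alone (such as "com") is never a parent, which is not true for suffixes like co.uk.

```php
<?php
// Sketch: list the names that could be the parent of a domain, nearest first.
// Assumption: a single trailing label such as "com" is never a parent itself.
function candidateParents(string $domainName): array
{
    $labels = explode('.', $domainName);
    $candidates = [];
    // Drop one leading label at a time and stop before the bare last label.
    for ($i = 1; $i < count($labels) - 1; $i++) {
        $candidates[] = implode('.', array_slice($labels, $i));
    }
    return $candidates;
}

print_r(candidateParents('a.example.com')); // one candidate: example.com
print_r(candidateParents('example.com'));   // no candidates: a top domain
```

The first candidate that actually exists as a row in the table would be the parent; if none exists, the domain counts as a top domain.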

I work with PHP and MySQL, and my machine has 8 GB of RAM, so I have some hardware limitations. Is there a trick for checking a huge data set row by row with PHP?
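One common way to keep memory use flat is to read the table in chunks ordered by the primary key and remember the last id seen. The sketch below only illustrates that idea under stated assumptions: `iterateRows` and `$fetchPage` are hypothetical names, and an in-memory array stands in for the database so the example runs on its own.

```php
<?php
// Sketch: walk a large table in primary-key order, one chunk at a time.
// $fetchPage is a hypothetical callable: given the last seen id and a limit,
// it returns up to $limit rows with domain_id > $afterId, sorted by domain_id.
function iterateRows(callable $fetchPage, int $chunkSize = 10000): Generator
{
    $afterId = 0;
    while (true) {
        $rows = $fetchPage($afterId, $chunkSize);
        if (count($rows) === 0) {
            return; // no rows left
        }
        foreach ($rows as $row) {
            yield $row;
            $afterId = (int) $row['domain_id'];
        }
    }
}

// Stand-in for the database, using the sample rows from the question.
$table = [
    ['domain_id' => 1, 'domain_name' => 'a.example.com'],
    ['domain_id' => 2, 'domain_name' => 'b.example.com'],
    ['domain_id' => 3, 'domain_name' => 'example.com'],
    ['domain_id' => 4, 'domain_name' => 'facebook.com'],
    ['domain_id' => 5, 'domain_name' => 'a.facebook.com'],
    ['domain_id' => 6, 'domain_name' => 'google.com'],
];
$fetchPage = function (int $afterId, int $limit) use ($table): array {
    $rest = array_filter($table, function (array $r) use ($afterId): bool {
        return $r['domain_id'] > $afterId;
    });
    return array_slice(array_values($rest), 0, $limit);
};

foreach (iterateRows($fetchPage, 2) as $row) {
    echo $row['domain_id'], ' ', $row['domain_name'], "\n";
}
```

Against the real table, `$fetchPage` would presumably wrap a query such as `SELECT domain_id, domain_name FROM domain WHERE domain_id > ? ORDER BY domain_id LIMIT ?`. This only suits passes that can work in primary-key order; a pass that needs a different global order has to get that order from the query itself.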

EDIT: for most domain names this should work.

You can fetch a sorted list of all the domains (up to a fixed name length), ordered by reversed domain name and then by increasing length.

I would start from @RyanVincent's comment.

select domain_id, domain_name, reverse(domain_name) as reversed
from domain
order by 
rpad(reverse(domain_name),130,' '),
length(domain_name),
reverse(domain_name), 
domain_id

The goal is to read the rows one by one in an order where each parent domain comes before its subdomains: sorted by reversed name, with shorter names first.

Your example would give:

        google.com -> moc.elgoog
      a.google.com -> moc.elgoog.a
   xy.a.google.com -> moc.elgoog.a.yx
      b.google.com -> moc.elgoog.b
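The same ordering can be reproduced in plain PHP, which is handy for checking it on a few names. This is only a sketch: `compareReversed` is a hypothetical helper, it assumes ASCII names (strrev works on bytes), and it compares label by label instead of comparing the padded reversed string as the query does. In a byte-wise comparison '-' sorts before '.', so a name like x-google.com could otherwise land between google.com and a.google.com; comparing labels avoids that.

```php
<?php
// Sketch: order domain names by their reversed labels so that a parent
// always sorts before its subdomains. Assumes ASCII names.
function compareReversed(string $a, string $b): int
{
    $pa = explode('.', strrev($a));
    $pb = explode('.', strrev($b));
    $n = min(count($pa), count($pb));
    for ($i = 0; $i < $n; $i++) {
        $c = strcmp($pa[$i], $pb[$i]);
        if ($c !== 0) {
            return $c;
        }
    }
    // All shared labels are equal: the shorter name (the parent) goes first.
    return count($pa) <=> count($pb);
}

$names = ['b.google.com', 'xy.a.google.com', 'google.com', 'a.google.com'];
usort($names, 'compareReversed');
foreach ($names as $name) {
    printf("%18s -> %s\n", $name, strrev($name));
}
```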

In PHP:

  $currentDomains = array();
  /*
     Stack of array(label, id) pairs describing the previous row, in
     reversed-label order. While inside xy.a.google.com it holds:
       array('moc', 0)       no row exists for "com"
       array('elgoog', id)   id of google.com
       array('a', id)        id of a.google.com
       array('yx', id)       id of xy.a.google.com

     It never gets longer than the number of labels in a name, so it is
     usually a very short array. Pairs are used instead of label => id keys
     because a name such as a.a.com repeats a label.
   */

  $sql = "select domain_id, domain_name, reverse(domain_name) as reversed\n"
      . " from domain \n"
      . " order by  \n"
      . " rpad(reverse(domain_name),130,' '), \n"
      . " length(domain_name), \n"
      . " reverse(domain_name),  \n"
      . " domain_id"
      ;

  doSelect($sql); // placeholder helper: runs the query

  while($row = getRow()){ // placeholder helper: returns the next row or false
    $parts = explode('.', $row["reversed"]);
    $rid = $row["domain_id"];

    $matchedDomains = array();
    $parentId = $rid; // if no parent is found, the row points to itself

    $i = 0;
    // 1. keep the leading labels shared with the previous row
    foreach($currentDomains as $entry){
      list($name, $cid) = $entry;
      if(isset($parts[$i]) && $parts[$i] === $name){
        $matchedDomains[] = $entry;
        if($cid > 0){
          $parentId = $cid; // deepest shared label that exists as a row
        }
        $i++;
      }
      else{
        break;
      }
    }
    // 2. new inner labels; no row matches those
    while ($i < count($parts)-1){
      $matchedDomains[] = array($parts[$i], 0);
      $i++;
    }
    // the last label stands for the current row itself
    $matchedDomains[] = array($parts[count($parts)-1], $rid);
    $currentDomains = $matchedDomains;
    // printed for illustration; run the real update as a PDO prepared statement
    print(" update domain set parent_domain_id = $parentId where domain_id = $rid\n");
  }

So google.com is its own parent domain, a.google.com has google.com as its parent, and so on. To store 0 for top domains instead, as the question asks, initialize $parentId to 0 rather than $rid.
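To try the pass without a database, here is a self-contained sketch that runs it over a small list that is already sorted by reversed name. The function name `assignParents` and the ids 7 to 9 are made up for illustration; only google.com's id 6 comes from the question.

```php
<?php
// Sketch: single pass over rows already sorted by reversed name.
// $rows is a list of [domain_id, domain_name]; returns domain_id => parent id.
function assignParents(array $rows): array
{
    $stack = [];   // list of [reversed label, row id or 0] for the previous row
    $parents = [];
    foreach ($rows as [$id, $name]) {
        $parts = explode('.', strrev($name));
        $parentId = $id; // no ancestor row found: the row is its own parent
        $next = [];
        $i = 0;
        foreach ($stack as [$label, $labelId]) {
            if (!isset($parts[$i]) || $parts[$i] !== $label) {
                break;
            }
            $next[] = [$label, $labelId];
            if ($labelId > 0) {
                $parentId = $labelId; // deepest shared label that has a row
            }
            $i++;
        }
        for (; $i < count($parts) - 1; $i++) {
            $next[] = [$parts[$i], 0]; // inner label with no row of its own
        }
        $next[] = [$parts[count($parts) - 1], $id];
        $stack = $next;
        $parents[$id] = $parentId;
    }
    return $parents;
}

$rows = [
    [6, 'google.com'],
    [7, 'a.google.com'],      // hypothetical id
    [8, 'xy.a.google.com'],   // hypothetical id
    [9, 'b.google.com'],      // hypothetical id
];
print_r(assignParents($rows)); // 6 => 6, 7 => 6, 8 => 7, 9 => 6
```

Because only the stack for the previous row is kept, memory use does not grow with the number of rows, as long as the result set itself is streamed rather than loaded at once.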
