Import new XML data into MySQL table without affecting existing records

I have a very large (2.7 MB) XML file with the following structure:

<?xml version="1.0"?>

<Destinations>

  <Destination>
    <DestinationId>W4R1FG</DestinationId>
    <Country>Pakistan</Country>
    <City>Karachi</City>
    <State>Sindh</State>
  </Destination>

  <Destination>
    <DestinationId>D2C2FV</DestinationId>
    <Country>Turkey</Country>
    <City>Istanbul</City>
    <State>Istanbul</State>
  </Destination>

  <Destination>
    <DestinationId>5TFV3E</DestinationId>
    <Country>Canada</Country>
    <City>Toronto</City>
    <State>Ontario</State>
  </Destination>  

  ... ... ...

</Destinations>

And a MySQL table `destinations` like this:

+---+--------------+----------+---------+----------+
|id |DestinationId |Country   |City     |State     |
+---+--------------+----------+---------+----------+
|1  |W4R1FG        |Pakistan  |Karachi  |Sindh     |
+---+--------------+----------+---------+----------+
|2  |D2C2FV        |Turkey    |Istanbul |Istanbul  |
+---+--------------+----------+---------+----------+
|3  |5TFV3E        |Canada    |Toronto  |Ontario   |
+---+--------------+----------+---------+----------+
|.  |......        |......    |.......  |.......   |
+---+--------------+----------+---------+----------+

Now I want to process the XML and check each destination record against the MySQL table. I only need to compare the DestinationId of each record to see whether it already exists in the table. If it does, skip that record and move on; if it does not, execute an INSERT query to add it.
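For a single row, that check-then-insert can be expressed in one statement. A minimal sketch, using the first sample record from the XML above:

```sql
-- Insert one destination only if its DestinationId is not present yet.
-- The literal values are the first sample record from the XML above.
INSERT INTO destinations (DestinationId, Country, City, State)
SELECT 'W4R1FG', 'Pakistan', 'Karachi', 'Sindh'
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1 FROM destinations WHERE DestinationId = 'W4R1FG'
);
```

When the inner SELECT finds a match, the outer SELECT produces zero rows and nothing is inserted, so no auto-increment value is consumed for duplicates. A UNIQUE index on DestinationId is still advisable as a safety net against concurrent inserts.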

I first tried to accomplish this with a PHP foreach loop, but since the data is so large, it caused serious performance and speed problems. I then came up with a MySQL stored procedure approach like this:

DELIMITER $$

USE `destinations`$$

DROP PROCEDURE IF EXISTS `p_import_destinations`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `p_import_destinations`(
    p_xml                     TEXT
)
BEGIN
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;
    DECLARE v_row_count INT UNSIGNED;
    DECLARE v_xpath_row VARCHAR(255);

    -- calculate the number of row elements.
    SET v_row_count := extractValue(p_xml,'count(/Destinations/Destination)');

    -- loop through all the row elements
    WHILE v_row_index < v_row_count DO        
        SET v_row_index := v_row_index + 1;
        SET v_xpath_row := CONCAT('/Destinations/Destination[',v_row_index,']');

        INSERT IGNORE INTO destinations VALUES (
            NULL,
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::Country')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::City')),
            extractValue(p_xml,CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;

END$$  

DELIMITER ;

Query to call this procedure:

SET @xml := LOAD_FILE('C:/Users/Muhammad Ali/Desktop/dest.xml'); 
CALL p_import_destinations(@xml);

This worked perfectly, but I am still not sure about this approach's scalability, performance and speed. Also, the IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment values. For example, while processing a duplicate record it will not insert it into the table (which is good), but it still burns the next auto-increment value, say 3307, so the next non-duplicate record is inserted with id 3308, leaving a gap. That does not seem right.

Any other approach to meet this requirement would be much appreciated. And please advise whether I am OK to go on with this solution. If not, why?

Just remember, I am dealing with a very large amount of data.

This worked perfectly, but I am still not sure about this approach's scalability, performance and speed.

Measure the speed and test how it scales; then you can be sure. Ask again if you find a problem that would actually hurt you in your scenario, but make the performance or scalability problem concrete first. Most likely such a problem has already been asked and answered, if not here on Stack Overflow then on the DBA site: https://dba.stackexchange.com/

And the IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment values.

This is similar. If those gaps are a problem for you, that usually points to a flaw in your database design, because such gaps are normally meaningless (compare: How to fill in the "holes" in auto-increment fields?).

However, that doesn't mean others haven't had the same problem. You can find a lot of material on it, including "tricks" for preventing gaps on specific versions of your database server. But honestly, I wouldn't care about the gaps. The contract is only that the identity column has a unique value, and that's all.

In any case, both for performance and for the IDs: why not split the processing up? First import the XML into a staging table, then you can easily delete every row you don't want to keep from that staging table, and finally insert into the destination table as needed.
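If the server allows it, the XML-to-staging-table step does not even need an XPath cursor loop: MySQL 5.5+ can bulk-load the file in one statement. A sketch, using the file path from the question and assuming a staging table like the destinations_temp created in the solution below:

```sql
-- Bulk-load every <Destination> element into the staging table; child
-- elements (DestinationId, Country, City, State) are matched to the
-- staging table's columns by name, and `id` is left to auto-increment.
LOAD XML LOCAL INFILE 'C:/Users/Muhammad Ali/Desktop/dest.xml'
INTO TABLE destinations_temp
ROWS IDENTIFIED BY '<Destination>';
```

Note that LOAD XML cannot be called from inside a stored procedure, and the server's local_infile / secure_file_priv settings may need to permit reading the file, so treat this as an alternative to the loop rather than a drop-in replacement.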

Solved this using the alternative logic described below.

DELIMITER $$

USE `test`$$

DROP PROCEDURE IF EXISTS `import_destinations_xml`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `import_destinations_xml`(
    path VARCHAR(255), 
    node VARCHAR(255)
)

BEGIN
    DECLARE xml_content TEXT;
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;   
    DECLARE v_row_count INT UNSIGNED;  
    DECLARE v_xpath_row VARCHAR(255); 

    -- set xml content.
    SET xml_content = LOAD_FILE(path);

    -- calculate the number of row elements.   
    SET v_row_count  = extractValue(xml_content, CONCAT('count(', node, ')')); 

    -- create a temporary destinations table
    DROP TABLE IF EXISTS `destinations_temp`;
    CREATE TABLE `destinations_temp` (
      `id` INT(11) NOT NULL AUTO_INCREMENT,
      `DestinationId` VARCHAR(32) DEFAULT NULL,
      `Country` VARCHAR(255) DEFAULT NULL,
      `City` VARCHAR(255) DEFAULT NULL,
      `State` VARCHAR(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
    ) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;  

    -- loop through all the row elements    
    WHILE v_row_index < v_row_count DO                
        SET v_row_index = v_row_index + 1;        
        SET v_xpath_row = CONCAT(node, '[', v_row_index, ']');
        INSERT INTO destinations_temp VALUES (
            NULL,
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::Country')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::City')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;

    -- delete existing records from temporary destinations table
    DELETE FROM destinations_temp WHERE DestinationId IN (SELECT DestinationId FROM destinations);

    -- insert remaining (unmatched) records from temporary destinations table to destinations table
    INSERT INTO destinations (DestinationId, Country, City, State) 
    SELECT DestinationId, Country, City, State 
    FROM destinations_temp;

    -- creating a log file    
    SELECT  *
    INTO OUTFILE 'C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/log/destinations_log.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\r\n'
    FROM `destinations_temp`;

    -- removing temporary destinations table
    DROP TABLE destinations_temp;

END$$

DELIMITER ;

Query to call this procedure:

CALL import_destinations_xml('C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/dest.xml', '/Destinations/Destination');
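As a possible refinement of the procedure above, the DELETE-then-INSERT pair can be collapsed into a single anti-join insert, which leaves the staging table untouched (useful if the log file should record all imported rows, not just the unmatched ones):

```sql
-- Insert only staging rows whose DestinationId has no match in the
-- destinations table; matched rows simply fall out of the join result.
INSERT INTO destinations (DestinationId, Country, City, State)
SELECT t.DestinationId, t.Country, t.City, t.State
FROM destinations_temp AS t
LEFT JOIN destinations AS d ON d.DestinationId = t.DestinationId
WHERE d.DestinationId IS NULL;
```

With a UNIQUE index on destinations.DestinationId, this also stays correct if two imports run close together.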
