
Import new XML data into a MySQL table without affecting existing records

I have a very large (2.7 MB) XML file with the following structure:

<?xml version="1.0"?>

<Destinations>

  <Destination>
    <DestinationId>W4R1FG</DestinationId>
    <Country>Pakistan</Country>
    <City>Karachi</City>
    <State>Sindh</State>
  </Destination>

  <Destination>
    <DestinationId>D2C2FV</DestinationId>
    <Country>Turkey</Country>
    <City>Istanbul</City>
    <State>Istanbul</State>
  </Destination>

  <Destination>
    <DestinationId>5TFV3E</DestinationId>
    <Country>Canada</Country>
    <City>Toronto</City>
    <State>Ontario</State>
  </Destination>  

  ... ... ...

</Destinations>

And a MySQL table "destinations" like this:

+---+--------------+----------+---------+----------+
|id |DestinationId |Country   |City     |State     |
+---+--------------+----------+---------+----------+
|1  |W4R1FG        |Pakistan  |Karachi  |Sindh     |
|2  |D2C2FV        |Turkey    |Istanbul |Istanbul  |
|3  |5TFV3E        |Canada    |Toronto  |Ontario   |
|.  |......        |......    |.......  |.......   |
+---+--------------+----------+---------+----------+

Now I want to process the XML and check each destination record against the MySQL table. Only DestinationId needs to be compared: if a record with that DestinationId already exists in the table, skip it and move on; if it doesn't exist, execute an INSERT query to add that record to the table.
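
In SQL terms, the per-record rule is "insert unless this DestinationId already exists". A minimal sketch for a single row (values taken from the sample XML above) would be:

INSERT INTO destinations (DestinationId, Country, City, State)
SELECT 'W4R1FG', 'Pakistan', 'Karachi', 'Sindh'
FROM DUAL
WHERE NOT EXISTS (
    SELECT 1 FROM destinations WHERE DestinationId = 'W4R1FG'
);

Doing that once per row from application code is exactly what turned out to be too slow, as described next.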

I first tried to accomplish this with a PHP foreach loop, but since the data is so large, that caused serious performance and speed issues. Then I came up with a MySQL stored procedure approach like this:

DELIMITER $$

USE `destinations`$$

DROP PROCEDURE IF EXISTS `p_import_destinations`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `p_import_destinations`(
    p_xml                     TEXT
)
BEGIN
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;
    DECLARE v_row_count INT UNSIGNED;
    DECLARE v_xpath_row VARCHAR(255);

    -- calculate the number of row elements.
    SET v_row_count := extractValue(p_xml,'count(/Destinations/Destination)');

    -- loop through all the row elements
    WHILE v_row_index < v_row_count DO        
        SET v_row_index := v_row_index + 1;
        SET v_xpath_row := CONCAT('/Destinations/Destination[',v_row_index,']');

        INSERT IGNORE INTO destinations VALUES (
            NULL,
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::Country')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::City')),
            extractValue(p_xml, CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;

END$$  

DELIMITER ;

Query to call this procedure:

SET @xml := LOAD_FILE('C:/Users/Muhammad Ali/Desktop/dest.xml'); 
CALL p_import_destinations(@xml);
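
Note that LOAD_FILE() silently returns NULL if the server cannot read the file (the FILE privilege, the secure_file_priv setting, and max_allowed_packet all matter here), so a quick sanity check is worthwhile:

SHOW VARIABLES LIKE 'secure_file_priv';
SELECT @xml IS NULL AS file_not_loaded;  -- 1 means the file was not read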

This worked perfectly, but I am still not sure about this approach's scalability, performance and speed. Also, the IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment key values. For example, if the table is at id 3306 and the next record is a duplicate, it will not be inserted (which is a good thing), but it will still consume the auto-increment value 3307, so the next NON-DUPLICATE record gets inserted at 3308, leaving a gap. This doesn't seem good.
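
One thing to keep in mind: INSERT IGNORE can only skip a row if MySQL detects a duplicate-key violation, so this approach assumes DestinationId carries a UNIQUE index, e.g.:

-- without this, INSERT IGNORE would happily insert duplicate DestinationIds
ALTER TABLE destinations ADD UNIQUE KEY uq_destination_id (DestinationId);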

Any other approach(es) to meet this requirement would be much appreciated. And please advise whether I am OK to go on with this solution; if not, why?

Just remember, I am dealing with a huge amount of data.

This worked perfectly, but I am still not sure about this approach's scalability, performance and speed.

Measure the speed and test how it scales; then you're sure. Ask again if you find a problem that would hurt you in your scenario, but make the performance / scalability problem more concrete. Most likely that part has been Q&A'ed already, if not here on Stack Overflow then on the DBA site: https://dba.stackexchange.com/
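
For example, a rough wall-clock measurement from the mysql client (NOW(6) needs MySQL 5.6+):

-- crude timing of one procedure call
SET @t0 := NOW(6);
CALL p_import_destinations(@xml);
SELECT TIMESTAMPDIFF(MICROSECOND, @t0, NOW(6)) / 1000000 AS seconds_elapsed;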

Also, the IGNORE clause used in this procedure skips duplicate records but still consumes auto-increment key values

The same applies here. If those gaps are a problem for you, that normally points to a flaw in your database design, because such gaps are normally meaningless (compare: How to fill in the "holes" in auto-increment fields?).

However, that doesn't mean others haven't run into this as well. You can find a lot of material on it, including "tricks" for preventing gaps with specific versions of your database server. But honestly, I wouldn't care about the gaps. The contract is that the identity column has a unique value, and that's all.
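
For completeness, one such "trick" for trailing gaps (assuming InnoDB, which never moves the counter below MAX(id) + 1, so interior gaps stay untouched):

-- resets the auto-increment counter to MAX(id) + 1
ALTER TABLE destinations AUTO_INCREMENT = 1;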

In any case, both for performance and for the IDs: why not take the processing apart? First import from the XML into an import table; then you can easily remove every row you don't want to import from that import table, and then insert into the destination table as needed.
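
A minimal sketch of that staged approach, written as an anti-join instead of a DELETE (destinations_import is a hypothetical staging table):

-- stage the XML rows first, then copy over only the unmatched ones
CREATE TABLE destinations_import LIKE destinations;
-- ... load the XML rows into destinations_import here ...
INSERT INTO destinations (DestinationId, Country, City, State)
SELECT i.DestinationId, i.Country, i.City, i.State
FROM destinations_import AS i
LEFT JOIN destinations AS d ON d.DestinationId = i.DestinationId
WHERE d.DestinationId IS NULL;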

Solved this using a different approach, described below.

DELIMITER $$

USE `test`$$

DROP PROCEDURE IF EXISTS `import_destinations_xml`$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `import_destinations_xml`(
    path VARCHAR(255), 
    node VARCHAR(255)
)

BEGIN
    DECLARE xml_content TEXT;
    DECLARE v_row_index INT UNSIGNED DEFAULT 0;   
    DECLARE v_row_count INT UNSIGNED;  
    DECLARE v_xpath_row VARCHAR(255); 

    -- set xml content.
    SET xml_content = LOAD_FILE(path);

    -- calculate the number of row elements.   
    SET v_row_count  = extractValue(xml_content, CONCAT('count(', node, ')')); 

    -- create a temporary destinations table
    DROP TABLE IF EXISTS `destinations_temp`;
    CREATE TABLE `destinations_temp` (
      `id` INT(11) NOT NULL AUTO_INCREMENT,
      `DestinationId` VARCHAR(32) DEFAULT NULL,
      `Country` VARCHAR(255) DEFAULT NULL,
      `City` VARCHAR(255) DEFAULT NULL,
      `State` VARCHAR(255) DEFAULT NULL,
    PRIMARY KEY (`id`)
    ) ENGINE=INNODB AUTO_INCREMENT=1 DEFAULT CHARSET=latin1;  

    -- loop through all the row elements    
    WHILE v_row_index < v_row_count DO                
        SET v_row_index = v_row_index + 1;        
        SET v_xpath_row = CONCAT(node, '[', v_row_index, ']');
        INSERT INTO destinations_temp VALUES (
            NULL,
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::DestinationId')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::Country')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::City')),
            extractValue(xml_content, CONCAT(v_xpath_row, '/child::State'))
        );
    END WHILE;

    -- delete existing records from temporary destinations table
    DELETE FROM destinations_temp WHERE DestinationId IN (SELECT DestinationId FROM destinations);

    -- insert remaining (unmatched) records from temporary destinations table to destinations table
    INSERT INTO destinations (DestinationId, Country, City, State) 
    SELECT DestinationId, Country, City, State 
    FROM destinations_temp;

    -- creating a log file    
    SELECT  *
    INTO OUTFILE 'C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/log/destinations_log.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\r\n'
    FROM `destinations_temp`;

    -- removing temporary destinations table
    DROP TABLE destinations_temp;

END$$

DELIMITER ;

Query to call this procedure:

CALL import_destinations_xml('C:/Users/Muhammad Ali/Desktop/Destination_Import_Procedure/dest.xml', '/Destinations/Destination');
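
One performance note on this solution: with large tables, the DELETE ... WHERE DestinationId IN (...) step degenerates into repeated scans unless DestinationId is indexed on both sides. Assuming no such index exists yet, something like this should help:

-- index the lookup column on the permanent table ...
ALTER TABLE destinations ADD INDEX idx_destination_id (DestinationId);
-- ... and add a matching key to the temp table DDL inside the procedure:
--     KEY `idx_destination_id` (`DestinationId`)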
