
Parse big 100 MB XML files and store them in an SQLite DB

I have a folder with many XML files, each about 100 MB. I want to parse them tag by tag and store the data in an SQLite database.
Here is my example XML. The data is organized in <Conversation> tags, with 75-80 conversations in one file. I need to fetch the values of ConversationID, LoginName, StartTime, CompanyName, EmailAddress, DateTime, AccountNumber, FirmNumber, the message content, and EndTime, and insert them as table rows.
How many tables do I need? I am thinking of creating one table with many columns and filling it row by row, keyed on ConversationID. My processing then involves counting how many users are in each conversation, what messages they sent, what their email addresses are, and so on.
Is XPath easier for this, or StAX element processing? I would rather not use SAX or DOM, because I always get an OutOfMemoryError with data this large.

Example input XML file:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
<Version>IBXML 1.3</Version>
<Conversation Perspective=" " RoomType="P">
<RoomID>PCHAT-0x3000001CA8361</RoomID>
<StartTime>03/31/2016 13:39:01</StartTime>
<StartTimeUTC>1459431541</StartTimeUTC>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>ABC BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 13:39:01</DateTime>
<DateTimeUTC>1459431541</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="H">
<User>
<LoginName>JAU31</LoginName>
<FirstName>JIMMY</FirstName>
<LastName>AU</LastName>
<UUID>8724958</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>ABC BANK (HONG KONG)</CompanyName>
<EmailAddress>JAU31@xyz.net</EmailAddress>
<CorporateEmailAddress>yiumingau@ABC.com</CorporateEmailAddress>
</User>
<DateTime>03/29/2016 10:45:47</DateTime>
<DateTimeUTC>1459248347</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 14:56:22</DateTime>
<DateTimeUTC>1459436182</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:30:01</DateTime>
<DateTimeUTC>1459452601</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:33:56</DateTime>
<DateTimeUTC>1459452836</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:45:16</DateTime>
<DateTimeUTC>1459453516</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:08:09</DateTime>
<DateTimeUTC>1459465689</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:14:23</DateTime>
<DateTimeUTC>1459466063</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:10:57</DateTime>
<DateTimeUTC>1459469457</DateTimeUTC>
<Content>
abcdefgghhhhhh
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:14:05</DateTime>
<DateTimeUTC>1459469645</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantEntered InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
ajdakjgdljsgdsafhkafa
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
akjdgljsafdlshf;kdsjf
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:39:32</DateTime>
<DateTimeUTC>1459471172</DateTimeUTC>
<Content>
sagdksajdlsahd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:01:27</DateTime>
<DateTimeUTC>1459472487</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:31:29</DateTime>
<DateTimeUTC>1459474289</DateTimeUTC>
<Content>
ajdslsahdsj;a
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
sagdkjsagdkjashdlasjd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
jsdhkshdksjdlsjdlks
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
jshdkshdksjdlskld
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
aasasasasas
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<EndTime>04/01/2016 03:47:37</EndTime>
<EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>
</FileDump>

It looks like you need two or three tables: conversations(ConversationID, StartTime, EndTime), users(LoginName, CompanyName, EmailAddress, FirmNumber), and messages(DateTime, MessageContent, AccountNumber).
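A rough sketch of how those three tables could be declared from Java. The snake_case column names, the room_id and login_name link columns, and the SchemaSketch class itself are assumptions added for illustration; only the table names and the listed fields come from the suggestion above. The create method expects a JDBC Connection to an SQLite database, which needs a third-party SQLite JDBC driver that is not shown here.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

final class SchemaSketch {
    // Hypothetical DDL; dates are kept as TEXT because SQLite has no
    // dedicated date/time storage class.
    static final List<String> DDL = List.of(
        "CREATE TABLE IF NOT EXISTS conversations ("
            + " conversation_id TEXT PRIMARY KEY,"
            + " start_time TEXT,"
            + " end_time TEXT)",
        "CREATE TABLE IF NOT EXISTS users ("
            + " login_name TEXT PRIMARY KEY,"
            + " company_name TEXT,"
            + " email_address TEXT,"
            + " firm_number TEXT)",
        "CREATE TABLE IF NOT EXISTS messages ("
            + " id INTEGER PRIMARY KEY,"
            // Link columns (assumed) so messages can be counted per
            // conversation and per user.
            + " conversation_id TEXT REFERENCES conversations(conversation_id),"
            + " login_name TEXT REFERENCES users(login_name),"
            + " date_time TEXT,"
            + " message_content TEXT,"
            + " account_number TEXT)");

    /** Runs every CREATE TABLE statement on the given connection. */
    static void create(Connection connection) throws SQLException {
        try (Statement st = connection.createStatement()) {
            for (String sql : DDL) {
                st.executeUpdate(sql);
            }
        }
    }
}
```

With link columns like these, "how many users in a conversation" becomes a COUNT(DISTINCT login_name) over messages grouped by conversation_id.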

I once did an XML import in PHP, with a 1 GB XML file, so it is strange that you have problems with a 100 MB file in Java. But if memory is the problem, here is what I would suggest: open the file with the standard Java I/O classes and read it line by line (or, if that is not possible in your case, character by character). While reading, detect the start and end tags (for example <User> and </User>) and collect the data between them inside your loop. You may end up processing each file three times: a first pass to fetch all users, a second to fetch all conversations, and a third to fetch all messages. Since this looks like a one-time procedure, that should be acceptable.
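A minimal sketch of the "first pass" idea, collecting the distinct login names by reading line by line. It assumes every element sits on its own line, as in the sample file, and it does not decode XML escapes such as &amp;amp;, so it is only suitable for dumps with this exact layout. The class and method names are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashSet;
import java.util.Set;

final class LineScan {
    private static final String OPEN = "<LoginName>";
    private static final String CLOSE = "</LoginName>";

    /** Returns the distinct LoginName values found inside <User> blocks. */
    static Set<String> loginNames(Reader in) throws IOException {
        Set<String> names = new LinkedHashSet<>();
        boolean inUser = false;
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            // Only one line is held in memory at a time.
            while ((line = br.readLine()) != null) {
                String t = line.trim();
                if (t.equals("<User>")) {
                    inUser = true;
                } else if (t.equals("</User>")) {
                    inUser = false;
                } else if (inUser && t.startsWith(OPEN) && t.endsWith(CLOSE)) {
                    names.add(t.substring(OPEN.length(), t.length() - CLOSE.length()));
                }
            }
        }
        return names;
    }
}
```

For a real file the Reader would typically come from Files.newBufferedReader(path, StandardCharsets.UTF_8).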

You should use StAX to parse the XML file and process it like this.

Read the initial part of the XML, validate it, then ignore it.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
    <Version>IBXML 1.3</Version>
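
One way this header check might look with the cursor-style StAX API (javax.xml.stream, part of the JDK). The HeaderCheck class and readHeader method are names invented for this sketch; nextTag, require, and getElementText are standard XMLStreamReader methods.

```java
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class HeaderCheck {
    /**
     * Expects a freshly created reader. Verifies that the document starts
     * with <FileDump><Version>...</Version> and returns the version text.
     * On return the reader sits on </Version>, ready for the first
     * <Conversation>.
     */
    static String readHeader(XMLStreamReader r) throws XMLStreamException {
        // nextTag() skips whitespace and comments and stops on the next tag.
        r.nextTag();
        // require() throws XMLStreamException if the current event differs.
        r.require(XMLStreamConstants.START_ELEMENT, null, "FileDump");
        r.nextTag();
        r.require(XMLStreamConstants.START_ELEMENT, null, "Version");
        return r.getElementText().trim();
    }
}
```

The reader itself can be obtained with XMLInputFactory.newInstance().createXMLStreamReader(inputStream).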

Read the beginning of the Conversation:

<Conversation Perspective=" " RoomType="P">
    <RoomID>PCHAT-0x3000001CA8361</RoomID>
    <StartTime>03/31/2016 13:39:01</StartTime>
    <StartTimeUTC>1459431541</StartTimeUTC>

Create a new record in the Conversation table in the database, getting the ID of the new record back.
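
Before the values go into the record, the two time formats in the file may need normalizing, since SQLite has no dedicated date type and dates are usually stored as ISO-8601 text or as integer epoch seconds. A small sketch, assuming the "MM/dd/yyyy HH:mm:ss" layout seen in the sample; the TimeFields class and its method names are made up for illustration.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class TimeFields {
    // Layout of <StartTime>, <DateTime> and <EndTime> in the sample dump.
    private static final DateTimeFormatter DUMP_FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/yyyy HH:mm:ss");

    /** Converts a *UTC field (epoch seconds as text) to ISO-8601 text. */
    static String isoFromEpoch(String epochSeconds) {
        return Instant.ofEpochSecond(Long.parseLong(epochSeconds.trim())).toString();
    }

    /** Parses a non-UTC field; the result carries no time zone. */
    static LocalDateTime parseLocal(String text) {
        return LocalDateTime.parse(text.trim(), DUMP_FORMAT);
    }
}
```

For the sample conversation, isoFromEpoch("1459431541") gives "2016-03-31T13:39:01Z", which matches the StartTime value shown above.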

Read a Participant entry, and save it in the Participant table (where Entered vs Left is a column):

<ParticipantEntered InteractionType="N" DeviceType="M">
    <User>
        <LoginName>SWONG00</LoginName>
        <FirstName>STEPHEN</FirstName>
        <LastName>WONG</LastName>
        <UUID>4397109</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>231115</AccountNumber>
        <CompanyName>ABC BANK LIMITED HON</CompanyName>
        <EmailAddress>SWONG00@xyz.net</EmailAddress>
        <CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
    </User>
    <DateTime>03/31/2016 13:39:01</DateTime>
    <DateTimeUTC>1459431541</DateTimeUTC>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>

Read a Message entry, and save it in the Message table:

<Message InteractionType="N">
    <User>
        <LoginName>G_LO</LoginName>
        <FirstName>GARY</FirstName>
        <LastName>LO</LastName>
        <UUID>7054548</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>91189</AccountNumber>
        <CompanyName>abc BANK (HONG KONG)</CompanyName>
        <EmailAddress>G_LO@xyz.net</EmailAddress>
        <CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
    </User>
    <DateTime>04/01/2016 00:10:57</DateTime>
    <DateTimeUTC>1459469457</DateTimeUTC>
    <Content>
abcdefgghhhhhh
    </Content>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
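
Participant and Message entries have the same shape (a nested User plus a few leaf elements), so one routine can read either kind. The sketch below collects every leaf element of the current entry into a map; EntryReader, readEntry, readFirstEntry, and the "EntryType" key are hypothetical names, and the map is a stand-in for whatever row object gets written to the database.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class EntryReader {
    /**
     * Precondition: r is on the START_ELEMENT of ParticipantEntered,
     * ParticipantLeft or Message. Reads up to the matching end tag and
     * returns leaf element name -> trimmed text.
     */
    static Map<String, String> readEntry(XMLStreamReader r) throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("EntryType", r.getLocalName());
        String leaf = null;                 // element currently being read
        StringBuilder text = new StringBuilder();
        int depth = 1;                      // 1 = inside the entry element
        while (depth > 0) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    leaf = r.getLocalName();
                    text.setLength(0);
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    // Text may arrive in several chunks, so append.
                    if (leaf != null) text.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    // Containers such as <User> close with leaf == null
                    // and are therefore not stored.
                    if (leaf != null) {
                        fields.put(leaf, text.toString().trim());
                        leaf = null;
                    }
                    break;
                default:
                    break;
            }
        }
        return fields;
    }

    /** Convenience for trying the routine on a small XML snippet. */
    static Map<String, String> readFirstEntry(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        r.nextTag();
        return readEntry(r);
    }
}
```

Trimming matters for Content, whose text is surrounded by line breaks in the dump. Users without UUID, FirmNumber, or AccountNumber (as in some sample entries) simply produce a map without those keys.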

Keep reading and saving entries: <ParticipantEntered> , <ParticipantLeft> , and <Message> .

Read the end of the Conversation:

    <EndTime>04/01/2016 03:47:37</EndTime>
    <EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>

Update the Conversation record created earlier.

Read and validate the end of the XML document:

</FileDump>

You're done, with a very low memory footprint.
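
The outer loop of the steps above could be sketched roughly as follows. To keep it short, entry bodies are skipped and only counted, and a Consumer callback stands in for the database: the point where a Summary is created is where the INSERT would go, and the point where it is handed to the callback is where the UPDATE would go. ConversationLoop, Summary, scan, and skip are names invented for this sketch.

```java
import java.io.Reader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class ConversationLoop {
    /** Per-conversation state; a stand-in for the Conversation row. */
    static final class Summary {
        String roomId, startTime, endTime;
        int participantEvents, messages;
    }

    static void scan(Reader in, Consumer<Summary> sink) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // no DTDs expected
        XMLStreamReader r = factory.createXMLStreamReader(in);
        Summary cur = null;
        while (r.hasNext()) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) {
                String name = r.getLocalName();
                if (name.equals("Conversation")) {
                    cur = new Summary();          // INSERT would happen here
                } else if (cur != null) {
                    switch (name) {
                        case "RoomID": cur.roomId = r.getElementText().trim(); break;
                        case "StartTime": cur.startTime = r.getElementText().trim(); break;
                        case "EndTime": cur.endTime = r.getElementText().trim(); break;
                        case "ParticipantEntered":
                        case "ParticipantLeft": cur.participantEvents++; skip(r); break;
                        case "Message": cur.messages++; skip(r); break;
                        default: skip(r);         // e.g. StartTimeUTC, EndTimeUTC
                    }
                }
            } else if (ev == XMLStreamConstants.END_ELEMENT && cur != null
                    && r.getLocalName().equals("Conversation")) {
                sink.accept(cur);                 // UPDATE would happen here
                cur = null;
            }
        }
        r.close();
    }

    /** Consumes events up to the end tag matching the current start tag. */
    private static void skip(XMLStreamReader r) throws XMLStreamException {
        for (int depth = 1; depth > 0; ) {
            int ev = r.next();
            if (ev == XMLStreamConstants.START_ELEMENT) depth++;
            else if (ev == XMLStreamConstants.END_ELEMENT) depth--;
        }
    }
}
```

Only one Summary is alive at a time, so memory use does not grow with file size. When real inserts replace the callback, wrapping each file in a single database transaction is usually much faster in SQLite than committing row by row.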

Note: You might also want a fourth table, User.
