Parse big 100 MB XML files and store them in a SQLite db

I have a folder with many XML files of about 100 MB each. I want to parse them tag by tag and store the data in a SQLite database.
Here is my example XML. Each file contains 75-80 <Conversation> elements like the one below. I need to fetch all the tag info (ConversationID, LoginName, StartTime, CompanyName, EmailAddress, DateTime, AccountNumber, FirmNumber, MessageContent, EndTime) and insert it into table rows.
How many tables do I need? I am thinking of creating one table with many columns and filling it row by row based on ConversationID. My processing then involves counting how many users are in each conversation, what messages they sent, what their email IDs are, etc.
Is XPath or StAX element processing easier for this? No SAX or DOM, because I always get an OutOfMemory error since the data is huge.

Input XML file example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
<Version>IBXML 1.3</Version>
<Conversation Perspective=" " RoomType="P">
<RoomID>PCHAT-0x3000001CA8361</RoomID>
<StartTime>03/31/2016 13:39:01</StartTime>
<StartTimeUTC>1459431541</StartTimeUTC>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>ABC BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 13:39:01</DateTime>
<DateTimeUTC>1459431541</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="H">
<User>
<LoginName>JAU31</LoginName>
<FirstName>JIMMY</FirstName>
<LastName>AU</LastName>
<UUID>8724958</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>ABC BANK (HONG KONG)</CompanyName>
<EmailAddress>JAU31@xyz.net</EmailAddress>
<CorporateEmailAddress>yiumingau@ABC.com</CorporateEmailAddress>
</User>
<DateTime>03/29/2016 10:45:47</DateTime>
<DateTimeUTC>1459248347</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 14:56:22</DateTime>
<DateTimeUTC>1459436182</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:30:01</DateTime>
<DateTimeUTC>1459452601</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:33:56</DateTime>
<DateTimeUTC>1459452836</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:45:16</DateTime>
<DateTimeUTC>1459453516</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:08:09</DateTime>
<DateTimeUTC>1459465689</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:14:23</DateTime>
<DateTimeUTC>1459466063</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:10:57</DateTime>
<DateTimeUTC>1459469457</DateTimeUTC>
<Content>
abcdefgghhhhhh
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:14:05</DateTime>
<DateTimeUTC>1459469645</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantEntered InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
ajdakjgdljsgdsafhkafa
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
akjdgljsafdlshf;kdsjf
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:39:32</DateTime>
<DateTimeUTC>1459471172</DateTimeUTC>
<Content>
sagdksajdlsahd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:01:27</DateTime>
<DateTimeUTC>1459472487</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:31:29</DateTime>
<DateTimeUTC>1459474289</DateTimeUTC>
<Content>
ajdslsahdsj;a
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
sagdkjsagdkjashdlasjd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
jsdhkshdksjdlsjdlks
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
jshdkshdksjdlskld
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
aasasasasas
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<EndTime>04/01/2016 03:47:37</EndTime>
<EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>
</FileDump>

It looks like you need two or three tables: conversations(conversationID, StartTime, EndTime), users(LoginName, CompanyName, EmailAddress, FirmNumber), and messages(DateTime, MessageContent, AccountNumber).

I once did an XML import with PHP, and that was a 1 GB XML file, so it is strange that you have problems with Java and a 100 MB file. If memory really is the problem, my suggestion is to open the file with the standard Java I/O classes and read it line by line (or, if that is not possible in your case, character by character). While reading, detect the start and end tags (<User> and </User>) and collect the data between them inside your loop. You may end up processing each file three times (a first pass to fetch all users, a second to fetch all conversations, and a third to fetch all messages), but this looks like a one-time procedure, so that should be acceptable.
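
A minimal sketch of the line-by-line idea, assuming every tag sits on its own line as in the sample above. The class name LineScanSketch and the method forEachUser are made up for illustration. A real XML parser is more robust: this scan does not decode entities and breaks if an element spans several lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class LineScanSketch {
    // A leaf element that opens and closes on one line, e.g. <LoginName>SWONG00</LoginName>.
    private static final Pattern LEAF = Pattern.compile("<(\\w+)>(.*)</\\1>");

    /** Calls sink once per <User>...</User> block, so only one block is held in memory. */
    static void forEachUser(Reader source, Consumer<Map<String, String>> sink) throws IOException {
        Map<String, String> current = null;   // non-null only between <User> and </User>
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String text = line.trim();
                if (text.equals("<User>")) {
                    current = new LinkedHashMap<>();
                } else if (text.equals("</User>")) {
                    if (current != null) {
                        sink.accept(current);
                    }
                    current = null;
                } else if (current != null) {
                    Matcher m = LEAF.matcher(text);
                    if (m.matches()) {
                        // Raw text: entity references such as &amp; are not decoded here.
                        current.put(m.group(1), m.group(2));
                    }
                }
            }
        }
    }
}
```

It could be called as LineScanSketch.forEachUser(Files.newBufferedReader(path), user -> save(user)), where save is whatever writes one row to the database. Passing each block to a callback instead of collecting a list keeps memory use flat regardless of file size.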

You should use StAX to parse the XML file, processing it like this.

Read the initial part of the XML, validate it, then ignore it:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
    <Version>IBXML 1.3</Version>
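
This first step might look like the sketch below, which uses the JDK's javax.xml.stream API. What counts as "valid" is an assumption here: the root element must be FileDump and the Version text must start with "IBXML". The class name DumpHeaderSketch is hypothetical.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: opens a dump and checks its fixed preamble.
final class DumpHeaderSketch {
    /** Returns a reader positioned on </Version>, or throws if the preamble is unexpected. */
    static XMLStreamReader open(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The sample has no DOCTYPE, so DTDs and external entities can be switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);

        reader.nextTag();   // skips the leading comment and whitespace
        reader.require(XMLStreamReader.START_ELEMENT, null, "FileDump");
        reader.nextTag();
        reader.require(XMLStreamReader.START_ELEMENT, null, "Version");
        String version = reader.getElementText();   // leaves the reader on </Version>
        if (!version.startsWith("IBXML")) {         // assumed validation rule
            throw new XMLStreamException("Unexpected version: " + version);
        }
        return reader;
    }
}
```

After open returns, the next call to nextTag() lands on the first <Conversation> start tag.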

Read the beginning of the Conversation:

<Conversation Perspective=" " RoomType="P">
    <RoomID>PCHAT-0x3000001CA8361</RoomID>
    <StartTime>03/31/2016 13:39:01</StartTime>
    <StartTimeUTC>1459431541</StartTimeUTC>

Create a new record in the Conversation table in the database, getting the ID of the new record back.
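
With plain JDBC, inserting the record and getting its ID back could be sketched as follows. The table name conversation and its columns are assumptions, as is the class name ConversationDaoSketch. Whether getGeneratedKeys() is supported depends on the JDBC driver and its version; with SQLite, running SELECT last_insert_rowid() on the same connection is an alternative.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical DAO; table and column names are illustrative only.
final class ConversationDaoSketch {
    /** Inserts the conversation header and returns the database-generated row ID. */
    static long insertConversation(Connection db, String roomId, String startTime, long startTimeUtc)
            throws SQLException {
        String sql = "INSERT INTO conversation (room_id, start_time, start_time_utc) VALUES (?, ?, ?)";
        try (PreparedStatement ps = db.prepareStatement(sql, Statement.RETURN_GENERATED_KEYS)) {
            ps.setString(1, roomId);
            ps.setString(2, startTime);
            ps.setLong(3, startTimeUtc);
            ps.executeUpdate();
            // Assumes the driver reports generated keys for this statement.
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (!keys.next()) {
                    throw new SQLException("No generated key returned");
                }
                return keys.getLong(1);
            }
        }
    }
}
```

The returned ID is what later Participant and Message rows would store as their foreign key.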

Read a Participant entry, and save it in the Participant table (where Entered vs Left is a column):

<ParticipantEntered InteractionType="N" DeviceType="M">
    <User>
        <LoginName>SWONG00</LoginName>
        <FirstName>STEPHEN</FirstName>
        <LastName>WONG</LastName>
        <UUID>4397109</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>231115</AccountNumber>
        <CompanyName>ABC BANK LIMITED HON</CompanyName>
        <EmailAddress>SWONG00@xyz.net</EmailAddress>
        <CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
    </User>
    <DateTime>03/31/2016 13:39:01</DateTime>
    <DateTimeUTC>1459431541</DateTimeUTC>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>

Read a Message entry, and save it in the Message table:

<Message InteractionType="N">
    <User>
        <LoginName>G_LO</LoginName>
        <FirstName>GARY</FirstName>
        <LastName>LO</LastName>
        <UUID>7054548</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>91189</AccountNumber>
        <CompanyName>abc BANK (HONG KONG)</CompanyName>
        <EmailAddress>G_LO@xyz.net</EmailAddress>
        <CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
    </User>
    <DateTime>04/01/2016 00:10:57</DateTime>
    <DateTimeUTC>1459469457</DateTimeUTC>
    <Content>
abcdefgghhhhhh
    </Content>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>

Keep reading and saving entries: <ParticipantEntered>, <ParticipantLeft>, and <Message>.
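
All three entry types share the same shape (attributes, a nested <User>, then a few leaf elements), so one reader can serve them all. The sketch below flattens an entry into a map. The class name EntrySketch and the "Kind" key are made up for illustration, and it assumes leaf elements contain only text, as in the sample.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: turns one entry element into a flat name/value map.
final class EntrySketch {
    /**
     * The reader must be on the entry's start tag (ParticipantEntered, ParticipantLeft or Message).
     * On return it is on the matching end tag, and nothing else from the file is retained.
     */
    static Map<String, String> readEntry(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamReader.START_ELEMENT, null, null);
        Map<String, String> row = new LinkedHashMap<>();
        row.put("Kind", reader.getLocalName());
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            row.put(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
        }
        readChildren(reader, row);
        return row;
    }

    // Consumes child elements until the end tag of the current element is reached.
    private static void readChildren(XMLStreamReader reader, Map<String, String> row)
            throws XMLStreamException {
        while (reader.nextTag() == XMLStreamReader.START_ELEMENT) {
            String name = reader.getLocalName();
            if (name.equals("User")) {
                readChildren(reader, row);   // flatten the <User> fields into the same row
            } else {
                // <Content> text is surrounded by line breaks in the sample, hence trim().
                row.put(name, reader.getElementText().trim());
            }
        }
    }
}
```

The caller can then branch on row.get("Kind") to decide whether the row goes to the Participant or the Message table.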

Read the end of the Conversation:

    <EndTime>04/01/2016 03:47:37</EndTime>
    <EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>

Update the Conversation record created earlier.

Read and validate the end of the XML document:

</FileDump>

You're done, with a very low memory footprint.

Note: You might also have a 4th User table.
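
One possible layout for these four tables is sketched below. Every table name, column name and type is an assumption, not something the dump format dictates. The user table is called chat_user to avoid clashing with keywords in other databases, and the Entered vs Left column is named direction.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical schema; adjust names and types to your own needs.
final class SchemaSketch {
    static final String[] DDL = {
        "CREATE TABLE IF NOT EXISTS chat_user ("
            + "login_name TEXT PRIMARY KEY, first_name TEXT, last_name TEXT, uuid TEXT, "
            + "firm_number TEXT, account_number TEXT, company_name TEXT, "
            + "email_address TEXT, corporate_email_address TEXT)",
        "CREATE TABLE IF NOT EXISTS conversation ("
            + "id INTEGER PRIMARY KEY, room_id TEXT NOT NULL, "
            + "start_time TEXT, start_time_utc INTEGER, end_time TEXT, end_time_utc INTEGER)",
        "CREATE TABLE IF NOT EXISTS participant ("
            + "id INTEGER PRIMARY KEY, "
            + "conversation_id INTEGER NOT NULL REFERENCES conversation(id), "
            + "login_name TEXT NOT NULL REFERENCES chat_user(login_name), "
            + "direction TEXT NOT NULL CHECK (direction IN ('ENTERED', 'LEFT')), "
            + "date_time_utc INTEGER)",
        "CREATE TABLE IF NOT EXISTS message ("
            + "id INTEGER PRIMARY KEY, "
            + "conversation_id INTEGER NOT NULL REFERENCES conversation(id), "
            + "login_name TEXT NOT NULL REFERENCES chat_user(login_name), "
            + "date_time_utc INTEGER, content TEXT)"
    };

    /** Runs every DDL statement on the given connection, in dependency order. */
    static void create(Connection db) throws SQLException {
        try (Statement st = db.createStatement()) {
            for (String ddl : DDL) {
                st.execute(ddl);
            }
        }
    }
}
```

Keeping users in their own table avoids repeating the name, firm and e-mail columns on every message row. With this layout, the number of distinct users per conversation would be SELECT conversation_id, COUNT(DISTINCT login_name) FROM participant GROUP BY conversation_id.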
