繁体   English   中英

解析大100mb xml并将其存储到sqlite db中

[英]parse big 100mb xml and store it into sqlite db

我有一个包含许多xml文件的文件夹,每个文件100mb我要按标签解析它并将其存储到sqlite数据库中。
这是我的示例xml,它以<conversation>标签开头,就像1个文件中的75-80个会话标签一样。 我需要获取所有标记信息conversationID,LoginName,StartTime,CompanyName,EmailAddress,DateTime,AccountNumber,FirmNumber,MessageContent,EndTime并插入表行。
我需要几张桌子? 我只是想创建一个包含许多列的表,以基于conversationID逐行填充所有数据。 然后,我的处理涉及计算对话中的用户数,他们发送的消息,他们的电子邮件ID等。
任何xpath标签都更容易处理或stax元素处理? 没有SAX或DOM,因为我总是得到outOfMemory错误,因为它是巨大的数据

输入xml文件示例

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
<Version>IBXML 1.3</Version>
<Conversation Perspective=" " RoomType="P">
<RoomID>PCHAT-0x3000001CA8361</RoomID>
<StartTime>03/31/2016 13:39:01</StartTime>
<StartTimeUTC>1459431541</StartTimeUTC>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>ABC BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 13:39:01</DateTime>
<DateTimeUTC>1459431541</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="H">
<User>
<LoginName>JAU31</LoginName>
<FirstName>JIMMY</FirstName>
<LastName>AU</LastName>
<UUID>8724958</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>ABC BANK (HONG KONG)</CompanyName>
<EmailAddress>JAU31@xyz.net</EmailAddress>
<CorporateEmailAddress>yiumingau@ABC.com</CorporateEmailAddress>
</User>
<DateTime>03/29/2016 10:45:47</DateTime>
<DateTimeUTC>1459248347</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 14:56:22</DateTime>
<DateTimeUTC>1459436182</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:30:01</DateTime>
<DateTimeUTC>1459452601</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:33:56</DateTime>
<DateTimeUTC>1459452836</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 19:45:16</DateTime>
<DateTimeUTC>1459453516</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantLeft InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:08:09</DateTime>
<DateTimeUTC>1459465689</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantLeft>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>03/31/2016 23:14:23</DateTime>
<DateTimeUTC>1459466063</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>G_LO</LoginName>
<FirstName>GARY</FirstName>
<LastName>LO</LastName>
<UUID>7054548</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>G_LO@xyz.net</EmailAddress>
<CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:10:57</DateTime>
<DateTimeUTC>1459469457</DateTimeUTC>
<Content>
abcdefgghhhhhh
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:14:05</DateTime>
<DateTimeUTC>1459469645</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<ParticipantEntered InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
ajdakjgdljsgdsafhkafa
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:29:19</DateTime>
<DateTimeUTC>1459470559</DateTimeUTC>
<Content>
akjdgljsafdlshf;kdsjf
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N">
<User>
<LoginName>WVU</LoginName>
<FirstName>WHEELOCK</FirstName>
<LastName>VU</LastName>
<UUID>8266852</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>91189</AccountNumber>
<CompanyName>abc BANK (HONG KONG)</CompanyName>
<EmailAddress>WVU@xyz.net</EmailAddress>
<CorporateEmailAddress>WHEELOCKVU@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 00:39:32</DateTime>
<DateTimeUTC>1459471172</DateTimeUTC>
<Content>
sagdksajdlsahd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<ParticipantEntered InteractionType="N" DeviceType="M">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:01:27</DateTime>
<DateTimeUTC>1459472487</DateTimeUTC>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>
<Message InteractionType="N">
<User>
<LoginName>SWONG00</LoginName>
<FirstName>STEPHEN</FirstName>
<LastName>WONG</LastName>
<UUID>4397109</UUID>
<FirmNumber>13133</FirmNumber>
<AccountNumber>231115</AccountNumber>
<CompanyName>abc BANK LIMITED HON</CompanyName>
<EmailAddress>SWONG00@xyz.net</EmailAddress>
<CorporateEmailAddress>STEPHENWONGWE@abc.COM</CorporateEmailAddress>
</User>
<DateTime>04/01/2016 01:31:29</DateTime>
<DateTimeUTC>1459474289</DateTimeUTC>
<Content>
ajdslsahdsj;a
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
sagdkjsagdkjashdlasjd
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 02:49:46</DateTime>
<DateTimeUTC>1459478986</DateTimeUTC>
<Content>
jsdhkshdksjdlsjdlks
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
jshdkshdksjdlskld
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<Message InteractionType="N" DeviceType="M">
<User>
<LoginName>FCHAN95</LoginName>
<FirstName>FLORENCE</FirstName>
<LastName>CHAN</LastName>
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName>
<EmailAddress>FCHAN95@xyz.net</EmailAddress>
<CorporateEmailAddress></CorporateEmailAddress>
</User>
<DateTime>04/01/2016 03:47:37</DateTime>
<DateTimeUTC>1459482457</DateTimeUTC>
<Content>
aasasasasas
</Content>
<ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>
<EndTime>04/01/2016 03:47:37</EndTime>
<EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>
</FileDump>

看起来你应该做3或2个表 - 对话(conversationID,StartTime,EndTime),用户(LoginName,CompanyName,EmailAddress,FirmNumber),消息(DateTime,MessageContent,AccountNumber)

一旦我用php进行xml导入,但它是php,并且有1GB的xml文件。 奇怪的是你有java和100 mb xml的问题。 但是如果你有内存问题,我可以告诉你我的决定 - 获取普通java类的文件,并逐行读取(如果你的情况不可能,char by char)。 在此读取过程中,您应该定义开始和结束标记( <User> and </User>)并在循环内部读取此数据。 也许你会处理你的每个文件3次 - 第一次迭代来获取所有用户,第二次获取所有对话,第三次获取所有消息,但看起来这是一次性程序,所以它应该没问题。

您应该使用StAX来解析XML文件并像这样处理它。

阅读XML的初始部分,验证它,然后忽略它。

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Data provided by xyz LP. -->
<FileDump>
    <Version>IBXML 1.3</Version>

阅读对话的开头:

<Conversation Perspective=" " RoomType="P">
    <RoomID>PCHAT-0x3000001CA8361</RoomID>
    <StartTime>03/31/2016 13:39:01</StartTime>
    <StartTimeUTC>1459431541</StartTimeUTC>

在数据库的Conversation表中创建一条新记录,获取新记录的ID。

读取参与者条目,并将其保存在Participant表中(Entered vs Left是一列):

<ParticipantEntered InteractionType="N" DeviceType="M">
    <User>
        <LoginName>SWONG00</LoginName>
        <FirstName>STEPHEN</FirstName>
        <LastName>WONG</LastName>
        <UUID>4397109</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>231115</AccountNumber>
        <CompanyName>ABC BANK LIMITED HON</CompanyName>
        <EmailAddress>SWONG00@xyz.net</EmailAddress>
        <CorporateEmailAddress>STEPHENWONGWE@ABC.COM</CorporateEmailAddress>
    </User>
    <DateTime>03/31/2016 13:39:01</DateTime>
    <DateTimeUTC>1459431541</DateTimeUTC>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</ParticipantEntered>

读取一条消息条目,并将其保存在Message表中:

<Message InteractionType="N">
    <User>
        <LoginName>G_LO</LoginName>
        <FirstName>GARY</FirstName>
        <LastName>LO</LastName>
        <UUID>7054548</UUID>
        <FirmNumber>13133</FirmNumber>
        <AccountNumber>91189</AccountNumber>
        <CompanyName>abc BANK (HONG KONG)</CompanyName>
        <EmailAddress>G_LO@xyz.net</EmailAddress>
        <CorporateEmailAddress>garyloyc@abc.com</CorporateEmailAddress>
    </User>
    <DateTime>04/01/2016 00:10:57</DateTime>
    <DateTimeUTC>1459469457</DateTimeUTC>
    <Content>
abcdefgghhhhhh
    </Content>
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID>
</Message>

继续阅读和保存条目: <ParticipantEntered><ParticipantLeft><Message>

阅读对话的结尾:

    <EndTime>04/01/2016 03:47:37</EndTime>
    <EndTimeUTC>1459482457</EndTimeUTC>
</Conversation>

更新先前创建的Conversation记录。

阅读并验证XML文档的结尾:

</FileDump>

你已经完成了,内存占用很少。

注意:您可能还有第4个User表。

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM