
Parse Wiktionary XML data dump into MySQL database using PHP

Alright, I'm just trying to parse the Wiktionary data dump provided by Wikimedia.

My intention is to parse that XML data dump into a MySQL database. I didn't find proper documentation regarding the structure of this XML. Also, I'm not able to open the file because it is in fact really huge (~1 GB).

I thought of parsing it with a PHP script, but I have no idea about the XML structure, so I can't proceed. If anyone has already parsed it into MySQL using PHP (or knows of a tool that does), please share the details. If there is nothing in PHP, other methods are fine too.

I just followed this post ( http://www.igrec.ca/lexicography/installing-a-local-copy-of-wiktionary-mysql/ ) but it didn't work out. :( If anybody has succeeded in this process, please help. Thanks in advance.

Those files can be parsed in PHP with XMLReader operating on a compress.bzip2:// stream. The structure of your file looks like this, for example (peeking into roughly the first 3000 elements):

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (196)
    |-title (196)
    |-ns (196)
    |-id (196)
    |-restrictions (2)
    |-revision (196)
    | |-id (196)
    | |-parentid (194)
    | |-timestamp (196)
    | |-contributor (196)
    | | |-username (182)
    | | |-id (182)
    | | \-ip (14)
    | |-comment (183)
    | |-text (195)
    | |-sha1 (195)
    | |-model (195)
    | |-format (195)
    | \-minor (99)
    \-redirect (5)
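
As a rough illustration of the XMLReader approach, here is a minimal sketch that streams one `page` element at a time and pulls out `title`, `ns` and the revision `text`, following the tree above. The function names `iteratePages()` and `firstText()` and the dump file name in the usage comment are assumptions made for this example; reading through compress.bzip2:// also assumes the bz2 extension is enabled.

```php
<?php
// Sketch only: stream <page> elements from a MediaWiki XML dump.
// iteratePages() and firstText() are hypothetical helper names.

/** Text of the first descendant element named $tag, or null if absent. */
function firstText(DOMElement $el, string $tag): ?string
{
    $node = $el->getElementsByTagName($tag)->item(0);
    return $node === null ? null : $node->textContent;
}

/** Yields ['title' => string, 'ns' => int, 'text' => string] per page. */
function iteratePages(string $uri): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    // Advance to the first <page> start tag, skipping <siteinfo>.
    $found = false;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'page') {
            $found = true;
            break;
        }
    }

    $doc = new DOMDocument();
    while ($found) {
        // Expand only the current page into a small DOM subtree,
        // so memory use stays bounded by the size of a single page.
        $expanded = $reader->expand();
        if ($expanded === false) {
            throw new RuntimeException('Could not expand <page> element');
        }
        $page = $doc->importNode($expanded, true);

        yield [
            'title' => (string) firstText($page, 'title'),
            'ns'    => (int) firstText($page, 'ns'),
            'text'  => (string) firstText($page, 'text'),
        ];

        // Jump to the next sibling <page>, skipping the current subtree.
        $found = $reader->next('page');
    }
    $reader->close();
}

// Usage (the file name is an assumption):
// foreach (iteratePages('compress.bzip2://enwiktionary-latest-pages-articles.xml.bz2') as $p) {
//     echo $p['title'], "\n";
// }
```

Because only one page is expanded at a time, the compressed dump never has to be unpacked or loaded into memory as a whole.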

The file itself is a little larger, so it takes quite some time to process. Alternatively, do not operate on the XML dumps at all; just import the SQL dumps via the mysql command-line tool. SQL dumps are available on the site as well; see all dump formats for the English Wiktionary.


The overall file was a little larger, with more than 66 849 000 elements:

\-mediawiki (1)
  |-siteinfo (1)
  | |-sitename (1)
  | |-base (1)
  | |-generator (1)
  | |-case (1)
  | \-namespaces (1)
  |   \-namespace (40)
  \-page (3993913)
    |-title (3993913)
    |-ns (3993913)
    |-id (3993913)
    |-restrictions (552)
    |-revision (3993913)
    | |-id (3993913)
    | |-parentid (3572237)
    | |-timestamp (3993913)
    | |-contributor (3993913)
    | | |-username (3982087)
    | | |-id (3982087)
    | | \-ip (11824)
    | |-comment (3917241)
    | |-text (3993913)
    | |-sha1 (3993913)
    | |-model (3993913)
    | |-format (3993913)
    | \-minor (3384811)
    |-redirect (27340)
    \-DiscussionThreading (4698)
      |-ThreadSubject (4698)
      |-ThreadPage (4698)
      |-ThreadID (4698)
      |-ThreadAuthor (4698)
      |-ThreadEditStatus (4698)
      |-ThreadType (4698)
      |-ThreadSignature (4698)
      |-ThreadParent (3605)
      |-ThreadAncestor (3605)
      \-ThreadSummaryPage (11)
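
With roughly four million `page` elements, getting the parsed data into MySQL is probably best done in batches. The following is a minimal sketch, assuming the pages arrive as arrays with `title`, `ns` and `text` keys; the function name `storePages()`, the table `wikt_page` and its columns are hypothetical and not part of any existing schema.

```php
<?php
// Sketch only: batch-insert parsed pages through PDO.
// storePages(), the table wikt_page and its columns are hypothetical.

/**
 * Inserts each page and commits every $batchSize rows.
 * Returns the number of rows inserted.
 */
function storePages(PDO $pdo, iterable $pages, int $batchSize = 1000): int
{
    // Make failures throw instead of returning false silently.
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

    // One prepared statement, reused for every row.
    $stmt = $pdo->prepare(
        'INSERT INTO wikt_page (title, ns, body) VALUES (?, ?, ?)'
    );

    $count = 0;
    $pdo->beginTransaction();
    foreach ($pages as $page) {
        $stmt->execute([$page['title'], (int) $page['ns'], $page['text']]);

        // Commit periodically so a single transaction does not grow unbounded.
        if (++$count % $batchSize === 0) {
            $pdo->commit();
            $pdo->beginTransaction();
        }
    }
    $pdo->commit();

    return $count;
}

// Usage (DSN, credentials and table layout are assumptions):
// $pdo = new PDO('mysql:host=localhost;dbname=wiktionary;charset=utf8mb4', $user, $pass);
// $pdo->exec('CREATE TABLE wikt_page (title VARCHAR(255), ns INT, body MEDIUMTEXT)');
// storePages($pdo, $pages);
```

Grouping the inserts into transactions only helps on a transactional storage engine such as InnoDB; the batch size of 1000 is an arbitrary starting point, not a measured optimum.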

