
Getting a specific “page” from the Wikipedia XML dump

OK, so this is what I need:

  • I have downloaded and extracted the full Wikipedia XML dump (>40GB, single XML file)
  • I need to retrieve one particular <page> element (e.g. the page for the entry "Italy")

How can I do this? (Preferably with PHP code or some existing tool)

There is no guarantee that the full content of the page will be located sequentially; revisions might be anywhere in the same file or even in different XML files.
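If you still want to scan the dump locally, a streaming parser avoids loading the 40GB file into memory. The following is only a minimal sketch in Python (not PHP), assuming the usual dump layout of a root element with <page> children that each carry a <title>. The function name `find_pages` and the tiny `SAMPLE` document are made up for illustration. Because matching content may not sit in one place, the scan keeps going to the end of the file instead of stopping at the first hit.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file (illustrative data, not real dump content).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>France</title><revision><text>a</text></revision></page>"
    b"<page><title>Italy</title><revision><text>first</text></revision></page>"
    b"<page><title>Italy</title><revision><text>second</text></revision></page>"
    b"</mediawiki>"
)


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_pages(source, wanted_title):
    """Yield every <page> whose <title> equals wanted_title, as XML text.

    `source` is a file path or a binary file object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local_name(elem.tag) != "page":
            continue
        title = next(
            (child.text for child in elem if local_name(child.tag) == "title"),
            None,
        )
        if title == wanted_title:
            yield ET.tostring(elem, encoding="unicode")
        # Detach the pages handled so far so memory use stays bounded.
        root.clear()


if __name__ == "__main__":
    for page_xml in find_pages(io.BytesIO(SAMPLE), "Italy"):
        print(page_xml)
```

For a dump split across several files, the same function can be called once per file and the results concatenated. Expect a full pass over 40GB to take a long time.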

Please use or the web API's action=export, or at worst Special:Export. Not adding a link here because the output is huge.
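As a rough illustration of the Special:Export route, the URL for a single page can be built from its title. This is a sketch under assumptions: the host `en.wikipedia.org` and the helper name `export_url` are not from the answer, and the snippet only builds the URL without fetching it, since the response for a page like "Italy" can be very large.

```python
from urllib.parse import quote


def export_url(title, host="en.wikipedia.org"):
    """Build a Special:Export URL for one page title (host is an assumption)."""
    # MediaWiki page URLs use underscores in place of spaces.
    page = quote(title.replace(" ", "_"))
    return f"https://{host}/wiki/Special:Export/{page}"


if __name__ == "__main__":
    print(export_url("Italy"))
```

The exact parameters of the web API's export option are best checked against the MediaWiki API documentation before relying on them.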
