
Extract compressed text from MediaWiki database with PHP

A client of ours would like to have all the content from a wiki site they ran for a while. They provided us with the complete MediaWiki database. We are trying to extract the articles from the 'text' table with PHP, without using the MediaWiki engine.

MediaWiki seems to compress the content before storing it as a BLOB in the database. We can't find a way to extract it without the engine. I looked at the source code, but couldn't work out how it decodes the BLOBs.

Any suggestions on how to solve this?

From the documentation of the text table:

old_flags

Comma-separated list of flags. Contains the following possible values:

┌──────────┬──────────────────────────────────────────────────────────────────┐
│ gzip     │ Text is compressed with PHP's gzdeflate() function.              │
│          │ Note: If the $wgCompressRevisions option is on, new rows         │
│          │ (=current revisions) will be gzipped transparently at save time. │
│          │ Previous revisions can also be compressed by using the script    │
│          │ compressOld.php                                                  │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ utf-8    │ Text was stored as UTF-8.                                        │
│          │ Note: If the $wgLegacyEncoding option is on, rows *without* this │
│          │ flag will be converted to UTF-8 transparently at load time.      │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ object   │ Text field contained a serialized PHP object.                    │
│          │ Note: The object either contains multiple versions compressed    │
│          │ together to achieve a better compression ratio, or it refers to  │
│          │ another row where the text can be found.                         │
├──────────┼──────────────────────────────────────────────────────────────────┤
│ external │ Text was stored in an external location specified by old_text    │
└──────────┴──────────────────────────────────────────────────────────────────┘
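
Since old_flags is a comma-separated list, each row's flags have to be split before deciding how to treat old_text. A minimal sketch of that step follows; the helper name parse_old_flags() is made up for illustration and is not part of MediaWiki.

```php
<?php
// Hypothetical helper (not a MediaWiki function): turn an old_flags value
// such as "utf-8,gzip" into a clean list of flag names.
function parse_old_flags(string $oldFlags): array
{
    $flags = array_map('trim', explode(',', $oldFlags));

    // An empty old_flags column would otherwise yield one empty entry.
    return array_values(array_filter($flags, fn($flag) => $flag !== ''));
}

$flags = parse_old_flags('utf-8,gzip');

// Each check tells you which decoding step a row needs.
var_dump(in_array('gzip', $flags, true));     // row must be inflated
var_dump(in_array('object', $flags, true));   // row holds a serialized PHP object
var_dump(in_array('external', $flags, true)); // text lives outside this table
```

Rows flagged object or external need more than decompression, so it may be worth counting how many of those the dump contains before writing the extractor.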

https://www.mediawiki.org/wiki/Compression

Old entries marked with old_flags="gzip" have their old_text compressed with zlib's deflate algorithm, with no header bytes. PHP's gzinflate() will accept this text plainly; in Perl etc set the window size to -MAX_WSIZE to disable the header bytes.

According to the documentation, it should be as simple as feeding the blob data into PHP's gzinflate().
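
Something along these lines might work as a starting point. It is only a sketch: decode_old_text() is a made-up helper name, the sample blob is compressed locally instead of being read from the database, and rows flagged object or external are rejected because they need more than gzinflate().

```php
<?php
// Hypothetical helper (not part of MediaWiki): decode one row of the text
// table, given its old_text blob and its old_flags string.
function decode_old_text(string $oldText, string $oldFlags): string
{
    $flags = array_map('trim', explode(',', $oldFlags));

    // These rows need unserialize() or an external store; out of scope here.
    if (in_array('object', $flags, true) || in_array('external', $flags, true)) {
        throw new RuntimeException("Unsupported old_flags: $oldFlags");
    }

    if (in_array('gzip', $flags, true)) {
        // gzinflate() reads a raw deflate stream, which is what gzdeflate() writes.
        $inflated = @gzinflate($oldText);
        if ($inflated === false) {
            throw new RuntimeException('gzinflate() failed; the blob may be damaged');
        }
        return $inflated;
    }

    // No gzip flag: the text is stored as-is.
    return $oldText;
}

// Stand-in for a blob fetched from the database.
$blob = gzdeflate("'''Hello''' wiki");
echo decode_old_text($blob, 'utf-8,gzip'), "\n";
```

When reading real rows, make sure the database driver returns old_text as raw bytes; any charset conversion on the way out would corrupt the compressed data.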

Just a guess, but try it like this:

SELECT UNCOMPRESS(blobname)

By the way, I don't have experience with MediaWiki, but I hope this points you in the right direction.

Check out this page for more information on MySQL's compression methods.
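
One thing to check before relying on UNCOMPRESS(): MySQL documents that COMPRESS() stores a 4-byte length (low byte first) followed by zlib-compressed data, whereas the gzip flag above describes a raw deflate stream with no header bytes. The sketch below only illustrates that difference in layout from PHP; undo_mysql_compress() is a made-up helper name, and whether UNCOMPRESS() accepts a given MediaWiki blob is something to test against the actual data.

```php
<?php
$text = 'Some wiki text';

// Layout described for old_flags = "gzip": raw deflate, no header bytes.
$mediawikiBlob = gzdeflate($text);

// Layout MySQL documents for COMPRESS(): 4-byte little-endian length of the
// uncompressed data, then a zlib stream (header + deflate data + checksum).
$mysqlStyleBlob = pack('V', strlen($text)) . gzcompress($text);

// Hypothetical helper: reverse the COMPRESS() layout in PHP.
function undo_mysql_compress(string $blob)
{
    // Skip the 4-byte length prefix, then decode the zlib stream.
    return gzuncompress(substr($blob, 4));
}

var_dump(undo_mysql_compress($mysqlStyleBlob) === $text); // the two layouts
var_dump($mediawikiBlob === $mysqlStyleBlob);             // are not the same bytes
```

If the blobs in the dump are raw deflate, decoding them in PHP with gzinflate() avoids depending on the MySQL-specific layout.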
