
How to get Wikipedia “clean” content?

I'm using the MediaWiki API to get content from Wikipedia pages. I've written code that generates the following query (for example):

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=hawaii

This retrieves only the lead section of the Wikipedia page about Hawaii.

The problem is that, as you might notice, the result contains a lot of irrelevant wiki markup, such as:

"[[Molokai|Moloka{{okina}}i]], [[Lanai|Lāna{{okina}}i]], [[Kahoolawe|Kaho{{okina}}olawe]], [[Maui]] and the [[Hawaii (island)|" .

All those brackets [[ ]] are not relevant, and I wonder whether there is an elegant method to pull only 'clean' content from such pages?

Thanks in advance.

You can get clean HTML text from Wikipedia with this query:

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii

If you want plain text, without HTML, try this:

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii&explaintext
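To sketch how you might call this from code, here is a minimal Python example that builds the `extracts` query and pulls the plain text out of the JSON response. The parameter names (`explaintext`, `exintro`, `format=json`) come from the TextExtracts extension; the sample response below is abridged and hand-written to match the API's documented shape, so treat the parsing helper as a sketch, not a full client:

```python
import json
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"

def build_extract_url(title: str) -> str:
    """Build a TextExtracts query for the plain-text lead of a page."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "explaintext": 1,  # plain text instead of HTML
        "exintro": 1,      # lead section only
        "format": "json",
    }
    return API + "?" + urlencode(params)

def extract_text(response: dict) -> str:
    """Pull the extract out of the JSON response.

    The page ID key is not known in advance, so take the first page."""
    pages = response["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")

# Example response shaped like the API's output (abridged, hand-written):
sample = {"query": {"pages": {"13270": {
    "pageid": 13270, "title": "Hawaii",
    "extract": "Hawaii is a state..."}}}}

print(build_extract_url("hawaii"))
print(extract_text(sample))  # Hawaii is a state...
```

Fetching the URL (with `urllib.request` or any HTTP library) and feeding the decoded JSON to `extract_text` gives you the clean text directly.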

Please try stripping the link markup with a regular expression (note that the brackets must be escaped):

$relevant = preg_replace('/\[\[.*?\]\]/', '', $string);

