
Wikipedia API returning extract without all characters in article?

Not sure if I should ask this here, but I can't figure it out.

I first noticed the issue on Wikipedia's "Meme" article (https://en.wikipedia.org/wiki/Meme). Several special characters used for the pronunciation don't appear in the extract returned by the MediaWiki API (https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=revisions|extracts&redirects=true&titles=meme).

I couldn't find a solution in the MediaWiki API documentation, and the alternatives I tried didn't work out: parsing the entire page with jsoup didn't reliably give me the article content I need, whereas the extract query does.

The extracts API tries to sanitize the text in various ways to make it more readable (you might have noticed that the italicized sentences preceding the pronunciation do not show up either). Part of that is removing every element with the noexcerpt class, which includes the pronunciation spelling. (In the future, text in parentheses might be removed completely to handle metadata creep.)
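
To make the effect concrete, here is a rough Python sketch of that kind of filtering: it drops every element carrying the noexcerpt class and keeps the remaining text. This is only an illustration, not the actual code behind the extracts API; the names NoExcerptStripper and strip_noexcerpt are made up for this example, and any HTML fed to it here is a simplified stand-in for the real article markup.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoExcerptStripper(HTMLParser):
    """Collects text, skipping anything nested inside class="noexcerpt"."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a noexcerpt element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside a skipped element, count nested tags so we know
        # when the outermost noexcerpt element has been closed.
        if self.skip_depth or "noexcerpt" in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_noexcerpt(html):
    """Return the plain text of html without noexcerpt elements."""
    parser = NoExcerptStripper()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind by the removed elements.
    return " ".join("".join(parser.parts).split())
```

Feeding it a snippet in which the pronunciation is wrapped in a span with class noexcerpt returns the sentence without the pronunciation, which matches what the question describes. The sketch assumes well-formed HTML; it does not try to recover from unclosed tags.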

