
Strip all hyperlinks that appear before the text (Wikipedia dump)

I am working on the Wikipedia dump to extract the first sentence of each article's first paragraph. The dump is highly unstructured: almost all of the information is stored in a single column. It uses combinations of symbols as markup to control how text is displayed; for example, '''word''' is rendered in bold. Hyperlinks work the same way and are written as [[ ]]. Since I want the first sentence of the first paragraph (including its hyperlinks), I need to strip all the extra markup that is not part of the text. I did so using

 $text = preg_replace('#\{\{.*?\}\}#s', '', $text);

Sample text (from wikipedia dump):

{{Ver desambig}}
{{Mais notas||ci|data=janeiro de 2013}}
{{Info/Taxonomia}}
[[Ficheiro:Pêra amarela.JPG|thumbnail|upright]] // image link, which I don't want

A {{AO-pAO|pera|pêra}} é o fruto comestível da pereira, uma [[árvore]] do. //first sentence of first paragraph

I stripped the {{ }} blocks, so I am left with just:

[[Ficheiro:Pêra amarela.JPG|thumbnail|upright]]
A {{AO-pAO|pera|pêra}} é o fruto comestível da pereira, uma [[árvore]] do.

As you can see, there are two hyperlinks here (hyperlinks are represented by [[ ]]). I want to keep the one inside the first sentence, i.e. árvore, but I don't want any [[ ]] sequences before it. I tried stripping every [[ ]] from the text, but that removes árvore too, which I don't want.

PS: There might be more than one hyperlink before the start of the first sentence. Can this be done with a regex? I am using PHP. Thanks.

Use the regex below and replace the matched characters with an empty string.

(?s)^(?:\s*{{.*?}}|\s*\[\[.*?]])*\n?

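Without the multiline flag, ^ matches only at the very start of the string. DOTALL mode (?s) makes . match newlines as well, so a {{ }} or [[ ]] block that spans several lines is still matched.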
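
As a rough sketch of how that regex could be applied in PHP: the function name stripLeadingMarkup is hypothetical, the braces are escaped only for readability (the pattern is otherwise the one above), and the /u flag assumes the dump text is valid UTF-8. The sample is the text from the question.

```php
<?php
// Hypothetical helper (not from the original answer): removes every leading
// {{...}} template and [[...]] link that appears before the first plain text.
function stripLeadingMarkup(string $text): string
{
    // (?s)     : "." also matches newlines, so multi-line blocks are covered
    // ^        : start of the string only, because the "m" flag is not set
    // (?:...)* : any number of leading {{...}} or [[...]] blocks
    // \n?      : one newline directly after the last block, if present
    $pattern = '/(?s)^(?:\s*\{\{.*?\}\}|\s*\[\[.*?\]\])*\n?/u';

    // preg_replace returns null on failure (e.g. invalid UTF-8 with /u);
    // fall back to the untouched input in that case.
    return preg_replace($pattern, '', $text) ?? $text;
}

$sample = <<<'WIKI'
{{Ver desambig}}
{{Mais notas||ci|data=janeiro de 2013}}
{{Info/Taxonomia}}
[[Ficheiro:Pêra amarela.JPG|thumbnail|upright]]

A {{AO-pAO|pera|pêra}} é o fruto comestível da pereira, uma [[árvore]] do.
WIKI;

// ltrim() drops any blank line left between the markup and the sentence.
echo ltrim(stripLeadingMarkup($sample)), PHP_EOL;
```

Because the match is anchored at the start of the string, the {{AO-pAO|pera|pêra}} template and the [[árvore]] link inside the sentence are left alone; only the blocks that precede the first plain text are removed.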


You might want to use a wiki syntax parser and modify it for your needs.

http://www.mediawiki.org/wiki/Alternative_parsers
