
Python regular expression with wiki text

I'm trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links:

  • [[Name of page]]
  • [[Name of page|Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache:

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

The conflict between the [[ ]] and [[ | ]] syntaxes is my main problem. I don't need one complex regular expression; applying multiple (maybe two) regular expression substitutions in sequence is OK.

Please enlighten me on this problem.

import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz
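In case the linked example is unavailable, here is a minimal sketch that applies the same pattern to the sentence from the question. The variable names `sample` and `result` are only illustrative.

```python
import re

# Same pattern as above: an optional "target|" prefix, then the text to keep.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")

# Replace each link with its captured display text.
result = wikilink_rx.sub(r'\1', sample)
print(result)
```

The optional non-capturing group swallows the page name and the pipe when they are present, so the single capturing group always holds the text to display.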

Note: you may also find some MediaWiki parsers at http://www.mediawiki.org/wiki/Alternative_parsers.

You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:

http://code.pediapress.com/wiki/wiki/mwlib

I came up with a regex which should do the trick. Let me know if there's anything wrong with it:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick, I will never get over how ugly these things are!)

Group 1 should give you the wiki link (the page name). Group 4 should give you the link text, or None if there is no pipe.

An explanation:

  • (([^\]|]|\](?=[^\]]))*) finds all sequences of characters which are not "|" or "]]". It does this by matching every character which is not "|" or "]", or which is a "]" followed by a character which is not a "]".
  • (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the same regex as above, to get the link text part. The regex is slightly changed in that it allows "|" characters.
  • Obviously the whole thing is surrounded by \[\[ ... \]\].
  • The (?=...) notation matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
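To make the group numbering concrete, here is a rough sketch that runs the regex against a few inputs and prints groups 1 and 4. The names `link_rx` and `to_plain` are made up for illustration; falling back to group 1 when there is no pipe is an assumption about how the groups would be used for the substitution.

```python
import re

link_rx = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)

# Inspect the two groups of interest for a few link forms.
for src in ["[[The Beatles]]", "[[record producer|produced]]", "[[abcd]|efgh]]"]:
    m = link_rx.search(src)
    print(src, "->", repr(m.group(1)), repr(m.group(4)))

def to_plain(text):
    # Use the link text (group 4) when a pipe is present, else the page name.
    return link_rx.sub(
        lambda m: m.group(4) if m.group(4) is not None else m.group(1), text
    )

print(to_plain("[[cover version]]s of [[The Beatles]] songs"))
```

A replacement function is used instead of a template string so that the missing group 4 can be handled explicitly.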

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]',r'\2',text)
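The question also allows for two substitutions applied in sequence. A possible sketch of that variant follows; the function name `strip_links_two_pass` and both patterns are illustrative assumptions, not taken from any of the answers.

```python
import re

def strip_links_two_pass(text):
    # Pass 1: [[Name of page|Text to display]] -> Text to display
    text = re.sub(r'\[\[[^|\]]*\|([^\]]+)\]\]', r'\1', text)
    # Pass 2: the remaining [[Name of page]] links -> Name of page
    return re.sub(r'\[\[([^|\]]+)\]\]', r'\1', text)

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")
print(strip_links_two_pass(sample))
```

Handling the piped form first means the second pattern only ever sees plain links, which keeps each expression simple.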
