Regular expression to match a specific tag

I wrote the following regular expression:

<w:p.*>\[.*content.*\].*</w:p>

It works fine, but it sometimes matches tags that I do not want.

I have a string taken from a word-processing document, like this:

<w:p w:rsidR=‘00E52FD7’ w:rsidRDefault=‘00341592’ w:rsidP=‘000307E7’><w:pPr><w:pStyle w:val=‘Heading1’/><w:contextualSpacing w:val=‘0’/><w:jc w:val=‘center’/></w:pPr><w:r><w:rPr><w:noProof/></w:rPr><w:drawing><wp:inline distT=‘0’ distB=‘0’ distL=‘0’ distR=‘0’ wp14:anchorId=‘4F64B28D’ wp14:editId=‘6522B16C’><wp:extent cx=‘1358306’ cy=‘1343025’/><wp:effectExtent l=‘0’ t=‘0’ r=‘0’ b=‘0’/><wp:docPr id=‘2’ name=‘Picture 2’ descr=‘N:\HUMAN RESOURCES\Logos\Rancho-Logo-Type-Black.png’/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a=‘http://schemas.openxmlformats.org/drawingml/2006/main’ noChangeAspect=‘1’/></wp:cNvGraphicFramePr><a:graphic xmlns:a=‘http://schemas.openxmlformats.org/drawingml/2006/main’><a:graphicData uri=‘http://schemas.openxmlformats.org/drawingml/2006/picture’><pic:pic xmlns:pic=‘http://schemas.openxmlformats.org/drawingml/2006/picture’><pic:nvPicPr><pic:cNvPr id=‘0’ name=‘Picture 1’ descr=‘N:\HUMAN RESOURCES\Logos\Rancho-Logo-Type-Black.png’/><pic:cNvPicPr><a:picLocks noChangeAspect=‘1’ noChangeArrowheads=‘1’/></pic:cNvPicPr></pic:nvPicPr><pic:blipFill><a:blip r:embed=‘rId7’ cstate=‘print’><a:extLst><a:ext uri=‘{28A0092B-C50C-407E-A947-70E740481C1C}’><a14:useLocalDpi xmlns:a14=‘http://schemas.microsoft.com/office/drawing/2010/main’ val=‘0’/></a:ext></a:extLst></a:blip><a:srcRect/><a:stretch><a:fillRect/></a:stretch></pic:blipFill><pic:spPr bwMode=‘auto’><a:xfrm><a:off x=‘0’ y=‘0’/><a:ext cx=‘1374505’ cy=‘1359042’/></a:xfrm><a:prstGeom prst=‘rect’><a:avLst/></a:prstGeom><a:noFill/><a:ln><a:noFill/></a:ln></pic:spPr></pic:pic></a:graphicData></a:graphic></wp:inline></w:drawing></w:r></w:p><w:p w:rsidR=‘00341592’ w:rsidRPr=‘00341592’ w:rsidRDefault=‘002F27D8’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Subtitle’/><w:contextualSpacing w:val=‘0’/><w:rPr><w:sz w:val=‘36’/><w:szCs w:val=‘36’/></w:rPr></w:pPr><w:r><w:t xml:space=‘preserve’>Job Description: </w:t></w:r><w:r w:rsidR=‘00360E41’><w:t>Irrigation/</w:t></w:r><w:r 
w:rsidR=‘004A20D0’><w:t>Maintenance Worker</w:t></w:r></w:p><w:p w:rsidR=‘000307E7’ w:rsidRDefault=‘000307E7’ w:rsidP=‘000307E7’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:bookmarkStart w:id=‘0’ w:name=‘h.17ary2u5jp34’ w:colFirst=‘0’ w:colLast=‘0’/><w:bookmarkEnd w:id=‘0’/></w:p><w:p w:rsidR=‘00007B19’ w:rsidRDefault=‘00007B19’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘00533338’ w:rsidRDefault=‘000307E7’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:r><w:t xml:space=‘preserve’>Rancho has reviewed the duties described within this job description to ensure that essential functions and basic duties are included.  It is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities required of an incumbent.  An incumbent may be asked to perform other duties as required or assigned by their supervisor.  </w:t></w:r></w:p><w:p w:rsidR=‘00533338’ w:rsidRDefault=‘00533338’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘00710D42’ w:rsidRDefault=‘00710D42’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘004618DB’ w:rsidRDefault=‘004618DB’ w:rsidP=‘004618DB’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:r><w:t>[</w:t></w:r><w:proofErr w:type=‘gramStart’/><w:r><w:t>content</w:t></w:r><w:proofErr w:type=‘gramEnd’/><w:r><w:t>]</w:t></w:r></w:p>

My requirement is to select the <w:p> tag that contains

[content]

However, this expression also matches extra <w:p> tags that do not contain the required text.

Can anyone help me?

If you have an XML file to deal with, it is advisable to use an XML parser. If you only have this short fragment and need it for a one-off task, you may use either of the two regex approaches.
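For the parser route, here is a minimal sketch using LINQ to XML. It rests on a few assumptions that are not in the original post: the fragment uses straight quotes, the w: prefix is bound to the standard WordprocessingML namespace, and the fragment is wrapped in a hypothetical <root> element so the prefix can be declared. The names ParagraphFinder and FindParagraphs are made up for illustration. The full fragment from the question also uses other prefixes (wp:, wp14:, r:) that would have to be declared on the wrapper in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ParagraphFinder
{
    // Namespace the w: prefix is normally bound to in WordprocessingML.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns every w:p element whose concatenated text contains the marker.
    public static List<XElement> FindParagraphs(string fragment, string marker)
    {
        // The fragment carries no declaration for w:, so wrap it in a
        // hypothetical root element that binds the prefix.
        string wrapped =
            "<root xmlns:w=\"" + W.NamespaceName + "\">" + fragment + "</root>";
        XElement root = XElement.Parse(wrapped);

        return root.Descendants(W + "p")
            // XElement.Value concatenates all descendant text nodes, so a
            // marker split across several w:t runs is still found.
            .Where(p => p.Value.Contains(marker))
            .ToList();
    }

    public static void Main()
    {
        string fragment =
            "<w:p><w:r><w:t>Job Description: </w:t></w:r></w:p>" +
            "<w:p><w:r><w:t>[</w:t></w:r><w:r><w:t>content</w:t></w:r>" +
            "<w:r><w:t>]</w:t></w:r></w:p>";

        foreach (XElement p in FindParagraphs(fragment, "[content]"))
            Console.WriteLine(p.ToString(SaveOptions.DisableFormatting));
    }
}
```

Because the comparison is made on the paragraph text rather than on the raw markup, this also covers the case in the question where [ , content and ] sit in three separate runs.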

Extract all the <w:p> elements, check which of them contain [content] , and return only those substrings:

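// Requires: using System.Linq; using System.Text.RegularExpressions;
// s holds the XML string to search.
var results = Regex.Matches(s, @"(?s)<w:p\b[^>]*>(.*?)</w:p>")
    .Cast<Match>()
    .Where(x => x.Groups[1].Value.Contains("[content]"))  // keep paragraphs whose inner XML has the marker
    .Select(z => z.Value);                                // return the whole <w:p>...</w:p> substring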

Note that here, (?s)<w:p\b[^>]*>(.*?)</w:p> matches <w:p , then asserts with a \b word boundary that no word character follows immediately (so <w:pPr is skipped), then matches the rest of the opening tag by consuming zero or more characters other than > followed by > , then captures any zero or more characters, as few as possible, into Group 1 ( x.Groups[1].Value ), and finally matches </w:p> . The .Where(x => x.Groups[1].Value.Contains("[content]")) condition keeps only the matches whose inner XML contains [content] .
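One thing to watch: in the string from the question, [ , content and ] sit in separate <w:t> runs, so a check for the contiguous literal [content] in the inner XML would not find them. The sketch below is a variant of the extract-and-filter approach that strips the tags from Group 1 before comparing. The tag-stripping step and the names ExtractAndFilter and Find are assumptions added for illustration, and the sample string is a shortened, straight-quoted stand-in for the real data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ExtractAndFilter
{
    // Returns each <w:p>...</w:p> substring whose visible text contains marker.
    public static List<string> Find(string s, string marker)
    {
        return Regex.Matches(s, @"(?s)<w:p\b[^>]*>(.*?)</w:p>")
            .Cast<Match>()
            // Drop the tags from the inner XML first, so a marker split
            // across several runs still compares as one piece of text.
            .Where(m => Regex.Replace(m.Groups[1].Value, @"<[^>]*>", "")
                             .Contains(marker))
            .Select(m => m.Value)
            .ToList();
    }

    public static void Main()
    {
        string s =
            "<w:p w:rsidR='1'><w:pPr><w:pStyle w:val='Normal1'/></w:pPr></w:p>" +
            "<w:p w:rsidR='2'><w:r><w:t>[</w:t></w:r><w:r><w:t>content</w:t></w:r>" +
            "<w:r><w:t>]</w:t></w:r></w:p>";

        foreach (string p in Find(s, "[content]"))
            Console.WriteLine(p);
    }
}
```

If the marker is always written in a single run, the plain Contains check on Group 1 is enough and the Regex.Replace call can be dropped.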

Use a more sophisticated regex with a tempered greedy token:

(?s)<w:p\b[^>]*>(?:(?!<w:p\b).)*?\[content].*?</w:p>

Details

  • (?s) - the inline form of the RegexOptions.Singleline option, which lets . match line breaks too
  • <w:p - a literal <w:p substring
  • \b - a word boundary
  • [^>]* - zero or more characters other than >
  • > - a literal >
  • (?:(?!<w:p\b).)*? - any character, zero or more times but as few as possible, that does not start a <w:p sequence followed by a word boundary
  • \[content] - a literal [content] substring
  • .*? - any zero or more characters, as few as possible
  • </w:p> - a literal </w:p> substring
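As a rough illustration of the pattern broken down above, the following sketch runs it against a shortened two-paragraph string. The class name TemperedToken, the method FindFirst and the sample data are hypothetical. The sample keeps [content] in a single run, since the pattern looks for that contiguous literal.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TemperedToken
{
    private const string Pattern =
        @"(?s)<w:p\b[^>]*>(?:(?!<w:p\b).)*?\[content].*?</w:p>";

    // Returns the first <w:p> element that contains the literal [content],
    // or an empty string when there is none.
    public static string FindFirst(string s)
    {
        Match m = Regex.Match(s, Pattern);
        return m.Success ? m.Value : string.Empty;
    }

    public static void Main()
    {
        string s =
            "<w:p w:rsidR='1'><w:r><w:t>Job Description</w:t></w:r></w:p>" +
            "<w:p w:rsidR='2'><w:r><w:t>[content]</w:t></w:r></w:p>";

        // The tempered token stops the match from running past the start of
        // the next <w:p, so the first paragraph cannot be swallowed.
        Console.WriteLine(FindFirst(s));
    }
}
```

Without the (?!<w:p\b) lookahead, a match starting at the first paragraph could extend across paragraph boundaries until it reached [content] , which is the over-matching described in the question.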
