C# HTML Tag parsing through REGEX

<p style="color: rgb(34, 34, 34); font-family: Arial, Verdana, sans-serif; font-size: 12px; line-height: normal;">My name is Faysal </p>

I want to extract only the string "My name is Faysal". I've written the following snippet, but it returns nothing. What do I need to change?

    WebClient web = new WebClient();
    String html = web.DownloadString("http://www.dmp.gov.bd/application/index/pressdetails/press_159");

    MatchCollection m1 = Regex.Matches(html, "<p style=\"color: rgb(34, 34, 34); font-family: Arial, Verdana, sans-serif; font-size: 12px; line-height: normal;\">\\s*(.+?)\\s*</p>", RegexOptions.Singleline);

    foreach (Match m in m1) {
        String head = m.Groups[1].Value;

        Console.WriteLine(head);
    }

You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML.

(retrieved from "RegEx match open tags"...)

I hope you will learn this just like I did a long time ago. You can NOT parse HTML using RegEx. It is more efficient to use a parser built for HTML.

  • If your page is XML or XHTML, you can use the built-in parsing libraries,
    for example System.Xml.XmlDocument.

  • If it is plain HTML, use HtmlAgilityPack or another similar parser.
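For the XML/XHTML case, a minimal sketch with System.Xml.XmlDocument might look like the following. The class and method names are made up for illustration, and the sketch assumes the input is a well-formed fragment with no DOCTYPE and no default xmlns namespace (a namespaced XHTML page would need an XmlNamespaceManager and a prefixed XPath).

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of the original question or answer.
public static class XhtmlParagraphReader
{
    // Returns the trimmed text of the first <p> that carries a style
    // attribute, or null when no such element exists.
    public static string FirstStyledParagraph(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

        XmlNode node = doc.SelectSingleNode("//p[@style]");
        return node == null ? null : node.InnerText.Trim();
    }
}
```

Calling XhtmlParagraphReader.FirstStyledParagraph("<div><p style=\"color: red;\">My name is Faysal </p></div>") should give back "My name is Faysal". Real-world pages are rarely well-formed XML, which is why the second option is usually the safer one.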

What I would do in your case is select the first p element whose style attribute is set to the value you are looking for.
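With HtmlAgilityPack (a third-party NuGet package, not part of .NET), that idea could be sketched roughly as below. The class and method names are hypothetical, and comparing the style attribute as an exact string is an assumption made here to mirror the question; a looser check may suit the real page better.

```csharp
using System;
using HtmlAgilityPack; // third-party package; install it from NuGet first

// Hypothetical helper, not part of the original question or answer.
public static class StyledParagraphFinder
{
    // Returns the trimmed text of the first <p> whose style attribute
    // equals the given value exactly, or null when nothing matches.
    public static string FindFirst(string html, string style)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when no node matches.
        HtmlNodeCollection candidates = doc.DocumentNode.SelectNodes("//p[@style]");
        if (candidates == null)
            return null;

        foreach (HtmlNode p in candidates)
        {
            if (p.GetAttributeValue("style", "") == style)
                return HtmlEntity.DeEntitize(p.InnerText).Trim();
        }
        return null;
    }
}
```

Filtering on the attribute in C# rather than inside the XPath expression avoids having to quote the style value, which itself contains parentheses, commas and semicolons.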

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

No, please don't look down here!

.

.

.

.

.

.

.

Excuse me, mods, if this answer is too long.

.

.

.

.

What you see below is UGLY, and NOT RECOMMENDED! I BEG OF YOU, DON'T LOOK!

.

.

.

.

.

.

.

.

"lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ"

.

.

.

.

.

.

.

.

.

If you absolutely have your heart set on using RegEx (kill me for saying this), then try the following expression.

<p style=\"color: rgb\(34, 34, 34\); font-family: Arial, Verdana, sans-serif; font-size: 12px; line-height: normal;\">\s*(.+?)\s*</p>

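It's the same as yours, except that the parentheses after "rgb" are escaped. Unescaped, "(34, 34, 34)" is a capture group rather than literal parentheses, so your pattern looks for "rgb34, 34, 34" and never matches. The expression above is the raw regex, so the "\\s" from your C# string appears here as "\s"; inside a regular C# string literal every backslash still has to be doubled ("\\(", "\\)", "\\s").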
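To show how that expression could be written in C#, here is a small sketch that uses a verbatim string (@"..."), where backslashes are taken literally and a doubled "" stands for one double quote. The class and method names are invented for illustration; the pattern is the one above, applied to the sample paragraph from the question rather than to the live page.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original question or answer.
public static class ParagraphRegexDemo
{
    // Verbatim string: \( \) and \s reach the regex engine unchanged,
    // and "" is a single literal double quote.
    private const string Pattern =
        @"<p style=""color: rgb\(34, 34, 34\); font-family: Arial, Verdana, sans-serif; font-size: 12px; line-height: normal;"">\s*(.+?)\s*</p>";

    // Returns the captured inner text of every matching paragraph.
    public static string[] ExtractParagraphs(string html)
    {
        MatchCollection matches = Regex.Matches(html, Pattern, RegexOptions.Singleline);
        var results = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            results[i] = matches[i].Groups[1].Value;
        }
        return results;
    }
}
```

Fed the paragraph from the question, ExtractParagraphs returns a single element, "My name is Faysal". The match still depends on the style attribute being byte-for-byte identical, so any change in spacing or attribute order on the page breaks it, which is the main argument for the parser-based approach.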

Edit

If it helps, I viewed the HTML from that website, and I could not find "My name is Faysal" in it.
