Regex to find substrings of a string missing some parts

I have a long string (the HTML of a website) and I want to scrape substrings from it.

For example, part of the HTML looks like this:

<span title="Minecraft: Pocket Edition" class="oneline-info title-info">
  <a href="/apps/ios/app/minecraft-pocket-edition/">Minecraft: Pocket Edition</a>
</span>


    <span title="Mojang" class="oneline-info add-info" data-items="1">
        <a href="/apps/ios/publisher/mojang/">Mojang</a>
    </span>

I want to scrape everything from <span title= to </span>. In the example above, that means two separate matches.

So I have this code:

        var matches = Regex.Matches(s, @"<span title=(?<content>(?:(?!""</span>).)+)");
        scrapeTitles.AddRange(matches.Cast<Match>().Select(x => x.Groups["content"].Value).ToList());

But for some reason, it doesn't capture everything between those two delimiters. It only gives me output like this:

"Minecraft: Pocket Edition" class="oneline-info title-info">
"Mojang" class="oneline-info add-info" data-items="1">
"Clash of Clans" class="oneline-info title-info">
"Supercell" class="oneline-info add-info" data-items="1">

I need to scrape all the data, including the <a> line:

"Mojang" class="oneline-info add-info" data-items="1">
            <a href="/apps/ios/publisher/mojang/">Mojang</a>

The problem is that your pattern doesn't handle the newline character. By default, . does not match \n, so the match stops at the end of the first line.

Try this one:

<span title=(?<content>(?:(.|\n)(?!</span>))+)

See live version.
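As an alternative to spelling out (.|\n), the same idea can be sketched with RegexOptions.Singleline, which makes . match newlines too. The snippet below is only an illustration, not the answerer's exact pattern: it drops the quote from the lookahead and checks the lookahead before each character, so the capture ends right before </span>. The sample input is copied from the question; the "\n" line endings and the use of top-level statements (C# 9 or later) are assumptions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Sample input taken from the question; "\n" line endings are an assumption.
string s =
    "<span title=\"Minecraft: Pocket Edition\" class=\"oneline-info title-info\">\n" +
    "  <a href=\"/apps/ios/app/minecraft-pocket-edition/\">Minecraft: Pocket Edition</a>\n" +
    "</span>\n" +
    "\n" +
    "    <span title=\"Mojang\" class=\"oneline-info add-info\" data-items=\"1\">\n" +
    "        <a href=\"/apps/ios/publisher/mojang/\">Mojang</a>\n" +
    "    </span>\n";

// RegexOptions.Singleline makes '.' match '\n' as well.
// The negative lookahead is tested before each character is consumed,
// so the capture stops immediately before the first "</span>".
var matches = Regex.Matches(
    s,
    @"<span title=(?<content>(?:(?!</span>).)+)",
    RegexOptions.Singleline);

var scrapeTitles = matches
    .Cast<Match>()
    .Select(m => m.Groups["content"].Value)
    .ToList();

foreach (var title in scrapeTitles)
{
    Console.WriteLine(title);
    Console.WriteLine("-----");
}
```

Each captured value keeps the line breaks and indentation of the original markup, so it may need trimming before further use.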

Disclaimer: I strongly recommend NOT parsing HTML (SGML, actually) with regular expressions. It leads to broken behavior in the long run.

You are not capturing the newlines, so you can either update the regex to handle them or do the following:

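// The lookahead is (?!</span>) with no leading quote; with ""</span> it never triggers and the match runs to the end of the flattened string.
var matches = Regex.Matches(s.Replace(Environment.NewLine, string.Empty), @"<span title=(?<content>(?:(?!</span>).)+)");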
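One caveat with this approach: Environment.NewLine is "\r\n" on Windows and "\n" on Unix-like systems, while a downloaded page may use either convention regardless of where the code runs. A minimal sketch that strips both kinds of line break before matching could look like the following. The input reuses the Mojang snippet from the question with deliberately mixed line endings (an assumption for illustration), and the variable names html, flattened and contents are hypothetical. It assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Snippet from the question, with mixed "\r\n" and "\n" endings (assumed for illustration).
string html =
    "<span title=\"Mojang\" class=\"oneline-info add-info\" data-items=\"1\">\r\n" +
    "    <a href=\"/apps/ios/publisher/mojang/\">Mojang</a>\n" +
    "</span>";

// Remove every line break, whichever convention the page uses.
string flattened = Regex.Replace(html, @"\r\n?|\n", string.Empty);

// With the line breaks gone, '.' can reach "</span>" without RegexOptions.Singleline.
var contents = Regex.Matches(flattened, @"<span title=(?<content>(?:(?!</span>).)+)")
    .Cast<Match>()
    .Select(m => m.Groups["content"].Value)
    .ToList();

foreach (var content in contents)
{
    Console.WriteLine(content);
}
```

Unlike the Singleline variant, this changes the text itself, so the captured values no longer contain the original line breaks.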
