Regular expression to extract the required data from a string in C#

Question

I have a webpage. If I look at the page's source ("view-source"), I find multiple instances of the following statement:

<td class="my_class" itemprop="main_item">statement 1</td>
<td class="my_class" itemprop="main_item">statement 2</td>
<td class="my_class" itemprop="main_item">statement 3</td>

I want to extract data like this:

statement 1
statement 2
statement 3

To accomplish this, I have written a method "GetContent" which takes a URL as its parameter and copies the entire source of the webpage into a C# string.

private string GetContent(string url)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Timeout = 100000; // milliseconds

    // Dispose the response and the reader once the content has been read.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (StreamReader respStream = new StreamReader(response.GetResponseStream()))
    {
        return respStream.ReadToEnd();
    }
}

Now I want to create a method "GetMyList" which will extract the list I want. I am looking for a regex that can serve this purpose. Any help is highly appreciated.

Answer 1

Using the HTML Agility Pack, this would be really easy:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// Select the text nodes inside every <td> whose itemprop is "main_item".
// (Use "//td//text()" instead to match every <td>.)
var nodes = doc.DocumentNode.SelectNodes("//td[@itemprop=\"main_item\"]//text()");
var list = new List<string>();
if (nodes != null) // SelectNodes returns null when nothing matches
{
    foreach (var m in nodes)
    {
        list.Add(m.InnerText);
    }
}

But if you want a regex, try this:

// Matches every <td>...</td> pair; group 1 holds the cell content.
string regularExpressionPattern1 = @"<td.*?>(.*?)<\/td>";
// Singleline lets '.' match newlines, so cells spanning several lines are found too.
Regex regex = new Regex(regularExpressionPattern1, RegexOptions.Singleline);
MatchCollection collection = regex.Matches(html);
var list = new List<string>();
foreach (Match m in collection)
{
    list.Add(m.Groups[1].Value);
}
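The pattern above matches every td cell on the page. If only the cells marked itemprop="main_item" are wanted, the pattern can be narrowed. The following is a minimal sketch, not a definitive implementation: the class name MainItemExtractor and the exact pattern are my own assumptions, and it assumes the attribute value is always written in double quotes as in the question's markup. It also decodes HTML entities with WebUtility.HtmlDecode, which the original snippet does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the name is illustrative only.
public static class MainItemExtractor
{
    // Assumed pattern: a <td> tag that carries itemprop="main_item" (double-quoted),
    // followed by the shortest possible content up to the closing </td>.
    private static readonly Regex MainItemCell = new Regex(
        @"<td\b[^>]*\bitemprop\s*=\s*""main_item""[^>]*>(.*?)</td>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static List<string> GetMyList(string html)
    {
        var list = new List<string>();
        foreach (Match m in MainItemCell.Matches(html))
        {
            // Group 1 is the cell content; decode entities such as &amp; and trim whitespace.
            list.Add(WebUtility.HtmlDecode(m.Groups[1].Value).Trim());
        }
        return list;
    }
}
```

Calling MainItemExtractor.GetMyList(GetContent(url)) would then return only the main_item cells. Nested tags inside a cell are returned as-is, which is one reason a parser is the more robust choice.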

Answer 2

Hossein's answer is pretty much the solution (and I would recommend using a parser if you have the option), but a regular expression that wraps the tags in non-capturing groups, (?:...), leaves the cell text as the only captured group, so you get statement 1, statement 2, and so on exactly as you need them:

IEnumerable<string> GetMyList(string str)
{
    foreach(Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
        yield return m.Groups[1].Value;
}

See the explanation at regex101 for a more detailed description.
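To see the method in action, here is a small usage sketch that runs it against the three sample rows from the question. The class name Answer2Demo, the SampleHtml constant, and the PrintSample method are hypothetical names introduced only for this illustration. GetMyList itself is copied unchanged from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical wrapper class for demonstration purposes.
public static class Answer2Demo
{
    // The three rows from the question, joined with newlines.
    public const string SampleHtml =
        "<td class=\"my_class\" itemprop=\"main_item\">statement 1</td>\n" +
        "<td class=\"my_class\" itemprop=\"main_item\">statement 2</td>\n" +
        "<td class=\"my_class\" itemprop=\"main_item\">statement 3</td>";

    // Same method as in the answer: group 1 is the only capturing group.
    public static IEnumerable<string> GetMyList(string str)
    {
        foreach (Match m in Regex.Matches(str, @"(?:<td.*?>)(.*?)(?:<\/td>)"))
            yield return m.Groups[1].Value;
    }

    // Call this from Main to print one extracted value per line.
    public static void PrintSample()
    {
        foreach (string s in GetMyList(SampleHtml))
            Console.WriteLine(s);
    }
}
```

Because no RegexOptions.Singleline is passed here, a cell whose content spans several lines would not be matched. For the single-line rows in the question this makes no difference.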

Answer 1
3, accepted, 2018-08-29 04:03:50

Answer 2
1, 2018-08-29 04:38:04