
Html page parsing using Html Agility Pack

I'm trying to parse an IMDb page with a regex (I know HAP is better), but my regex is wrong, so maybe you can advise me on how to use HAP correctly.

This is the part of the page I'm trying to parse. I need to extract two numbers from it:

5 out of 5 people (I need these two fives, as two separate numbers)

<small>5 out of 5 people found the following review useful:</small>
<br>
<a href="/user/ur1174211/">
<h2>Interesting, Particularly in Comparison With "La Sortie des usines Lumière"</h2>
<b>Author:</b>
<a href="/user/ur1174211/">Snow Leopard</a>
<small>from Ohio</small>
<br>
<small>10 March 2005</small>


And this is my C# code:

Regex reg1 = new Regex("([0-9]+(out of)+[0-9])");
for (int i = 0; i < number; i++)
        {
            Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
            Match m = reg1.Match(header[i].InnerHtml);

            if (!m.Success)
            {
                return;
            }
            else
            {
                string str1 = m.Value.Split(' ')[0];
                string str2 = m.Value.Split(' ')[3];

                if (!Int32.TryParse(str1, out index1))
                {
                    return;
                }
                if (!Int32.TryParse(str2, out index2))
                {
                    return;
                }
                Console.WriteLine("index1 = {0}", index1);
                Console.WriteLine("index2 = {0}", index2);
            }
        }

Many thanks to everybody who reads this.

Try this. This way you capture whole numbers, not just single digits.

    // \d+ (rather than \d*) requires at least one digit on each side of "out of".
    Regex reg1 = new Regex(@"(\d+ (out of) \d+)");
    for (int i = 0; i < number; i++)
    {
      Console.WriteLine("the heading of the movie is {0}", header[i].InnerHtml);
      Match m = reg1.Match(header[i].InnerHtml);

      if (!m.Success)
      {
          return;
      }
      else
      {
          Regex reg2 = new Regex(@"\d+");
          m = reg2.Match(m.Value);
          string str1 = m.Value;
          string str2 = m.NextMatch().Value;

          if (!Int32.TryParse(str1, out index1))
          {
              return;
          }
          if (!Int32.TryParse(str2, out index2))
          {
              return;
          }
          Console.WriteLine("index1 = {0}", index1);
          Console.WriteLine("index2 = {0}", index2);
      }
    }
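
The two-step match above can also be collapsed into a single pattern with two capture groups. The following is only a minimal sketch, assuming the input is the text of the small tag; the names ReviewVotes and TryParseVotes are hypothetical and not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReviewVotes
{
    // One pattern, two capture groups: group 1 = first number, group 2 = second number.
    private static readonly Regex VotesPattern = new Regex(@"(\d+) out of (\d+)");

    // Hypothetical helper: returns false when the text has no "N out of M" part
    // or when either number does not fit in an int.
    public static bool TryParseVotes(string text, out int index1, out int index2)
    {
        index1 = 0;
        index2 = 0;
        Match m = VotesPattern.Match(text);
        return m.Success
            && int.TryParse(m.Groups[1].Value, out index1)
            && int.TryParse(m.Groups[2].Value, out index2);
    }

    public static void Main()
    {
        string text = "5 out of 5 people found the following review useful:";
        if (TryParseVotes(text, out int index1, out int index2))
        {
            Console.WriteLine("index1 = {0}", index1);
            Console.WriteLine("index2 = {0}", index2);
        }
    }
}
```

Reading the numbers from m.Groups avoids the second regex and the NextMatch() call, and multi-digit values such as "12 out of 30" are handled the same way.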

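If you have the InnerHtml of the small tag, the digits can also be extracted like this (note that this yields individual digit characters, so a multi-digit number such as 12 comes back as '1' and '2'):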

var title = "5 out of 5 people found the following review useful:";
var titleNumbers = title.ToCharArray().Where(x => Char.IsNumber(x));

EDIT

As @PulseLab suggests, here is an alternate method:

// Split the title on spaces and keep only the tokens that parse as integers
// (this also keeps 0, e.g. "0 out of 5").
var sd = title.Split(' ')
        .Where(data => int.TryParse(data, out _))
        .ToArray();
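
Since the question asks about HAP, getting the text of the small tags in the first place might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the HTML is already available as a string; the names ReviewScraper and ExtractVotes are hypothetical, and the XPath "//small" is an assumption based only on the fragment shown in the question, not on the full IMDb page.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class ReviewScraper
{
    private static readonly Regex VotesPattern = new Regex(@"(\d+) out of (\d+)");

    // Hypothetical helper: returns one (index1, index2) pair per <small> tag
    // whose text contains "N out of M".
    public static List<Tuple<int, int>> ExtractVotes(string html)
    {
        var result = new List<Tuple<int, int>>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection smalls = doc.DocumentNode.SelectNodes("//small");
        if (smalls == null)
        {
            return result;
        }

        foreach (HtmlNode small in smalls)
        {
            Match m = VotesPattern.Match(small.InnerText);
            if (!m.Success)
            {
                continue; // skip tags such as "from Ohio" or the date
            }

            int index1, index2;
            if (int.TryParse(m.Groups[1].Value, out index1) &&
                int.TryParse(m.Groups[2].Value, out index2))
            {
                result.Add(Tuple.Create(index1, index2));
            }
        }
        return result;
    }

    public static void Main()
    {
        // Fragment taken from the question; a real page would be downloaded first.
        string html =
            "<small>5 out of 5 people found the following review useful:</small><br>" +
            "<small>from Ohio</small><br><small>10 March 2005</small>";

        foreach (Tuple<int, int> votes in ExtractVotes(html))
        {
            Console.WriteLine("index1 = {0}", votes.Item1);
            Console.WriteLine("index2 = {0}", votes.Item2);
        }
    }
}
```

Using continue instead of return on a non-matching tag keeps the loop going, so one small tag without votes does not stop the remaining reviews from being processed.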

