Reading specific text from a website

Question

I am trying to build a database, but I need to get information from a website, mainly the title, date, length and genre from the IMDb website. I have tried about 50 different things and nothing works. Here is my code.

public string GetName(string URL)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(URL);

    var Attr = doc.DocumentNode.SelectNodes("//*[@id=\"overview - top\"]/h1/span[1]@itemprop")[0];

    return Name;
}

When I run this, it just throws an XPathException. I only want it to return the title of a movie. For now I am testing with one movie as an example, http://www.imdb.com/title/tt0405422, but I want it to work with all movies. I am using the HtmlAgilityPack.

Answer 1

I made something similar, and this is my code, which gets information from the imdb.com website:

string html = getUrlData(imdbUrl + "combined");
            Id = match(@"<link rel=""canonical"" href=""http://www.imdb.com/title/(tt\d{7})/combined"" />", html);
            if (!string.IsNullOrEmpty(Id))
            {
                status = true;
                Title = match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
                OriginalTitle = match(@"title-extra"">(.*?)<", html);
                Year = match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", html);
                Rating = match(@"<b>(\d\.\d)/10</b>", html);
                Genres = matchAll(@"<a.*?>(.*?)</a>", match(@"Genre.?:(.*?)(</div>|See more)", html));
                Directors = matchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Directed by</a></h5>(.*?)</table>", html));
                Cast = matchAll(@"<td class=""nm""><a.*?href=""/name/.*?/"".*?>(.*?)</a>", match(@"<h3>Cast</h3>(.*?)</table>", html));
                Plot = match(@"Plot:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
                Runtime = match(@"Runtime:</h5><div class=""info-content"">(\d{1,4}) min[\s]*.*?</div>", html);
                Languages = matchAll(@"<a.*?>(.*?)</a>", match(@"Language.?:(.*?)(</div>|>.?and )", html));
                Countries = matchAll(@"<a.*?>(.*?)</a>", match(@"Country:(.*?)(</div>|>.?and )", html));
                Poster = match(@"<div class=""photo"">.*?<a name=""poster"".*?><img.*?src=""(.*?)"".*?</div>", html);
                if (!string.IsNullOrEmpty(Poster) && Poster.IndexOf("media-imdb.com") > 0)
                {
                    Poster = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY200.jpg");
                    PosterLarge = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY500.jpg");
                    PosterFull = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY0.jpg");
                }
                else
                {
                    Poster = string.Empty;
                    PosterLarge = string.Empty;
                    PosterFull = string.Empty;
                }
                ImdbURL = "http://www.imdb.com/title/" + Id + "/";
                if (GetExtraInfo)
                {
                    string plotHtml = getUrlData(imdbUrl + "plotsummary");
                }
            }

    //Match single instance
    private string match(string regex, string html, int i = 1)
    {
        return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
    }

    //Match all instances and return as ArrayList
    private ArrayList matchAll(string regex, string html, int i = 1)
    {
        ArrayList list = new ArrayList();
        foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
            list.Add(m.Groups[i].Value.Trim());
        return list;
    }

Maybe you will find something useful in it.
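
To try the two helpers without hitting the network, here is a minimal sketch that runs them against a small inline string. The sample markup, the class name RegexScrapeSketch and the method names MatchFirst and MatchAll are made up for illustration and are not a real IMDb response. One deliberate change from the code above: the sketch uses RegexOptions.Singleline instead of RegexOptions.Multiline, because Multiline only changes how ^ and $ behave, while Singleline lets . match a newline, which matters as soon as the HTML spans several lines. It also returns a List<string> rather than an ArrayList.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexScrapeSketch
{
    // Returns capture group i of the first match, or "" when nothing matches.
    public static string MatchFirst(string pattern, string html, int i = 1)
    {
        return Regex.Match(html, pattern, RegexOptions.Singleline).Groups[i].Value.Trim();
    }

    // Returns capture group i of every match.
    public static List<string> MatchAll(string pattern, string html, int i = 1)
    {
        var list = new List<string>();
        foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
            list.Add(m.Groups[i].Value.Trim());
        return list;
    }

    // Hypothetical sample markup, only shaped like the page the patterns expect.
    public const string SampleHtml =
        "<title>Example Movie (2005)</title>\n" +
        "<h5>Genre:</h5>\n" +
        "<div><a href=\"/g/1\">Comedy</a> | <a href=\"/g/2\">Romance</a></div>";

    public static void Main()
    {
        string title = MatchFirst(@"<title>(IMDb \- )*(.*?) \(.*?</title>", SampleHtml, 2);
        string year = MatchFirst(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", SampleHtml);
        List<string> genres = MatchAll(@"<a.*?>(.*?)</a>",
            MatchFirst(@"Genre.?:(.*?)(</div>|See more)", SampleHtml));

        Console.WriteLine(title);                     // Example Movie
        Console.WriteLine(year);                      // 2005
        Console.WriteLine(string.Join(", ", genres)); // Comedy, Romance
    }
}
```

Keep in mind that patterns like these are tied to one specific page layout, so they will probably need adjusting whenever the site's markup changes.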

Answer 2

The last part of your XPath is not valid. Also, to get a single element from an HtmlDocument you can use SelectSingleNode() instead of SelectNodes():

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt0405422/");

var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";
var span = doc.DocumentNode.SelectSingleNode(xpath);
var title = span.InnerText;

Console.WriteLine(title);

Output:

The 40-Year-Old Virgin

Demo link: *

https://dotnetfiddle.net/P7U5A7

*) The demo shows that the correct title is printed, along with an error specific to .NET Fiddle (you can safely ignore the error).
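
To see why the original expression fails, here is a small sketch that uses only the XPath classes built into .NET (System.Xml.XPath and System.Xml.Linq) instead of HtmlAgilityPack, so it runs without any package or network access. The class name XPathSketch, the helper names and the inline markup are assumptions made for illustration; the markup is a tiny well-formed stand-in, not the real IMDb page. It checks that span[1]@itemprop is rejected as a syntax error, and that the corrected expression selects the span. Note also that the id in the question was written as "overview - top" with spaces, which would not match the id overview-top even after the syntax is fixed.

```csharp
using System;
using System.Xml.Linq;
using System.Xml.XPath;

public static class XPathSketch
{
    // True when the expression compiles, false when it is a syntax error.
    public static bool IsValidXPath(string xpath)
    {
        try
        {
            XPathExpression.Compile(xpath);
            return true;
        }
        catch (XPathException)
        {
            return false;
        }
    }

    // Returns the text of the first matching element, or null when nothing matches.
    public static string TitleOrNull(XDocument doc, string xpath)
    {
        XElement span = doc.XPathSelectElement(xpath);
        return span == null ? null : span.Value.Trim();
    }

    // Hypothetical well-formed stand-in for the relevant part of the page.
    public const string SampleXml =
        "<html><body><div id='overview-top'><h1>" +
        "<span class='itemprop'>The 40-Year-Old Virgin</span>" +
        "</h1></div></body></html>";

    public static void Main()
    {
        string broken = "//*[@id=\"overview - top\"]/h1/span[1]@itemprop";
        string corrected = "//*[@id='overview-top']/h1/span[@class='itemprop']";
        XDocument doc = XDocument.Parse(SampleXml);

        Console.WriteLine(IsValidXPath(broken));     // False
        Console.WriteLine(IsValidXPath(corrected));  // True
        Console.WriteLine(TitleOrNull(doc, corrected) ?? "(no match)");
    }
}
```

The null check matters with HtmlAgilityPack as well: SelectSingleNode() returns null when nothing matches, so reading span.InnerText without a check throws a NullReferenceException if the page layout differs from what the XPath expects.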

Answer 1: posted 2015-12-06 11:50:44
Answer 2 (accepted): posted 2015-12-06 11:55:19
