
Reading specific text from a website

I am trying to build a database, but I need to get information from a website, mainly the title, date, length and genre from the IMDb website. I have tried something like 50 different approaches and none of them work. Here is my code.

public string GetName(string URL)
{
    HtmlWeb web = new HtmlWeb();
    HtmlDocument doc = web.Load(URL);

    var Attr = doc.DocumentNode.SelectNodes("//*[@id=\"overview - top\"]/h1/span[1]@itemprop")[0];

    return Name;
}

When I run this, it just throws an XPathException. I only want it to return the title of a movie. For now I am using this movie as an example for testing, but I want it to work with all movies: http://www.imdb.com/title/tt0405422. I am using the HtmlAgilityPack.

I made something similar, and this is my code, which gets info from the imdb.com website:

            string html = getUrlData(imdbUrl + "combined");
            Id = match(@"<link rel=""canonical"" href=""http://www.imdb.com/title/(tt\d{7})/combined"" />", html);
            if (!string.IsNullOrEmpty(Id))
            {
                status = true;
                Title = match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
                OriginalTitle = match(@"title-extra"">(.*?)<", html);
                Year = match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", html);
                Rating = match(@"<b>(\d\.\d)/10</b>", html);
                Genres = matchAll(@"<a.*?>(.*?)</a>", match(@"Genre.?:(.*?)(</div>|See more)", html));
                Directors = matchAll(@"<td valign=""top""><a.*?href=""/name/.*?/"">(.*?)</a>", match(@"Directed by</a></h5>(.*?)</table>", html));
                Cast = matchAll(@"<td class=""nm""><a.*?href=""/name/.*?/"".*?>(.*?)</a>", match(@"<h3>Cast</h3>(.*?)</table>", html));
                Plot = match(@"Plot:</h5>.*?<div class=""info-content"">(.*?)(<a|</div)", html);
                Runtime = match(@"Runtime:</h5><div class=""info-content"">(\d{1,4}) min[\s]*.*?</div>", html);
                Languages = matchAll(@"<a.*?>(.*?)</a>", match(@"Language.?:(.*?)(</div>|>.?and )", html));
                Countries = matchAll(@"<a.*?>(.*?)</a>", match(@"Country:(.*?)(</div>|>.?and )", html));
                Poster = match(@"<div class=""photo"">.*?<a name=""poster"".*?><img.*?src=""(.*?)"".*?</div>", html);
                if (!string.IsNullOrEmpty(Poster) && Poster.IndexOf("media-imdb.com") > 0)
                {
                    Poster = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY200.jpg");
                    PosterLarge = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY500.jpg");
                    PosterFull = Regex.Replace(Poster, @"_V1.*?\.jpg", "_V1._SY0.jpg");
                }
                else
                {
                    Poster = string.Empty;
                    PosterLarge = string.Empty;
                    PosterFull = string.Empty;
                }
                ImdbURL = "http://www.imdb.com/title/" + Id + "/";
                if (GetExtraInfo)
                {
                    string plotHtml = getUrlData(imdbUrl + "plotsummary");
                }
            }

    //Match single instance
    private string match(string regex, string html, int i = 1)
    {
        return new Regex(regex, RegexOptions.Multiline).Match(html).Groups[i].Value.Trim();
    }

    //Match all instances and return as ArrayList
    private ArrayList matchAll(string regex, string html, int i = 1)
    {
        ArrayList list = new ArrayList();
        foreach (Match m in new Regex(regex, RegexOptions.Multiline).Matches(html))
            list.Add(m.Groups[i].Value.Trim());
        return list;
    }
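
To see how the two helpers behave, here is a minimal sketch that runs the title and year patterns from the code above against a small hand-written string. The sample HTML and the class name `TitleRegexDemo` are assumptions made up for illustration, not a real IMDb response, so the patterns may still need adjusting against the live page.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TitleRegexDemo
{
    // Same idea as the match() helper above: return one capture group, trimmed.
    public static string Match(string pattern, string html, int group = 1)
    {
        return Regex.Match(html, pattern, RegexOptions.Multiline).Groups[group].Value.Trim();
    }

    public static void Main()
    {
        // Hypothetical sample markup, assumed for this sketch only.
        string html = "<title>The 40-Year-Old Virgin (2005)</title>";

        // Group 2 is the title text; group 1 is the optional "IMDb - " prefix.
        string title = Match(@"<title>(IMDb \- )*(.*?) \(.*?</title>", html, 2);
        string year = Match(@"<title>.*?\(.*?(\d{4}).*?\).*?</title>", html);

        Console.WriteLine(title); // The 40-Year-Old Virgin
        Console.WriteLine(year);  // 2005
    }
}
```

When a pattern does not match, `Groups[group].Value` is an empty string, which is why the code above checks results with `string.IsNullOrEmpty`.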

Maybe you will find something useful in it.

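The last part of your XPath is not valid: `span[1]@itemprop` is not legal syntax, because an attribute has to be selected as its own step (`/@itemprop`) or tested inside a predicate (`[@itemprop]`). The id is also `overview-top`, without spaces. Also, to get only a single element from an HtmlDocument, you can use SelectSingleNode() instead of SelectNodes():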

HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://www.imdb.com/title/tt0405422/");

var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";
var span = doc.DocumentNode.SelectSingleNode(xpath);
var title = span.InnerText;

Console.WriteLine(title);

Output:

The 40-Year-Old Virgin

Demo link (*): https://dotnetfiddle.net/P7U5A7

*) The demo shows that the correct title is printed, along with an error specific to .NET Fiddle (you can safely ignore the error).
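
Since the goal is to make this work for every movie page, it may help to guard against the node not being found: `SelectSingleNode()` returns null when nothing matches, and reading `InnerText` on null would throw. The following is only a sketch. It assumes the HtmlAgilityPack NuGet package is referenced, and it parses a hand-written snippet (an assumption, not the real IMDb markup) so that it runs without network access. `TitleXPathDemo` and `GetTitle` are hypothetical names.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, the same library used in the question

public static class TitleXPathDemo
{
    // Returns the title text, or null if the page layout does not match the XPath.
    public static string GetTitle(HtmlDocument doc)
    {
        var xpath = "//*[@id='overview-top']/h1/span[@class='itemprop']";
        var span = doc.DocumentNode.SelectSingleNode(xpath);
        if (span == null)
            return null;

        // Decode entities such as &amp; and trim surrounding whitespace.
        return HtmlEntity.DeEntitize(span.InnerText).Trim();
    }

    public static void Main()
    {
        // Hypothetical markup, assumed for illustration only.
        var doc = new HtmlDocument();
        doc.LoadHtml("<div id='overview-top'><h1><span class='itemprop'>The 40-Year-Old Virgin</span></h1></div>");

        Console.WriteLine(GetTitle(doc) ?? "(title not found)");
    }
}
```

To use it against the live site, replace the `LoadHtml` call with `new HtmlWeb().Load(url)` as in the snippet above.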

