
HTML parsing in C# - classification

I am working on sentiment classification and am parsing data from a local movie database. The problem is that it has three forms of user evaluation: one with stars (realized in ...), one marked "rubbish", and one where the user gave neither stars nor a "rubbish" rating. Here is the main link: http://www.csfd.cz/film/7049-playgirls/?all=1 (you need to check the page source). Here is an example in which you can see all three kinds of user evaluation:

</li>
<li id="comment-8356897">
    <h5 class="author"><a href="/uzivatel/138463-campbell/">Campbell</a></h5>
    <img src="http://img.csfd.cz/assets/images/rating/stars/2.gif" class="rating" width="16" alt="**" />
    <div class="info">
        <a href="/uzivatel/138463-campbell/komentare/">všechny komentáře uživatele</a></div>
    <p class="post">Ale jo:-D Když jsem viděl že tenhle film je na prvním místě mezi největšíma sračkama na CSFD, a tak jsem se zhrozil a abych si utrpení ještě vylepšil, tak jsem si pustil oba dva díly naráz. No hell to celkem bylo ale ne nic extrémní. Viděl jsem větší shity. V tomhle filmu jsem měl děsnej problém fandit někomu fandit protože to moc nejde. Šílenost, Ale ne nejhorší.<span class="date desc">(11.3.2011)</span></p>
</li>
<li id="comment-872277">
    <h5 class="author"><a href="/uzivatel/48974-fleker/">fleker</a></h5>

    <div class="info">
        <a href="/uzivatel/48974-fleker/komentare/">všechny komentáře uživatele</a></div>
    <p class="post">tak na todle rači ani koukat nebudu; hodnocení to má slušný ale nechci riskovat aby mi vyschla mícha<span class="date desc">(29.7.2009)</span></p>
</li>
<li id="comment-327360">
    <h5 class="author"><a href="/uzivatel/41698-ozo/">Ozo</a></h5>
    <strong class="rating">odpad!</strong>
    <div class="info">
        <a href="/uzivatel/41698-ozo/komentare/">všechny komentáře uživatele</a></div>
    <p class="post">Změna názoru - tohle si jednu hvězdičku nezaslouží =(<span class="date desc">(29.7.2007)</span></p>
</li>

Thanks a lot. My plan was to do it like this:

string srxPathOfCategory = "//ul[@class='ui-posts-list']//li//img[@class='rating'] | //ul[@class='ui-posts-list']//li//strong[@class='rating']";
foreach (var att in doc.DocumentNode.SelectNodes(srxPathOfCategory))
{
    if (att.InnerText == "odpad!") // "odpad" means rubbish
    {
        b[j] = att.InnerText; // saving "odpad!" for later use
    }
    if (att.Attributes["alt"] != null)
    {
        b[j] = att.Attributes["alt"].Value; // these values range from * to *****
    }
    if (att.InnerText != "odpad!" && att.Attributes["alt"] == null) // this is where the problem starts
    {
        b[j] = "without user evaluation";
    }

    j++;
}

The problem with this code is that if it fails to find att.InnerText == "odpad!" or att.Attributes["alt"] != null, it continues to the next post and takes the user evaluation from there. But I would like to record at least something for a post where the evaluation was omitted.

Thanks for all the help, but the problem was in the XPath for the HTML: it selected only the img and strong rating nodes, so a post without a rating never reached the loop.

I solved it like this:

string srxPathOfCategory = "//ul[@class='ui-posts-list']//li";

foreach (var att in doc.DocumentNode.SelectNodes(srxPathOfCategory))
{
    // Skip the first three child nodes: a whitespace #text node, the h5,
    // and another whitespace #text node. Skip() needs "using System.Linq;".
    foreach (var child in att.ChildNodes.Skip(3))
    {
        // Every branch ends in break, so only the first remaining child is inspected.
        if (child.InnerText == "odpad!")
        {
            b[j] = child.InnerText;
            Console.WriteLine(b[j]);
            Console.ReadKey();
            break;
        }
        else if (child.Attributes["alt"] != null)
        {
            b[j] = child.Attributes["alt"].Value;
            Console.WriteLine(b[j]);
            Console.ReadKey();
            break;
        }
        else
        {
            b[j] = "without user evaluation";
            Console.WriteLine("hlupost");
            Console.ReadKey();
            break;
        }
    }
    j++;
}
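
Because every branch in the loop above ends in break, it only ever looks at the fourth child node of each li, so it depends on the exact whitespace layout of the page. A less position-dependent variant is sketched below. It assumes that doc is an HtmlAgilityPack HtmlDocument (the thread never names the library, but DocumentNode and SelectNodes match it); RatingExtractor and ExtractRatings are hypothetical names introduced only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package; assumed to be the parser behind `doc`

static class RatingExtractor
{
    // Hypothetical helper: returns exactly one label per <li>, in document order.
    public static List<string> ExtractRatings(HtmlDocument doc)
    {
        var labels = new List<string>();

        // In HtmlAgilityPack, SelectNodes returns null when nothing matches.
        var comments = doc.DocumentNode.SelectNodes("//ul[@class='ui-posts-list']//li");
        if (comments == null)
            return labels;

        foreach (HtmlNode comment in comments)
        {
            // These XPaths are relative to the current <li>, so a comment
            // without a rating cannot pick up the rating of the next one.
            HtmlNode rubbish = comment.SelectSingleNode("./strong[@class='rating']");
            HtmlNode stars = comment.SelectSingleNode("./img[@class='rating']");

            if (rubbish != null)
                labels.Add(rubbish.InnerText.Trim());           // "odpad!"
            else if (stars != null)
                labels.Add(stars.GetAttributeValue("alt", "")); // "*" to "*****"
            else
                labels.Add("without user evaluation");
        }
        return labels;
    }
}

static class Program
{
    static void Main()
    {
        // Reduced stand-in for the page: starred, unrated, and "odpad!" comments.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<ul class='ui-posts-list'>" +
            "<li><h5>a</h5><img class='rating' alt='**' /></li>" +
            "<li><h5>b</h5></li>" +
            "<li><h5>c</h5><strong class='rating'>odpad!</strong></li>" +
            "</ul>");

        foreach (string label in RatingExtractor.ExtractRatings(doc))
            Console.WriteLine(label);
    }
}
```

Returning a list instead of writing into b[j] keeps the label count equal to the number of comments by construction, which is the property the original union XPath lost.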

"odpad!" is not in an attribute, it's in an element.

What if you change your if statements? Why do you have three separate if statements when only one of them can be true?

// Is it "odpad"?
if (att.InnerText == "odpad!")
{
    b[j] = att.InnerText;
}
// If not, is it starred?
else if (att.Attributes["alt"] != null)
{
    b[j] = att.Attributes["alt"].Value;
}
// If none of the above, it must be this (default)
else
{
    b[j] = "without user evaluation";
}
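
The chain above can also be pulled into a small pure function, which makes the three cases easy to check without a parsed page. The sketch below is only an illustration: RatingLabel, Classify, and ToScore are hypothetical names, and the numeric mapping (0 for "odpad!", 1 to 5 for stars, null when unrated) is an assumption about what a later sentiment classifier might want, not something stated in the thread.

```csharp
using System;

static class RatingLabel
{
    public const string NoEvaluation = "without user evaluation";

    // innerText: text content of the node; alt: its alt attribute value, or null.
    public static string Classify(string innerText, string alt)
    {
        if (innerText == "odpad!")
            return innerText;    // the "rubbish" rating
        else if (alt != null)
            return alt;          // "*" to "*****"
        else
            return NoEvaluation; // default when neither form is present
    }

    // Assumed mapping for later use: 0 = "odpad!", 1-5 = star count, null = unrated.
    public static int? ToScore(string label)
    {
        if (label == "odpad!")
            return 0;
        bool onlyStars = label.Length >= 1 && label.Length <= 5
                         && label.Trim('*').Length == 0;
        if (onlyStars)
            return label.Length;
        return null;
    }
}

static class Program
{
    static void Main()
    {
        Console.WriteLine(RatingLabel.Classify("odpad!", null)); // odpad!
        Console.WriteLine(RatingLabel.Classify("", "**"));       // **
        Console.WriteLine(RatingLabel.Classify("", null));       // without user evaluation
        Console.WriteLine(RatingLabel.ToScore("**"));            // 2
    }
}
```

Note that the else-if chain only changes which branch runs for a given node. It still needs one node per comment as input, so it does not by itself fix a query that skips unrated comments.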
