
AngleSharp - HTML attributes to Dictionary

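I would like to parse schema.org HTML from a web page and use it internally as JSON. I am testing AngleSharp for this and have almost got what I need, but there are some problems with the output format. Compare my output with the expected output below.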

string html = @"<div itemscope itemtype='http://schema.org/Organization'>
  <span itemprop='name'>Google.org (GOOG)</span>
  <div itemprop='address' itemscope itemtype='http://schema.org/PostalAddress'>
    Main address:
      <span itemprop='streetAddress'>38 avenue de l'Opera</span>
      <span itemprop='postalCode'>F-75002</span>
      <span itemprop='addressLocality'>Paris, France</span>
  </div>
    Tel:<span itemprop='telephone'>( 33 1) 42 68 53 00 </span>,
    Fax:<span itemprop='faxNumber'>( 33 1) 42 68 53 01 </span>,
    E-mail: <span itemprop='email'>secretariat(at)google.org</span>
 <span itemprop='alumni' itemscope itemtype='http://schema.org/Person'>
   <span itemprop='name'>Jack Dan</span>
 </span>
 <span itemprop='alumni' itemscope itemtype='http://schema.org/Person'>
   <span itemprop='name'>John Smith</span>
 </span>
</div>";

HtmlParser parser = new HtmlParser();
var htmlDocument = parser.ParseDocument(html);
var scopes = htmlDocument.All.Where(x => x.Attributes.Any(a => a.Name == "itemtype")).ToList();
var dictionary = new Dictionary<string, object>();

foreach(var scope in scopes)
{
    var childrens = scope.Children.Where(x => x.Attributes.Any(a => a.Name == "itemprop"));
    foreach (var child in childrens)
    {
        string prop = child.GetAttribute("itemprop");
        if (!dictionary.ContainsKey(prop))
        {
            dictionary.Add(prop, child.TextContent);
        }
    }
}
//dictionary.Dump();
string json = JsonConvert.SerializeObject(dictionary, Newtonsoft.Json.Formatting.Indented);
json.Dump();

My output

{
  "name": "Google.org (GOOG)",
  "address": "\n    Main address:\n      38 avenue de l'Opera\n      F-75002\n      Paris, France\n  ",
  "telephone": "( 33 1) 42 68 53 00 ",
  "faxNumber": "( 33 1) 42 68 53 01 ",
  "email": "secretariat(at)google.org",
  "alumni": "\n   Jack Dan\n ",
  "streetAddress": "38 avenue de l'Opera",
  "postalCode": "F-75002",
  "addressLocality": "Paris, France"
}

Expected

{
  "itemtype": "http://schema.org/Organization",
  "name": "Google.org (GOOG)",  
  "address": {
    "itemtype": "PostalAddress",
    "addressLocality": "Paris, France",
    "postalCode": "F-75002",
    "streetAddress": "38 avenue de l'Opera"
  },
  "email": "secretariat(at)google.org",
  "faxNumber": "( 33 1) 42 68 53 01",
  "telephone": "( 33 1) 42 68 53 00",
  "alumni": [
    {
      "itemtype": "http://schema.org/Person",
      "name": "Jack Dan"
    },
    {
      "itemtype": "http://schema.org/Person",
      "name": "John Smith"
    }
  ]  
}

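.Children does not walk through all children and sub-children to collect every property and attach it to the parent itemscope it belongs to. Any advice on what I should change, and where, to get the output in the expected format?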

Output of Lars's code:

{
  "name": "Google.org (GOOG)",
  "address": "\n\t    Main address:\n\t      38 avenue de l'Opera\n\t      F-75002\n\t      Paris, France\n\t  ",
  "telephone": "( 33 1) 42 68 53 00 ",
  "faxNumber": "( 33 1) 42 68 53 01 ",
  "email": "secretariat(at)google.org",
  "alumni": "\n\t   Jack Dan\n\t "
}
{
  "streetAddress": "38 avenue de l'Opera",
  "postalCode": "F-75002",
  "addressLocality": "Paris, France"
}
{
  "name": "Jack Dan"
}
{
  "name": "John Smith"
}

I won't do all the coding here, but the idea could be a recursive approach:

// Builds a dictionary for one itemscope element (an AngleSharp IElement).
// Nested scopes become nested dictionaries, so the caller serializes only once
// with JsonConvert.SerializeObject instead of embedding JSON strings in JSON.
Dictionary<string, object> GetDictionaryFromItem(IElement scope) {
  var localDict = new Dictionary<string, object>();
  var children = scope.Children.Where(x => x.HasAttribute("itemprop"));
  foreach (var child in children)
  {
      object propValue;
      if (child.HasAttribute("itemscope"))
      {
          // this is the recursion: do the same with the nested scope
          propValue = GetDictionaryFromItem(child);
      } else {
          propValue = child.TextContent;
      }

      string prop = child.GetAttribute("itemprop");
      // note: only the first value of a repeated property is kept here
      if (!localDict.ContainsKey(prop))
      {
          localDict.Add(prop, propValue);
      }
  }
  return localDict;
}

Then call this code from the outermost loop.

Here is my solution.

async Task Main()
{
    var html = @"<div itemscope itemtype='http://schema.org/Organization'>
  <span itemprop='name'>Google.org (GOOG)</span>
  <div itemprop='address' itemscope itemtype='http://schema.org/PostalAddress'>
    Main address:
      <span itemprop='streetAddress'>38 avenue de l'Opera</span>
      <span itemprop='postalCode'>F-75002</span>
      <span itemprop='addressLocality'>Paris, France</span>
  </div>
    Tel:<span itemprop='telephone'>( 33 1) 42 68 53 00 </span>,
    Fax:<span itemprop='faxNumber'>( 33 1) 42 68 53 01 </span>,
    E-mail: <span itemprop='email'>secretariat(at)google.org</span>
 <span itemprop='alumni' itemscope itemtype='http://schema.org/Person'>
   <span itemprop='name'>Jack Dan</span>
 </span>
 <span itemprop='alumni' itemscope itemtype='http://schema.org/Person'>
   <span itemprop='name'>John Smith</span>
 </span>
</div>";
    var context = BrowsingContext.New();
    var document = await context.OpenAsync(res => res.Content(html));
    var result = Parse(document.QuerySelector("[itemscope]"));
    var json = JsonConvert.SerializeObject(result, Newtonsoft.Json.Formatting.Indented);
    json.Dump();
}

void Populate(IElement element, Dictionary<string, object> result)
{
    foreach (var child in element.Children)
    {
        var prop = child.GetAttribute("itemprop");

        if (prop != null)
        {
            var scope = child.GetAttribute("itemscope");
            var value = default(Object);

            if (scope != null)
            {
                value = Parse(child);
            }
            else
            {
                value = child.TextContent;
            }

            if (result.TryGetValue(prop, out var item))
            {
                if (item is List<Object> list)
                {
                    list.Add(value);
                }
                else
                {
                    result[prop] = new List<Object>
                    {
                        item,
                        value
                    };
                }
            }
            else
            {
                result[prop] = value;
            }
        }
        else
        {
            Populate(child, result);
        }
    }
}

Object Parse(IElement element)
{
    var result = new Dictionary<string, object>();
    result["itemtype"] = element.GetAttribute("itemtype");
    Populate(element, result);
    return result;
}

Not sure if I got everything right, but my output looks like this:

{
  "itemtype": "http://schema.org/Organization",
  "name": "Google.org (GOOG)",
  "address": {
    "itemtype": "http://schema.org/PostalAddress",
    "streetAddress": "38 avenue de l'Opera",
    "postalCode": "F-75002",
    "addressLocality": "Paris, France"
  },
  "telephone": "( 33 1) 42 68 53 00 ",
  "faxNumber": "( 33 1) 42 68 53 01 ",
  "email": "secretariat(at)google.org",
  "alumni": [
    {
      "itemtype": "http://schema.org/Person",
      "name": "Jack Dan"
    },
    {
      "itemtype": "http://schema.org/Person",
      "name": "John Smith"
    }
  ]
}

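This matches the expected output. The code for Parse may not be very elegant, but at least it gets the result in a single pass (no additional QuerySelector calls are needed).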
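If the nested if/else in Populate feels heavy, the "first value stays a scalar, repeated values become a list" logic could be pulled out into a small helper. This is only a sketch: MicrodataMerge and AddProperty are made-up names, not part of AngleSharp or Json.NET, and the helper assumes no property value is itself a List<object>.

```csharp
using System;
using System.Collections.Generic;

static class MicrodataMerge
{
    // Hypothetical helper: stores value under key; when the key already exists,
    // the entry is promoted to (or extended as) a List<object>.
    public static void AddProperty(Dictionary<string, object> result, string key, object value)
    {
        if (!result.TryGetValue(key, out var existing))
        {
            // first occurrence: keep the plain value
            result[key] = value;
        }
        else if (existing is List<object> list)
        {
            // third or later occurrence: append to the existing list
            list.Add(value);
        }
        else
        {
            // second occurrence: wrap the old and the new value in a list
            result[key] = new List<object> { existing, value };
        }
    }
}
```

With that in place, the whole TryGetValue block in Populate would shrink to a single call such as MicrodataMerge.AddProperty(result, prop, value), which is how the two alumni entries end up in one array.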

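I am not sure what the expected inputs are, but most likely you will need to add some guards against weird HTML or input (e.g., checking that an itemtype is actually available when itemscope is set).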
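As a rough idea of what such guards could look like, here is a minimal sketch. MicrodataGuards, GetItemTypeOrNull and GetText are hypothetical names introduced for illustration; the only AngleSharp members used are IElement.GetAttribute and TextContent. Trimming the text is an extra assumption on my part, based on the expected output in the question showing the telephone and fax values without the trailing space.

```csharp
using System;
using AngleSharp.Dom;

static class MicrodataGuards
{
    // Returns the itemtype value, or null when the attribute is missing or blank,
    // so the caller can decide to skip the "itemtype" key entirely.
    public static string GetItemTypeOrNull(IElement element)
    {
        var itemType = element.GetAttribute("itemtype");
        return string.IsNullOrWhiteSpace(itemType) ? null : itemType.Trim();
    }

    // Returns the element text without leading/trailing whitespace.
    public static string GetText(IElement element)
    {
        return (element.TextContent ?? string.Empty).Trim();
    }
}
```

In Parse, the itemtype line could then become something like: var type = MicrodataGuards.GetItemTypeOrNull(element); if (type != null) result["itemtype"] = type; and Populate could call MicrodataGuards.GetText(child) instead of reading child.TextContent directly.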

Hope that helps!
