
How to extract/parse HTML using Microdata

I am pretty new to Microdata.

I have an HTML string with Microdata. I am trying to figure out whether it's possible to extract the required information dynamically using Microdata with JS or jQuery. Has anyone done this before?

Example HTML string (I am trying to get the 'content' corresponding to itemprop 'ratingValue' for the item whose itemprop 'name' is 'Blendmagic'):

<html>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">Blendmagic</span>
        <span itemprop="price">$19.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">testMagic</span>
        <span itemprop="price">$10.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
</html>

Try beginning at the root itemscope node and filtering the descendant elements that have itemprop attributes; return an object result containing an array items that holds the Microdata items.

This solution is based on the algorithm found in the Microdata specification:

7 Converting HTML to other formats

7.1 JSON

Given a list of nodes nodes in a Document, a user agent must run the following algorithm to extract the microdata from those nodes into a JSON form:

1. Let result be an empty object.

2. Let items be an empty array.

3. For each node in nodes, check if the element is a top-level microdata item, and if it is then get the object for that element and add it to items.

4. Add an entry to result called "items" whose value is the array items.

5. Return the result of serializing result to JSON in the shortest possible way (meaning no whitespace between tokens, no unnecessary zero digits in numbers, and only using Unicode escapes in strings for characters that do not have a dedicated escape sequence), and with a lowercase "e" used, when appropriate, in the representation of any numbers. [JSON]

This algorithm returns an object with a single property that is an array, instead of just returning an array, so that it is possible to extend the algorithm in the future if necessary.
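To make steps 4 and 5 concrete, here is a small sketch of the object one would expect for the first Offer in the question, serialized in compact form. The object literal is written out by hand from the example markup (an assumption, not output captured from a browser); the only API used is `JSON.stringify`, which emits no whitespace between tokens when called without a spacing argument.

```javascript
// Hand-written expected result for the first Offer in the question's HTML.
// Assumption: the values are copied from the markup by hand, not extracted.
const result = {
  items: [
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["Blendmagic"],
        price: ["$19.95"],
        reviews: [
          {
            type: ["http://schema.org/AggregateRating"],
            properties: {
              ratingValue: ["4"],
              bestRating: ["5"],
              ratingCount: ["25"],
            },
          },
        ],
      },
    },
  ],
};

// With no third argument, JSON.stringify inserts no whitespace between tokens.
const json = JSON.stringify(result);
console.log(json);
```

Every property value is wrapped in an array, even when there is only one, because the same property name may appear on several elements of an item.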

When the user agent is to get the object for an item item, optionally with a list of elements memory, it must run the following substeps:

1. Let result be an empty object.

2. If no memory was passed to the algorithm, let memory be an empty list.

3. Add item to memory.

4. If the item has any item types, add an entry to result called "type" whose value is an array listing the item types of item, in the order they were specified on the itemtype attribute.

5. If the item has a global identifier, add an entry to result called "id" whose value is the global identifier of item.

6. Let properties be an empty object.

7. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of an item, run the following substeps:

   1. Let value be the property value of element.

   2. If value is an item, then: If value is in memory, then let value be the string "ERROR". Otherwise, get the object for value, passing a copy of memory, and then replace value with the object returned from those steps.

   3. For each name name in element's property names, run the following substeps:

      1. If there is no entry named name in properties, then add an entry named name to properties whose value is an empty array.

      2. Append value to the entry named name in properties.

8. Add an entry to result called "properties" whose value is the object properties.

9. Return result.
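The "get the object" substeps above can be sketched without a DOM by modelling each item as a plain object. The shape `{ types, id, props: [{ names, value }] }` is a hypothetical stand-in invented for this illustration: `props` plays the role of "the properties of the item", and a value is either a string or another item. The function name `getObject` is likewise an assumption, not part of any browser API.

```javascript
// Hypothetical model of an item: { types, id, props: [{ names, value }] }.
// A property value is either a string or another item object.
function isItem(value) {
  return typeof value === "object" && value !== null;
}

function getObject(item, memory = []) {
  const result = {};
  memory.push(item); // remember this item so cycles can be detected
  if (item.types && item.types.length > 0) result.type = [...item.types];
  if (item.id) result.id = item.id;

  const properties = {};
  for (const prop of item.props) {
    let value = prop.value;
    if (isItem(value)) {
      // An item already on the current path would recurse forever.
      value = memory.includes(value)
        ? "ERROR"
        : getObject(value, memory.slice()); // pass a copy of memory
    }
    for (const name of prop.names) {
      if (!Object.prototype.hasOwnProperty.call(properties, name)) {
        properties[name] = [];
      }
      properties[name].push(value);
    }
  }
  result.properties = properties;
  return result;
}

// Sample data mirroring the first Offer in the question.
const rating = {
  types: ["http://schema.org/AggregateRating"],
  props: [
    { names: ["ratingValue"], value: "4" },
    { names: ["bestRating"], value: "5" },
    { names: ["ratingCount"], value: "25" },
  ],
};
const offer = {
  types: ["http://schema.org/Offer"],
  props: [
    { names: ["name"], value: "Blendmagic" },
    { names: ["price"], value: "$19.95" },
    { names: ["reviews"], value: rating },
  ],
};
// A deliberately cyclic item to exercise the "ERROR" guard.
const loop = { types: [], props: [] };
loop.props.push({ names: ["self"], value: loop });

const extracted = { items: [offer, loop].map((item) => getObject(item)) };
console.log(JSON.stringify(extracted));
```

Passing a copy of `memory` on each recursive call means an item is only flagged when it appears on its own ancestor path, so the same nested item may still be referenced from two sibling properties without being reported as an error.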

var result = {};
var items = [];

// Top-level items have itemscope but no itemprop.
document.querySelectorAll("[itemscope]:not([itemprop])")
  .forEach(function(el) {
    var item = {
      "type": [el.getAttribute("itemtype")],
      "properties": {}
    };
    el.querySelectorAll("[itemprop]").forEach(function(prop) {
      // Skip properties that belong to a nested item rather than to el.
      if (prop.parentElement.closest("[itemscope]") !== el) return;
      var name = prop.getAttribute("itemprop");
      if (prop.matches("[itemscope]")) {
        // The property value is itself an item (one level of nesting only).
        var _item = {
          "type": [prop.getAttribute("itemtype")],
          "properties": {}
        };
        prop.querySelectorAll("[itemprop]").forEach(function(_prop) {
          _item.properties[_prop.getAttribute("itemprop")] = [
            _prop.content || _prop.textContent || _prop.src
          ];
        });
        item.properties[name] = [_item];
      } else {
        item.properties[name] = [
          prop.content || prop.textContent || prop.src
        ];
      }
    });
    items.push(item);
  });

result.items = items;
console.log(result);
document.body.insertAdjacentHTML("beforeend",
  "<pre>" + JSON.stringify(result, null, 2) + "</pre>");

// Get the 'content' corresponding to itemprop 'ratingValue'
// for the item whose itemprop 'name' is 'Blendmagic'.
var props = ["Blendmagic", "ratingValue"];
var data = result.items.filter(function(value) {
  return value.properties.name && value.properties.name[0] === props[0];
}).map(function(value) {
  var prop = value.properties.reviews[0].properties;
  var res = {}, _props = {};
  _props[props[1]] = prop[props[1]];
  res[props[0]] = _props;
  return res;
})[0];
console.log(data);
// Show the extracted value above the full result.
document.querySelector("pre").insertAdjacentHTML("beforebegin",
  "<pre>" + JSON.stringify(data, null, 2) + "</pre>");
<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">Blendmagic</span>
        <span itemprop="price">$19.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img data-src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">testMagic</span>
        <span itemprop="price">$10.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img data-src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
</body>
</html>
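Once a result object of this shape exists, the lookup the question asks for can be written so that it does not depend on the position of the matching item in the array. The following is a minimal sketch: `findRating` is a hypothetical helper name, and `sample` is hand-written data in the same `{ items: [{ type, properties }] }` shape, trimmed to the properties needed here.

```javascript
// Hypothetical helper: find the item whose "name" property equals itemName
// and return the first value of ratingProp from its nested "reviews" item.
function findRating(result, itemName, ratingProp) {
  const item = result.items.find(
    (it) => it.properties.name && it.properties.name[0] === itemName
  );
  if (!item || !item.properties.reviews) return undefined;
  const review = item.properties.reviews[0];
  const values = review && review.properties && review.properties[ratingProp];
  return values ? values[0] : undefined;
}

// Hand-written sample in the extracted shape (an assumption, not real output).
const sample = {
  items: [
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["Blendmagic"],
        reviews: [
          {
            type: ["http://schema.org/AggregateRating"],
            properties: { ratingValue: ["4"], bestRating: ["5"] },
          },
        ],
      },
    },
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["testMagic"],
        reviews: [
          {
            type: ["http://schema.org/AggregateRating"],
            properties: { ratingValue: ["4"], bestRating: ["5"] },
          },
        ],
      },
    },
  ],
};

console.log(findRating(sample, "Blendmagic", "ratingValue"));
```

Returning `undefined` for a missing item or property keeps the caller from having to guard every level of the nested structure.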

See also Recursion and loops of Microdata items.

Check this Fiddle:

$("span[itemprop='name']").each(function(index, el) {
    if ($(el).text() === 'Blendmagic') {
        // Look up ratingValue inside the same item rather than by position.
        alert($(el).closest('[itemscope]')
            .find("meta[itemprop='ratingValue']").attr('content'));
    }
});


 