
How to extract/parse HTML using Microdata

I am pretty new to Microdata.

I have an HTML string that contains Microdata. I am trying to figure out whether the required information can be extracted dynamically from the Microdata with JS or jQuery. Has anyone done this before?

Example HTML string (I am trying to get the 'content' of the itemprop 'ratingValue' for the item whose itemprop 'name' is 'Blendmagic'):

<html>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">Blendmagic</span>
        <span itemprop="price">$19.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">testMagic</span>
        <span itemprop="price">$10.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
</html>

Try starting at each root itemscope node, filtering its descendant elements that have an itemprop attribute, and returning an object, result, whose items array holds the Microdata items.

This solution is based on the algorithm found in the Microdata specification:

7 Converting HTML to other formats

7.1 JSON

Given a list of nodes nodes in a Document, a user agent must run the following algorithm to extract the microdata from those nodes into a JSON form:

1. Let result be an empty object.

2. Let items be an empty array.

3. For each node in nodes, check if the element is a top-level microdata item, and if it is then get the object for that element and add it to items.

4. Add an entry to result called "items" whose value is the array items.

5. Return the result of serializing result to JSON in the shortest possible way (meaning no whitespace between tokens, no unnecessary zero digits in numbers, and only using Unicode escapes in strings for characters that do not have a dedicated escape sequence), and with a lowercase "e" used, when appropriate, in the representation of any numbers. [JSON]

This algorithm returns an object with a single property that is an array, instead of just returning an array, so that it is possible to extend the algorithm in the future if necessary.
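The serialization step maps fairly directly onto JSON.stringify when it is called without an indentation argument. A minimal sketch, assuming a hand-written result object (the sample values are taken from the question, not produced by a real extraction):

```javascript
// Hand-written stand-in for the object built by the algorithm above.
const result = {
  items: [{ type: ["http://schema.org/Offer"], properties: { name: ["Blendmagic"] } }]
};

// No third argument: no whitespace is emitted between tokens.
const compact = JSON.stringify(result);
console.log(compact);

// Numbers lose unnecessary zero digits, and large values use a lowercase "e".
const numbers = JSON.stringify({ count: 25.0, huge: 1e21 });
console.log(numbers); // {"count":25,"huge":1e+21}
```

Passing a third argument, as in JSON.stringify(result, null, 2), is handy for display but is not the shortest form the algorithm asks for.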

When the user agent is to get the object for an item item, optionally with a list of elements memory, it must run the following substeps:

1. Let result be an empty object.

2. If no memory was passed to the algorithm, let memory be an empty list.

3. Add item to memory.

4. If the item has any item types, add an entry to result called "type" whose value is an array listing the item types of item, in the order they were specified on the itemtype attribute.

5. If the item has a global identifier, add an entry to result called "id" whose value is the global identifier of item.

6. Let properties be an empty object.

7. For each element element that has one or more property names and is one of the properties of the item item, in the order those elements are given by the algorithm that returns the properties of an item, run the following substeps:

    1. Let value be the property value of element.

    2. If value is an item, then: If value is in memory, then let value be the string "ERROR". Otherwise, get the object for value, passing a copy of memory, and then replace value with the object returned from those steps.

    3. For each name name in element's property names, run the following substeps:

        1. If there is no entry named name in properties, then add an entry named name to properties whose value is an empty array.

        2. Append value to the entry named name in properties.

8. Add an entry to result called "properties" whose value is the object properties.

9. Return result.
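The steps quoted above can be sketched roughly as follows. This is a simplified illustration, not a complete implementation: itemref is ignored, only a subset of the property value rules is handled, and URLs are returned as written. The helper el() is hypothetical and exists only to build DOM-like test objects so the sketch can run outside a browser; the other function names are also my own.

```javascript
// Split a space-separated attribute (itemprop, itemtype) into tokens.
function tokens(value) {
  return (value || "").split(/\s+/).filter(Boolean);
}

// Subset of the "property value" rules; a nested item is returned as the element itself.
function propertyValue(element) {
  if (element.hasAttribute("itemscope")) return element;
  switch (element.tagName.toLowerCase()) {
    case "meta": return element.getAttribute("content") || "";
    case "img": case "audio": case "video": case "source":
      return element.getAttribute("src") || "";
    case "a": case "area": case "link":
      return element.getAttribute("href") || "";
    default: return element.textContent;
  }
}

// Property elements of an item in tree order, without descending into nested items.
function propertiesOf(root) {
  const found = [];
  (function crawl(parent) {
    for (const child of parent.children) {
      if (child.hasAttribute("itemprop")) found.push(child);
      if (!child.hasAttribute("itemscope")) crawl(child);
    }
  })(root);
  return found;
}

// "Get the object for an item"; memory guards against cycles.
function getObject(item, memory = []) {
  const result = {};
  const seen = [...memory, item]; // a copy of memory with item added
  const types = tokens(item.getAttribute("itemtype"));
  if (types.length) result.type = types;
  if (item.hasAttribute("itemid")) result.id = item.getAttribute("itemid");
  const properties = {};
  for (const element of propertiesOf(item)) {
    let value = propertyValue(element);
    if (typeof value !== "string") {
      value = seen.includes(value) ? "ERROR" : getObject(value, seen);
    }
    for (const name of tokens(element.getAttribute("itemprop"))) {
      if (!Object.prototype.hasOwnProperty.call(properties, name)) properties[name] = [];
      properties[name].push(value);
    }
  }
  result.properties = properties;
  return result;
}

// Top-level algorithm: keep only items that are not themselves properties.
function extractMicrodata(nodes) {
  const items = [];
  for (const node of nodes) {
    if (node.hasAttribute("itemscope") && !node.hasAttribute("itemprop")) {
      items.push(getObject(node));
    }
  }
  return JSON.stringify({ items });
}

// Hypothetical stand-in for a DOM element, exposing only the members used above.
function el(tagName, attrs = {}, ...content) {
  return {
    tagName: tagName.toUpperCase(),
    children: content.filter((c) => typeof c !== "string"),
    hasAttribute: (name) => name in attrs,
    getAttribute: (name) => (name in attrs ? attrs[name] : null),
    get textContent() {
      return content.map((c) => (typeof c === "string" ? c : c.textContent)).join("");
    },
  };
}

const offer = el("div", { itemscope: "", itemtype: "http://schema.org/Offer" },
  el("span", { itemprop: "name" }, "Blendmagic"),
  el("span", { itemprop: "price" }, "$19.95"),
  el("div", { itemprop: "reviews", itemscope: "", itemtype: "http://schema.org/AggregateRating" },
    el("meta", { itemprop: "ratingValue", content: "4" }),
    el("meta", { itemprop: "bestRating", content: "5" }),
    "Based on ", el("span", { itemprop: "ratingCount" }, "25"), " user ratings"));

const json = extractMicrodata([offer]);
const rating = JSON.parse(json).items[0].properties.reviews[0].properties.ratingValue[0];
console.log(rating); // "4"
```

In a browser, the same functions should accept real elements, for example extractMicrodata(document.querySelectorAll("[itemscope]")), since they only rely on standard DOM members.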

var result = {};
var items = [];

// Append a value under `name`, creating the array on first use.
function addProp(properties, name, value) {
  (properties[name] = properties[name] || []).push(value);
}

// Top-level items have itemscope but no itemprop.
document.querySelectorAll("[itemscope]:not([itemprop])").forEach(function(el) {
  var item = {
    "type": [el.getAttribute("itemtype")],
    "properties": {}
  };
  el.querySelectorAll("[itemprop]").forEach(function(prop) {
    // Skip properties that belong to a nested item; they are collected below.
    if (prop.parentNode.closest("[itemscope]") !== el) return;
    var name = prop.getAttribute("itemprop");
    if (prop.matches("[itemscope]")) {
      // Nested item (handled one level deep): collect its own properties.
      var _item = {
        "type": [prop.getAttribute("itemtype")],
        "properties": {}
      };
      prop.querySelectorAll("[itemprop]").forEach(function(_prop) {
        addProp(_item.properties, _prop.getAttribute("itemprop"),
          _prop.content || _prop.textContent || _prop.src);
      });
      addProp(item.properties, name, _item);
    } else {
      addProp(item.properties, name, prop.content || prop.textContent || prop.src);
    }
  });
  items.push(item);
});

result.items = items;
console.log(result);
document.body.insertAdjacentHTML("beforeend",
  "<pre>" + JSON.stringify(result, null, 2) + "</pre>");

// Get the 'content' corresponding to itemprop 'ratingValue'
// for the item whose itemprop 'name' is 'Blendmagic'.
var props = ["Blendmagic", "ratingValue"];
var data = result.items.filter(function(value) {
  return value.properties.name && value.properties.name[0] === props[0];
}).map(function(value) {
  var prop = value.properties.reviews[0].properties;
  var res = {}, _props = {};
  _props[props[1]] = prop[props[1]];
  res[props[0]] = _props;
  return res;
})[0];

console.log(data);
document.querySelector("pre").insertAdjacentHTML("beforebegin",
  "<pre>" + JSON.stringify(data, null, 2) + "</pre>");
<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">Blendmagic</span>
        <span itemprop="price">$19.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
    <div itemscope itemtype="http://schema.org/Offer">
        <span itemprop="name">testMagic</span>
        <span itemprop="price">$10.95</span>
        <div itemprop="reviews" itemscope itemtype="http://schema.org/AggregateRating">
            <img src="four-stars.jpg" />
            <meta itemprop="ratingValue" content="4" />
            <meta itemprop="bestRating" content="5" />
            Based on <span itemprop="ratingCount">25</span> user ratings
        </div>
    </div>
</body>
</html>
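Once the Microdata is available in the { items: [...] } shape, looking up a nested value by item name is a plain data query. A small sketch, assuming a hand-written result object with the values from the question; findNestedValue is a hypothetical helper name, not part of any library:

```javascript
// Hand-written result in the JSON shape described by the algorithm (values from the question).
const result = {
  items: [
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["Blendmagic"],
        price: ["$19.95"],
        reviews: [{
          type: ["http://schema.org/AggregateRating"],
          properties: { ratingValue: ["4"], bestRating: ["5"], ratingCount: ["25"] }
        }]
      }
    },
    {
      type: ["http://schema.org/Offer"],
      properties: {
        name: ["testMagic"],
        price: ["$10.95"],
        reviews: [{
          type: ["http://schema.org/AggregateRating"],
          properties: { ratingValue: ["4"], bestRating: ["5"], ratingCount: ["25"] }
        }]
      }
    }
  ]
};

// Find the item with the given name, then read propName from its first nested parentProp item.
// Returns undefined when the item, the nested item, or the property is missing.
function findNestedValue(data, itemName, parentProp, propName) {
  const item = data.items.find(
    (it) => it.properties.name && it.properties.name.includes(itemName)
  );
  if (!item) return undefined;
  const nested = (item.properties[parentProp] || [])[0];
  const values = nested && nested.properties ? nested.properties[propName] : undefined;
  return values ? values[0] : undefined;
}

console.log(findNestedValue(result, "Blendmagic", "reviews", "ratingValue")); // "4"
```

Using find instead of indexing into the mapped array means the lookup does not depend on the matching item being first in the list.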

See also Recursion and loops of Microdata items

Check this Fiddle

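$("span[itemprop='name']").each(function(index, el) {
    if ($(el).text() === 'Blendmagic') {
        // Read the rating from the same item as the matching name,
        // rather than relying on the two lists lining up by index.
        alert($(el).closest('[itemscope]').find("meta[itemprop='ratingValue']").attr('content'));
    }
});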
