I've been trying to parse a website using DOM elements. Everything was working properly except for this one issue, which doesn't make sense to me.
There is a select box, and I need the contents of all its possible option values:
<select name="super_attribute[141]" id="attribute141" class="required-entry super-attribute-select">
<option value="">Choose size</option>
<option value="36" price="0">36</option>
<option value="38" price="0">38</option>
<option value="41" price="0">40</option>
<option value="43" price="0">42</option>
<option value="45" price="0">44</option>
<option value="47" price="0">46</option>
<option value="49" price="0">48</option>
</select>
I want to retrieve an array containing the values (either of innerHTML or 'value' attribute). I use this code:
foreach ($dom->getElementsByTagName('option') as $option_tag) {
    $sizes_list[] = $option_tag->getAttribute('value');
}
However, only one 'option' tag is ever returned, with an empty value. So I tried a different approach:
$item_options = $dom->getElementById('attribute141');
print(sizeof($item_options->childNodes)); // Prints "1"
foreach ($item_options->childNodes as $child) {
    $sizes_list[] = $child->getAttribute('value');
}
$cloth_item->setSizes($sizes_list);
And again it seems to find only this single empty value... Why can't I access the rest of the options?
When you parse an HTML page fetched from a URL, do not rely on the browser's page inspector: the inspector shows the DOM after the browser has parsed the page and run its JavaScript. Use the browser's "View page source" command instead or, better, save the raw response from PHP:
$html = file_get_contents( 'http://www.example.com/your/url.html' );
file_put_contents( '/Path/Local/Download/Page.html', $html );
Then open the downloaded file in a text editor to see the actual HTML you are working with.
In your specific case, you can retrieve only one <option> because there is only one <option> in the loaded page:
<div class="input-box">
<select name="super_attribute[141]" id="attribute141" class="required-entry super-attribute-select">
<option>בחר אפשרות...</option>
</select>
</div>
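The same check can be done from PHP instead of by eye. A minimal sketch, assuming the downloaded markup looks like the fragment above; the `$html` string is a shortened stand-in for the real page (with the placeholder text translated), and `$optionCount` is a name introduced here for illustration:

```php
<?php
// Stand-in for the downloaded page; in practice this would come from file_get_contents().
$html = <<<'HTML'
<html><body>
<div class="input-box">
<select name="super_attribute[141]" id="attribute141" class="required-entry super-attribute-select">
<option>Choose an option...</option>
</select>
</div>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; collect warnings silently
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$options = $xpath->query('//select[@id="attribute141"]/option');

// DOMNodeList exposes the number of matched nodes through its length property.
$optionCount = $options->length;
echo $optionCount, "\n";
```

If this reports a single option for the real page too, the parser is working correctly and the missing options simply are not in the HTML that the server sends.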
The other options are added by JavaScript. Their values are stored as JSON inside a script on the same page, and there is no clean way to retrieve them. You can use PhantomJS, but, as you can see here or in other Stack Overflow questions, that route is not easy from PHP.
A dirty way can be this: looking at HTML source, you can see that your data is in this format:
<script type="text/javascript">
var spConfig = new Product.Config({ (...) });
</script>
So you can retrieve all <script> nodes and search for the new Product.Config call.
With pure DOM:
$nodes = $dom->getElementsByTagName('script'); // Result: 70 nodes
Using DOMXPath:
$xpath = new DOMXPath( $dom );
$nodes = $xpath->query('//script[@type="text/javascript"]'); // Result: 58 nodes
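The XPath query can also do part of the filtering itself. A minimal sketch, assuming the configuration call appears literally as `new Product.Config` in an inline script; the sample HTML is a made-up, shortened stand-in for the real page:

```php
<?php
// Hypothetical, shortened stand-in for the real page source.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">var unrelated = 1;</script>
<script type="text/javascript">
var spConfig = new Product.Config({"attributes":{}});
</script>
</head><body></body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings silently
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// contains(., "...") tests the text content of each <script> node.
$nodes = $xpath->query('//script[contains(., "new Product.Config")]');

echo $nodes->length, "\n";
```

With this filter, only the scripts that mention the call are returned, so the loop that follows has far fewer nodes to inspect.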
Then loop through the nodes, match each one against a regular expression, and decode the captured JSON:
foreach( $nodes as $node )
{
    // Capture the argument passed to new Product.Config( ... );
    if( preg_match( '~new Product\.Config\((.+?)\);~', $node->nodeValue, $matches ) )
    {
        $data = json_decode( $matches[1] );
        break;
    }
}
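Two things can go wrong silently in this step: the JSON may span several lines, and the captured text may not be valid JSON. A slightly more defensive sketch of the same idea, assuming a hypothetical `$script` string in place of `$node->nodeValue`; the `s` modifier and the null check are additions, not part of the loop above:

```php
<?php
// Hypothetical script content; the real one is far longer.
$script = 'var spConfig = new Product.Config({"attributes":{"141":{"id":"141","code":"size"}}});';

$data = null;
// The "s" modifier lets "." match newlines, in case the JSON spans several lines.
if (preg_match('~new Product\.Config\((.+?)\);~s', $script, $matches)) {
    $data = json_decode($matches[1]);
    if ($data === null) {
        // json_decode() returns null when the captured text is not valid JSON.
        echo 'Invalid JSON: ', json_last_error_msg(), "\n";
    }
}

// Read a field only if decoding succeeded.
$code = ($data !== null) ? $data->attributes->{'141'}->code : null;
echo $code, "\n";
```

The lazy `(.+?)` stops at the first `);` it meets, so this approach would cut the JSON short if that sequence ever appeared inside a string value.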
At this point, $data holds the decoded JSON:
stdClass Object
(
    [attributes] => stdClass Object
        (
            [141] => stdClass Object
                (
                    [id] => 141
                    [code] => size
                    [label] => מידה
                    [options] => Array
                        (
                            [0] => stdClass Object
                                (
                                    [id] => 36
                                    [label] => 36
                                    [price] => 0
                                    [oldPrice] => 0
                                    [products] => Array
                                        (
                                            [0] => 93548
                                        )
                                )
                            (...)
                        )
                )
        )
)
So, to access the id of the first <option>, you can use this:
echo $data->attributes->{141}->options[0]->id; // Output: 36
# ↑ note the curly braces: 141 is not a valid property name, so it cannot be written as ->141
And so on:
echo $data->attributes->{141}->options[1]->id; // Output: 38
echo $data->attributes->{141}->options[1]->label; // Output: 38
echo $data->attributes->{141}->options[1]->price; // Output: 0
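To get the array the question asks for, the options can be collected in a loop. A minimal sketch, assuming the decoded data has the shape shown above; the `$json` string is a hypothetical, shortened sample with two options, and its values are written as strings, which may differ from the real page:

```php
<?php
// Hypothetical JSON in the shape shown above, shortened to two options.
$json = '{"attributes":{"141":{"id":"141","code":"size","options":['
      . '{"id":"36","label":"36","price":"0"},'
      . '{"id":"38","label":"38","price":"0"}]}}}';
$data = json_decode($json);

$sizes_list = array();
foreach ($data->attributes->{'141'}->options as $option) {
    // Collect the visible label; use $option->id instead to collect the option ids.
    $sizes_list[] = $option->label;
}

print_r($sizes_list);
```

The resulting `$sizes_list` can then be passed to `$cloth_item->setSizes()` as in the question.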