How to scrape a <script type="text/javascript"> tag in PHP?
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
</script>
What I want to extract is the value of csrf_token, that is, 686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8.
I already tried the code below, but it did not give the result I expected:
$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, '$url');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36');
curl_setopt($ch,CURLOPT_HTTPHEADER,array("accept-language: es-419,es;q=0.9"));
curl_setopt($ch,CURLOPT_TIMEOUT, 10);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
preg_match_all('(<script type="text/javascript">
var BCData = {"csrf_token":\"(.*)\","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};</script>)siU', $result, $matches1);
$titulo = $matches1[1][0];
echo $titulo;
You can probably grab the value assigned to the variable BCData and then decode it as JSON:
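// Capture the object literal assigned to BCData; one match is enough, so preg_match suffices
if (preg_match('/var\s+BCData\s*=\s*({.*?});/', $result, $matches)) {
    $data = json_decode($matches[1], true);
    // json_decode returns null on invalid JSON, so fall back to a message
    echo $data['csrf_token'] ?? 'csrf_token not found';
}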
This assumes that the value assigned inside the script tag is valid JSON, which seems to be true now but may not be true forever.
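The caveat above can be made explicit in code. The following is a minimal sketch, not part of the original answer: the helper name extractCsrfToken is hypothetical, the `s` modifier is an added assumption so that an object spanning several lines still matches, and the sample strings are shortened stand-ins for the real page. It returns null instead of failing when the pattern does not match or the captured text is not valid JSON.

```php
<?php
// Hypothetical helper (not from the original answer): returns the token,
// or null when BCData is missing or its value is not valid JSON.
function extractCsrfToken(string $html): ?string
{
    if (!preg_match('/var\s+BCData\s*=\s*({.*?});/s', $html, $m)) {
        return null; // no BCData assignment found
    }
    $data = json_decode($m[1], true);
    if (!is_array($data) || !isset($data['csrf_token']) || !is_string($data['csrf_token'])) {
        return null; // invalid JSON, or the key is absent
    }
    return $data['csrf_token'];
}

// Shortened sample shaped like the page in the question.
$sample = '<script type="text/javascript">var BCData = {"csrf_token":"abc123","product_attributes":{"sku":"STICKER_PACK"}};</script>';
var_dump(extractCsrfToken($sample));

// A JavaScript object literal that is not JSON (unquoted key) yields null.
$notJson = '<script>var BCData = {csrf_token: "abc123"};</script>';
var_dump(extractCsrfToken($notJson));
```

Returning null keeps the calling code in charge of what to do when the page layout changes, instead of emitting a warning on a missing array key.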
For reliability, the whole HTML document should be parsed by a DOM parser to isolate the <script> node.
Then use regex to carve out the JSON string. The m modifier makes ^ match the start of a line and $ match the end of a line. \K restarts the fullstring match so that no capture groups are needed.
Then, for reliability, parse the JSON string and access the desired value by key.
$html = <<<HTML
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
</script>
HTML;
echo preg_match(
        '~^var BCData = \K.*(?=;$)~m',
        $html,
        $match
    )
    ? json_decode($match[0])->csrf_token
    : 'pattern found no match';
Output:
686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8
Admittedly, I don't know how the input string may vary, so I can only build a pattern for the string provided.
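The snippet above applies the regex to the raw string and leaves out the DOM-parsing step that the answer recommends. One possible way to add that step is sketched below with PHP's built-in DOMDocument. The function name findBcDataToken and the shortened sample page are assumptions for illustration, and the sketch assumes the page uses \n line endings so that ;$ can match.

```php
<?php
// Shortened sample page; the token and script markup mirror the question.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"sku":"STICKER_PACK"}};
</script>
</head><body></body></html>
HTML;

// Hypothetical helper: isolate each <script> node, then carve out the JSON.
function findBcDataToken(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML rejects an empty string
    }
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('script') as $script) {
        // Same pattern as in the answer, applied to one script body at a time.
        if (preg_match('~^var BCData = \K.*(?=;$)~m', $script->textContent, $match)) {
            $data = json_decode($match[0]);
            return $data->csrf_token ?? null; // null if decoding failed or the key is absent
        }
    }
    return null;
}

echo findBcDataToken($html), PHP_EOL;
```

Restricting the regex to the text of individual script nodes means that similar-looking text elsewhere in the page, such as in comments or attributes, cannot produce a false match.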
The simplest expression to extract the CSRF token from the page:
# matches all occurrences of the format of the CSRF token
if (preg_match_all('/[a-f0-9]{64}/', $string, $matches))
{
# should equal the value of the transmitted CSRF
print_r($matches[0][0]);
}
This matches every instance of the "csrf_token":"..." portion of the JSON and extracts the token value into a named group:
// Match all occurrences
if (preg_match_all('/\"csrf_token\"\s?\:\s?\"(?<csrf>[a-f0-9]{64})\"/', $string, $matches)) {
// One or more token matches extracted from the JSON
print_r($matches['csrf']);
}
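The difference between the two patterns in this answer shows up when a page contains more than one 64-character hex string. The sketch below uses a made-up page fragment (the asset hash and both hex values are invented for illustration) in which an unrelated hash appears before the token: the bare pattern picks up the first hex run, while the keyed pattern only accepts the value that follows "csrf_token".

```php
<?php
// Hypothetical page fragment: an unrelated 64-character hex string
// (for example an asset hash) appears before the real token.
$assetHash = str_repeat('ab', 32);
$token     = str_repeat('cd', 32);
$string = '<link href="/assets/' . $assetHash . '.css">'
    . '<script>var BCData = {"csrf_token":"' . $token . '"};</script>';

// The bare pattern returns whichever 64-hex run comes first in the page.
preg_match('/[a-f0-9]{64}/', $string, $bare);

// The keyed pattern only accepts the value that follows "csrf_token".
preg_match('/"csrf_token"\s*:\s*"(?<csrf>[a-f0-9]{64})"/', $string, $keyed);

var_dump($bare[0] === $assetHash);    // the bare match is the asset hash
var_dump($keyed['csrf'] === $token);  // the keyed match is the token
```

If the page is known to contain only one such string, the bare pattern is enough; otherwise the keyed pattern is the safer choice.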