How do I scrape a <script type="text/javascript"> tag in PHP?

Question

My question is: how can I scrape this tag?

<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
</script>

What I want to extract is the value of csrf_token, i.e. 686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8.
I already tried the code below, but did not get the result I expected:

$ch = curl_init();
curl_setopt($ch,CURLOPT_URL, '$url');
curl_setopt($ch,CURLOPT_USERAGENT,'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36');
curl_setopt($ch,CURLOPT_HTTPHEADER,array("accept-language: es-419,es;q=0.9"));
curl_setopt($ch,CURLOPT_TIMEOUT, 10);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);

preg_match_all('(<script type="text/javascript">
var BCData = {"csrf_token":\"(.*)\","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};</script>)siU', $result, $matches1);
$titulo = $matches1[1][0];
echo $titulo;

I can't get the result.

Answer 1

You can probably grab the value assigned to the variable BCData and then decode it as JSON:

// Capture the object literal assigned to BCData (up to the first "};").
if (preg_match('/var\s+BCData\s*=\s*({.*?});/s', $result, $matches)) {
   // Decode the captured text as JSON into an associative array.
   $data = json_decode($matches[1], true);
   echo $data['csrf_token'];
}

This assumes that the value assigned inside the script tag is valid JSON, which seems to be true now but may not stay true forever.
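
Because of that caveat, a slightly more defensive variant could check both the match and the decode result before using the value. This is only a sketch: the function name extractCsrfToken is hypothetical (it does not appear in the answer), and the sample input is a shortened version of the script tag from the question.

```php
<?php
// Hypothetical helper (not part of the original answer):
// returns the token as a string, or null if it cannot be found.
function extractCsrfToken(string $html): ?string
{
    // Capture the object literal assigned to BCData (up to the first "};").
    if (!preg_match('/var\s+BCData\s*=\s*(\{.*?\});/s', $html, $m)) {
        return null; // the variable is not on the page
    }

    $data = json_decode($m[1], true);

    // json_decode() returns null when the captured text is not valid JSON.
    if (!is_array($data) || !isset($data['csrf_token'])) {
        return null;
    }

    return (string) $data['csrf_token'];
}

// Shortened sample input based on the question's script tag.
$html = '<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true}};
</script>';

var_dump(extractCsrfToken($html));
```

Returning null instead of echoing lets the caller decide what to do when the page layout changes.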


Answer 2

For reliability, the whole HTML document should be parsed by a DOM parser to isolate the <script> node.

Then use a regex to carve out the JSON string. The m modifier makes ^ match the start of a line and $ match the end of a line. \K restarts the full-string match so that no capture groups are needed.

Then, for reliability, parse the JSON string and access the desired value by key.

Code:

$html = <<<HTML
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true,"purchasing_message":null,"sku":"STICKER_PACK","upc":null,"stock":null,"instock":true,"stock_message":null,"weight":null,"base":false,"image":null,"price":{"without_tax":{"formatted":"$3.99","value":3.99,"currency":"USD"},"tax_label":"Tax"},"out_of_stock_behavior":"label_option","out_of_stock_message":"Out of stock","available_modifier_values":[],"available_variant_values":[7375],"in_stock_attributes":[7375],"selected_attributes":[]}};
</script>
HTML;

echo preg_match(
         '~^var BCData = \K.*(?=;$)~m',
         $html,
         $match
     )
     ? json_decode($match[0])->csrf_token
     : 'pattern found no match';

Output:

686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8

Admittedly, I don't know how the input string may vary, so I can only build a pattern for the string provided.
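
The snippet above applies the regex to the raw string and leaves out the DOM-parsing step. A minimal sketch of that first step, assuming PHP's bundled DOM extension is available, might look like the following. The sample document is a shortened, assumed stand-in for the real page, and the regex is the same one used above.

```php
<?php
// Sample document; in practice this would be the cURL response body.
$html = <<<'HTML'
<html><head>
<script type="text/javascript">
var BCData = {"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8","product_attributes":{"purchasable":true}};
</script>
</head><body></body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely perfectly valid
$dom->loadHTML($html);
libxml_clear_errors();

$token = null;
foreach ($dom->getElementsByTagName('script') as $script) {
    // Only the script that assigns BCData will match the pattern.
    if (preg_match('~^var BCData = \K.*(?=;$)~m', $script->textContent, $match)) {
        $data = json_decode($match[0]);
        $token = $data->csrf_token ?? null; // null if decoding failed
        break;
    }
}

echo $token ?? 'pattern found no match';
```

Running the regex only against the text of each <script> node means that markup elsewhere in the page cannot produce a false match.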

Answer 3

The simplest expression to extract the CSRF token from the page:

# matches all occurrences of the format of the CSRF token
if (preg_match_all('/[a-f0-9]{64}/', $string, $matches))
{
    # should equal the value of the transmitted CSRF
    print_r($matches[0][0]);
}
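
One thing to keep in mind with this approach is that the pattern matches any run of 64 lowercase hex characters, not only the token. The sketch below illustrates this with an invented page fragment: the data-hash attribute and its value are hypothetical and not taken from the question. It also adds \b word boundaries, which is an optional tweak so that the pattern does not match inside a longer hex string.

```php
<?php
// Hypothetical page fragment: another 64-character hex string
// (an invented "data-hash" value) appears before the token.
$string = 'data-hash="' . str_repeat('ab', 32) . '" '
        . '{"csrf_token":"686611cabde717e63c8ad811ac28ff1a2566168df14ec1439799dbfc0569f2c8"}';

preg_match_all('/\b[a-f0-9]{64}\b/', $string, $matches);

// Both hex strings match, so $matches[0][0] is not the token here.
print_r($matches[0]);
```

If the page may contain other hashes, matching on the "csrf_token" key or decoding the JSON is the safer choice.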

Answer 4

This specifically matches every instance of the "csrf_token":"..." portion of the JSON and captures the token value in a named group:

// Match all occurrences
if (preg_match_all('/\"csrf_token\"\s?\:\s?\"(?<csrf>[a-f0-9]{64})\"/', $string, $matches)) {

    // One or more token matches extracted from the JSON
    print_r($matches['csrf']);

}

Answer 1
1 2020-06-27 09:19:50

Answer 2
1 2020-06-27 10:19:55

Answer 3
0 2020-06-27 04:39:42

Answer 4
0 2020-06-27 09:16:33
