
PHP regular expression breaks

I have the following string in an HTML page.

BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);

I want to extract the src and the publisher_id from this string.

For this, I'm trying the following code:

$regex = '#\BookSelector.load\(.*?src=\"(.*?)\"}]\)#s';

preg_match($regex, $html, $matches);

$match = $matches[1];

But it always returns null.

What regex would select only the src?

And what regex would I need to capture the whole string inside BookSelector.load()?

Why isn't your regex working?

First, I'll answer why your regex isn't working:

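  1. You're using \B in your regex. That is not a literal B: it is an assertion that matches any position that is not a word boundary ( \b ), so the pattern really looks for ookSelector at a non-boundary position. It happens to match inside BookSelector here, but it is not what you want. Write a plain B and escape the dot: BookSelector\.load

  2. Your original text contains escaped quotes, but your regex doesn't account for those. This is what makes the match fail: in the pattern, \" matches a bare " character, whereas in the text every quote is preceded by a literal backslash ( src=\" ), so src=" is never found.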
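To make the second point concrete, here is a minimal sketch of a pattern that does account for the escaped quotes and captures only the src. It assumes the input looks exactly like the sample in the question; the $html sample and the final str_replace() clean-up are illustrative additions, not part of the original code.

```php
<?php
// Sample input copied from the question. A nowdoc keeps the backslashes literal.
$html = <<<'TXT'
BookSelector.load([{"index":25,"label":"Science","booktype":"pdf","payload":"<script type=\"text\/javascript\" charset=\"utf-8\" src=\"\/\/www.192.168.10.85\/libs\/js\/books.min.js\" publisher_id=\"890\"><\/script>"}]);
TXT;

// In a single-quoted PHP string, \\\\ becomes \\ in the pattern,
// which matches one literal backslash in the subject.
$regex = '#BookSelector\.load\(.*?src=\\\\"(.*?)\\\\"#s';

$src = null;
if (preg_match($regex, $html, $matches)) {
    // The capture is still JSON-escaped, so turn "\/" back into "/".
    $src = str_replace('\/', '/', $matches[1]);
}

var_dump($src);
```

This works for the sample, but it is brittle: it depends on the exact escaping of the embedded markup, which is why decoding the JSON first is the more robust route.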

The correct approach to solve this problem

Split this task into several parts, and solve them one at a time, using the best tool available for each.

  1. The data you need is encapsulated within a JSON structure. So the first step is obviously to extract the JSON content. For this purpose, you can use a regex.

  2. Once you have the JSON content, you need to decode it to get the data in it. PHP has a built-in function for that purpose: json_decode() . Pass it the input string with the second parameter set to true , and you'll get a nice associative array.

  3. Once you have the associative array, you can easily get the payload string, which contains the <script> tag contents.

  4. If you're absolutely sure that the order of attributes will always be the same, you can use a regex to extract the required information. If not, it's better to use an HTML parser such as PHP's DOMDocument to do this.

The whole code for this looks like:

// Extract the JSON string from the whole block of text
if (preg_match('/BookSelector\.load\((.*?)\);/s', $text, $matches)) {

    // Get the JSON string and decode it using json_decode()
    $json    = $matches[1];
    $content = json_decode($json, true)[0]['payload'];

    // Use DOMDocument to load the string, and get the required values
    $dom = new DOMDocument;
    $dom->loadHTML($content);

    $script_tag   = $dom->getElementsByTagName('script')->item(0);
    $script_src   = $script_tag->getAttribute('src');
    $publisher_id = $script_tag->getAttribute('publisher_id');

    var_dump($script_src, $publisher_id);
}

Output:

string(40) "//www.192.168.10.85/libs/js/books.min.js"
string(3) "890"
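
Step 4 above mentions that a regex is enough when the attribute order is guaranteed. A minimal sketch of that variant follows. It assumes src always comes before publisher_id and that both values are double-quoted; $content here is a hard-coded stand-in for the decoded payload.

```php
<?php
// Stand-in for the payload string obtained from json_decode() in step 3.
$content = '<script type="text/javascript" charset="utf-8" '
         . 'src="//www.192.168.10.85/libs/js/books.min.js" publisher_id="890"></script>';

// Assumption: src appears before publisher_id inside the same tag.
$pattern = '/\bsrc="([^"]*)"[^>]*\bpublisher_id="([^"]*)"/';

$src = null;
$publisher_id = null;
if (preg_match($pattern, $content, $m)) {
    $src          = $m[1];
    $publisher_id = $m[2];
}

var_dump($src, $publisher_id);
```

If the attributes can appear in either order, this pattern silently fails to match, so the DOMDocument version is the safer default.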
