
PHP pattern search and replace

I have a bunch of raw content in a database.

Some of it contains strings such as http://www.example.com/subfolder/name.pdf or /subfolder/name.pdf

I need a pattern replace that turns these into /wp-content/uploads/old/subfolder/name.pdf. There can be many levels of subfolders, for example /subfolder1/subfolder2/subfolder3/file.pdf

The patterns I use to find them are

/http[^\s]+pdf/
/href="\/[^\s]+pdf/

But how do I replace the matched pattern with another pattern (as in the example above)?

I have

search for /http:\/\/www.example.com(.*).pdf"/
replace with /wp-content/uploads/old$1.pdf"

search for /href="\/pdf(.*)\.pdf">/

This works fine until there is more than one PDF link in a single table cell.

Example:

<a href="/pdf/subdir/name.pdf">clickhere</a><a href="/pdf/subdir/name.pdf">2nd PDF</a>

This works fine until there is more than one PDF link in a single table cell.

Quantifiers are greedy by default: the regex engine consumes as much as it can while attempting a match. To reverse this behaviour, you can use a lazy quantifier, as explained in this post: Greedy vs. Reluctant vs. Possessive Quantifiers. Adding an extra ? after a quantifier makes it attempt the match while consuming as little as it can. To make your greedy construct lazy, use [^\s]+? .
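The difference can be seen on the two-link example from the question. This is only a minimal sketch: the second file name (other.pdf) is changed from the original so the two captures can be told apart, and ~ is used as the delimiter so the slashes need no escaping.

```php
<?php
// Two PDF links in the same table cell, as in the question's example
// (the second file name is an assumption, changed for illustration).
$html = '<a href="/pdf/subdir/name.pdf">clickhere</a>'
      . '<a href="/pdf/subdir/other.pdf">2nd PDF</a>';

// Greedy: (.*) runs on to the LAST '.pdf">' in the string,
// so both links are swallowed by a single match.
preg_match_all('~href="/pdf(.*)\.pdf">~', $html, $greedy);

// Lazy: (.*?) stops at the FIRST '.pdf">' it can reach,
// so each link is matched separately.
preg_match_all('~href="/pdf(.*?)\.pdf">~', $html, $lazy);

echo count($greedy[0]), " greedy match(es)\n";
echo count($lazy[0]), " lazy match(es)\n";
print_r($lazy[1]); // the captured paths, one per link
```

With the greedy pattern a replacement would rewrite only the outer span and leave the second link untouched, which is the behaviour described in the question.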

some containing string http://www.example.com/subfolder/name.pdf or /subfolder/name.pdf

But how to replace the pattern with another pattern?

As you can see, "http://www.example.com" is optional. You can make part of your pattern optional with a non-capturing (?:group) followed by a ? quantifier.

Pattern with an optional group:

(?:http://www\.example\.com)?/(\S+?)\.pdf
  • Don't forget to escape the dots, as they have a special meaning in regex.
  • Notice I used \S (capital "S") instead of [^\s] (the two are equivalent).
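As a quick check of the optional group, the sketch below runs the pattern against both forms from the question. The # delimiter and the variable names are illustrative choices, not part of the original answer.

```php
<?php
// '#' is used as the delimiter so the slashes in the URL need no escaping.
$re = '#(?:http://www\.example\.com)?/(\S+?)\.pdf#';

$samples = [
    'http://www.example.com/subfolder/name.pdf',   // absolute form
    '/subfolder1/subfolder2/subfolder3/file.pdf',  // root-relative form
];

$captured = [];
foreach ($samples as $sample) {
    // Group 1 holds the path between the leading "/" and ".pdf",
    // whether or not the host part was present.
    preg_match($re, $sample, $m);
    $captured[] = $m[1];
}

print_r($captured);
```

Because the host sits inside the non-capturing group, the captured path is still group 1 in both cases, so a single replacement string can serve both forms.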


One more thing: you may consider adding some boundaries to your pattern. I suggest using (?<!\w) (not preceded by a word character) and \b (a word boundary) to avoid matching part of another word (as I commented on your question).
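To illustrate what the boundaries exclude, here is a small comparison of the pattern with and without them. The three input strings are made up for the demonstration.

```php
<?php
$plain   = '#(?:http://www\.example\.com)?/(\S+?)\.pdf#';
$bounded = '#(?<!\w)(?:http://www\.example\.com)?/(\S+?)\.pdf\b#';

// Hypothetical inputs, chosen to exercise each boundary.
$inputs = [
    '/docs/file.pdf',   // a genuine link
    '/docs/file.pdfx',  // ".pdf" is only the start of a longer extension
    'word/name.pdf',    // the "/" is glued to a preceding word
];

$report = [];
foreach ($inputs as $input) {
    $report[$input] = [
        'plain'   => preg_match($plain, $input) === 1,
        'bounded' => preg_match($bounded, $input) === 1,
    ];
}

var_dump($report);
```

The plain pattern matches all three inputs, while the bounded one accepts only the first: \b rejects ".pdfx" and (?<!\w) rejects a slash that directly follows a word character.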

Regex:

(?<!\w)(?:http://www\.example\.com)?/(\S+?)\.pdf\b

Code:

$re = "@(?<!\\w)(?:http://www\\.example\\.com)?/(\\S+?)\\.pdf\\b@i"; 
$str = "some containing string http://www.example.com/subfolder/name.pdf
        or /subfolder/name.pdf
        <a href=\"/pdf/subdir/name.pdf\">clickhere</a>
        <a href=\"/pdf/subdir/name.pdf\">2nd PDF</a>"; 
$subst = "/wp-content/uploads/old/$1.pdf"; 

$result = preg_replace($re, $subst, $str);

Test in regex101

A sandbox example here: http://sandbox.onlinephpfunctions.com/code/cc47b98d16981b786cf2d573751b6a09a9725b90

$array = [
     "https://test.com/url/subfolder/subfolder/file.pdf",
     "https://test.com/url/subfolder1/subfolder/file.pdf",
     "/url/subfolder3/subfolder3/files.xml",
     "/url/subfolder/subfolder/file.pdf"
];

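// Strip an optional scheme and host from each URL and prepend $prepend to the remaining path.
// Every entry is rewritten, whatever its file extension.
function setwpUrl($urls, $prepend) {
    foreach ($urls as $i => $url) {
        // Group 1: optional "http(s)://host"; group 2: the rest of the string (the path).
        preg_match("/(https?:\/\/[a-zA-Z0-9\.\-]+)?(.*)/", $url, $out);
        $urls[$i] = $prepend . $out[2];
    }
    return $urls;
}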

$newUrls = setwpUrl($array, "/wp-content/uploads/old");

var_dump($newUrls);
