Regex: Match all occurrences between specified strings

I'm dealing with a bunch of text files that refer to image filenames. These filenames were sanitized (made lowercase, with whitespace replaced by hyphens), but the text referring to them was not.

I need to transform strings like this:

(image: uploaded IMAGE.jpg caption: this is my caption)
(image: uploaded IMAGE copy.jpeg caption: this is my caption)
(image: IMG_6087.png caption: this is my caption)
(image: IMG_6087 copy.gif)
(image: IMG_9999_copy.jpg)
(image: somehow, a comma.jpg)
(image: other ridic'ulous characters!.jpg)

to:

(image: uploaded-image.jpg caption: this is my caption)
(image: uploaded-image-copy.jpeg caption: this is my caption)
(image: img_6087.png caption: this is my caption)
(image: img_6087-copy.gif)
(image: img_9999_copy.jpg)
(image: somehow-a-comma.jpg)
(image: other-ridiculous-characters.jpg)

These strings are parts of larger blocks of text, but each one is on its own line, like so:

This is not a short guide to write about art. Go in, out of the window, inside New York’s stars qualities, dreams and schemes. People are gathered together, brewing coffee — you have seen their faces? The artists in Manhattan.

(image: manhattan photo.jpg)

Drive till sunset and say goodbye to your body, because this is not a photograph. I saw sixteen americans, raised by wolves, probably lost in paradise city. I found your head — Do you still want it?

I'm using Sublime Text and was planning on doing multiple Replace Alls:

  1. strip whitespace (replace it with hyphens)
  2. strip characters that are not alphanumeric or _ or -
  3. make lowercase

But I can't manage to capture all instances of something between the two delimiters.

(?<=^\(image: )[what do I do here??](?=\.jpe?g|png|gif)

You can use a non-greedy match-all, .*?

So ^\(image: (.*?\.(?:jpe?g|png|gif)) captures the filename including its extension.
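To see what that capture group returns, here is a minimal sketch in Python (a swapped-in language, since the question itself is about Sublime Text's Find and Replace). It uses the pattern above with the extension alternatives in a non-capturing group `(?:...)`. The names `FILENAME_RE`, `find_filenames` and `sample` are made up for this illustration, and `re.MULTILINE` is assumed so that `^` matches at the start of every line.

```python
import re

# Lazy capture of everything after "(image: " up to the first known extension.
# re.MULTILINE makes ^ match at the start of each line, not only of the text.
FILENAME_RE = re.compile(r"^\(image: (.*?\.(?:jpe?g|png|gif))", re.MULTILINE)


def find_filenames(text):
    """Return every image filename (including its extension) found in text."""
    return FILENAME_RE.findall(text)


sample = (
    "The artists in Manhattan.\n"
    "\n"
    "(image: uploaded IMAGE copy.jpeg caption: this is my caption)\n"
    "(image: IMG_6087 copy.gif)\n"
)

print(find_filenames(sample))
# ['uploaded IMAGE copy.jpeg', 'IMG_6087 copy.gif']
```

Because `.*?` is lazy, the match stops at the first extension it meets, so the caption text after the filename is left alone.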

You can grab the filenames with:

(?<=image:\s)([^.]++)(?=\.jpe?g|\.png|\.gif)

After that, the transformations depend on the language that you're working in. Add file extensions as you need them. Right now you support jpg, jpeg, png, and gif.

Here is a working way to do it in PHP:
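As a rough illustration of how the lookarounds behave, here is a Python sketch (Python is my choice here, not something the answer specifies). It swaps the possessive `[^.]++` for a plain `[^.]+`, because older versions of Python's `re` module do not accept possessive quantifiers. `BASENAME_RE`, `find_basenames` and `sample` are hypothetical names for this example.

```python
import re

# The lookbehind and lookahead only assert their context; neither "image: "
# nor the extension becomes part of the match, so only the base name is returned.
BASENAME_RE = re.compile(r"(?<=image:\s)([^.]+)(?=\.jpe?g|\.png|\.gif)")


def find_basenames(text):
    """Return the base name (no extension) of every image reference in text."""
    return BASENAME_RE.findall(text)


sample = "(image: somehow, a comma.jpg)\n(image: IMG_9999_copy.jpg)\n"

print(find_basenames(sample))
# ['somehow, a comma', 'IMG_9999_copy']
```

Note that `[^.]+` stops at the first dot, so a base name that itself contains a dot would not be matched by this pattern.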

<?php
$string =
"This is not a short guide to write about art. Go in, out of the window, inside New York’s stars qualities, dreams and schemes. People are gathered together, brewing coffee — you have seen their faces? The artists in Manhattan.

(image: uploaded IMAGE.jpg caption: this is my caption)
This is not a short guide to write about art. Go in, out of the window, inside New York’s stars qualities, dreams and schemes. People are gathered together, brewing coffee — you have seen their faces? The artists in Manhattan.

(image: uploaded IMAGE copy.jpeg caption: this is my caption)
(image: IMG_6087.png caption: this is my caption)
(image: IMG_6087 copy.gif) blah blah
(image: IMG_9999_copy.jpg)
(image: somehow, a comma.jpg)
(image: other ridic'ulous characters!.jpg)";

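// Match the base name and the extension of each "(image: ...)" reference,
// then rebuild the name: lowercase it and turn every non-word character into "-".
echo preg_replace_callback('~(?<=\(image: )(.*?)\.(jpg|jpeg|png|gif)~', function($matches)
{
    return preg_replace('~\W~', '-', stripslashes(strtolower($matches[1]))) . ".$matches[2]";
}, $string);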

?>

[EDIT] Regex explanation:

  • (?<=\(image: ) : a positive lookbehind. It checks that '(image: ' comes right before the match without making it part of the match.
  • (.*?) : captures everything before the image extension in a non-greedy (lazy) way, so it matches as little text as possible.
  • \.(jpg|jpeg|png|gif) : matches a literal . followed by one of the given extensions, and captures the extension for reuse.
  • ~ : the delimiter. I chose it only because it is very seldom used in strings, so there is no need to escape / with \.
  • \W : the opposite of \w. It matches any character that is not a letter, a digit or an underscore.

This will output (in view source):
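To make the effect of `\W` concrete, here is a small Python sketch of what the callback body does to a single base name (Python instead of PHP, and without the `stripslashes()` call, so it is an approximation rather than a port). The function name `naive_slug` is made up for this example.

```python
import re


def naive_slug(name):
    """Lowercase name, then replace each non-word character with one hyphen."""
    # \W matches anything that is not a letter, a digit or an underscore,
    # so spaces and punctuation are replaced one for one, underscores are kept.
    return re.sub(r"\W", "-", name.lower())


print(naive_slug("IMG_9999_copy"))                  # img_9999_copy
print(naive_slug("somehow, a comma"))               # somehow--a-comma
print(naive_slug("other ridic'ulous characters!"))  # other-ridic-ulous-characters-
```

Since every non-word character becomes its own hyphen, a comma followed by a space yields two hyphens in a row, and a trailing `!` leaves a trailing hyphen.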

This is not a short guide to write about art. Go in, out of the window, inside New York’s stars qualities, dreams and schemes. People are gathered together, brewing coffee — you have seen their faces? The artists in Manhattan.

(image: uploaded-image.jpg caption: this is my caption)
This is not a short guide to write about art. Go in, out of the window, inside New York’s stars qualities, dreams and schemes. People are gathered together, brewing coffee — you have seen their faces? The artists in Manhattan.

(image: uploaded-image-copy.jpeg caption: this is my caption)
(image: img_6087.png caption: this is my caption)
(image: img_6087-copy.gif) blah blah
(image: img_9999_copy.jpg)
(image: somehow--a-comma.jpg)
(image: other-ridic-ulous-characters-.jpg)

You can then fine-tune in the callback which characters you want to transform into what, with str_replace() for instance.

Hope it helps! ;)
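One possible fine-tuning, sketched in Python rather than PHP: drop punctuation first, then turn each run of whitespace into a single hyphen, which reproduces the filenames the question asks for. This is only a sketch under those assumptions; `IMAGE_RE`, `sanitize` and `fix_image_refs` are hypothetical names, and Python's `\w` is Unicode-aware, so accented letters would be kept.

```python
import re

# Same idea as the PHP pattern: lookbehind for "(image: ", lazy base name,
# then the extension in its own group.
IMAGE_RE = re.compile(r"(?<=\(image: )(.*?)\.(jpe?g|png|gif)")


def sanitize(name):
    """Lowercase, drop punctuation, and hyphenate whitespace in a base name."""
    name = name.lower()
    name = re.sub(r"[^\w\s-]", "", name)       # remove characters such as , ' !
    return re.sub(r"\s+", "-", name.strip())   # each whitespace run -> one hyphen


def fix_image_refs(text):
    """Rewrite every image reference in text with a sanitized filename."""
    return IMAGE_RE.sub(lambda m: sanitize(m.group(1)) + "." + m.group(2), text)


print(fix_image_refs("(image: somehow, a comma.jpg)"))
# (image: somehow-a-comma.jpg)
print(fix_image_refs("(image: other ridic'ulous characters!.jpg)"))
# (image: other-ridiculous-characters.jpg)
```

Removing punctuation before hyphenating the whitespace is what avoids the doubled and trailing hyphens.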

You could try JetBrains WebStorm, a front-end IDE that provides a lot of capabilities for performing regex operations in a readable way. Select the text you want to split and check it for delimiters or any whitespace.

You can get a 30-day trial version. I will also share the regex query with you shortly.

Also check out http://myregexp.com/ or some plugin to validate your regex queries.

Online Regex editor
