
Search & replace regex - filtering files

A little bit of background: I work at a multilingual communication company that uses a CMS. Since its last update, every file I export from the system is 'polluted' with metadata that I don't want to see, use or replace. To filter and change a heap of XML files, I use Powergrep, which works with regexes.

I want my regex to find, e.g., "there is no spoon", "oracle", "I know kung-fu" and "bending method" (all with straight quotation marks) and replace them with “there is no spoon”, “oracle”, “I know kung-fu” and “bending method” (all with curly quotation marks).

I don't want it to find the metadata "concept.dtd" and "map.dtd". The following lines are the first lines of my XML file. It's this "concept.dtd" that I would like to ignore.

<?xml version="1.0" encoding="UTF-16" standalone="no"?>
<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[
]>
<?ish ishref="GUID-6B84EF92-DA99-4C54-BA91-FD0A113D4A96" version="1" lang="sv" srclng="en"?>

This is somewhere in the middle of the xml file

<row>
<entry colname="col1" valign="middle" align="left">"Bending method" </entry>
<entry colname="col2" valign="middle" align="left">another word</entry>
</row>

So.. this is the original regex:

(?<!=)"\b(.+?)\b"(?! \[)

Replacement:

“\1”

Problem: As the metadata “concept.dtd” and “map.dtd” are part of the file, I don't want to replace their quotation marks in order not to change anything crucial. So I tried rewriting the regex:

(?<!=)"\b(.+?[\.d])\b"(?! \[)

It almost works: “concept.dtd” and “map.dtd” are skipped, most of the terms between quotation marks are found, but not all: “Bending method” is not found, for example.

What am I missing? Any help or opinions would be greatly appreciated!
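
A possible reading of the symptom, offered as a sketch rather than a confirmed diagnosis: as printed, the class `[\.d]` would require a term to end in `.` or `d`, which does not fit the behaviour described. The description does fit the negated form `[^\.d]` (an assumption), which rejects every quoted term whose last character is `.` or `d`. That excludes "concept.dtd" and "map.dtd", but it also excludes "Bending method". The check below uses Python's `re` module instead of Powergrep; the first two sample lines come from the question, the third is made up for illustration.

```python
import re

# Original pattern from the question, written with straight quotes.
original = re.compile(r'(?<!=)"\b(.+?)\b"(?! \[)')
# Assumed form of the second attempt: negated class [^\.d] (not confirmed).
second = re.compile(r'(?<!=)"\b(.+?[^\.d])\b"(?! \[)')

samples = [
    '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[',
    '<entry colname="col1" valign="middle" align="left">"Bending method" </entry>',
    '<p>She said "there is no spoon" and left.</p>',  # hypothetical line
]

def quoted_terms(pattern, line):
    """Return the terms that the pattern captures on one line."""
    return [m.group(1) for m in pattern.finditer(line)]

for line in samples:
    print(quoted_terms(original, line), quoted_terms(second, line))
```

With the negated class, the term ending in `d` ("Bending method") is dropped together with the DTD names, while a term ending in another letter is still found.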

Based on your last answers, here is a regexp that can help you:

(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)

Description

[Image: regular expression visualization]

Demo

http://regex101.com/r/lX2cU3
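
The pattern can also be tried outside regex101. The sketch below runs it with Python's `re` module, which is an assumption made for illustration (the question uses Powergrep). Because the match also consumes the attributes and the text around the quoted phrase, the replacement has to put those back; here a callback rebuilds the match and changes only capture group 1. The function name `curl_entry_quotes` is hypothetical.

```python
import re

# The answer's pattern, unchanged; group 1 is the quoted phrase with its quotes.
ENTRY_QUOTED = re.compile(r'(?<=<entry)[^>]+>[^<>]*?(".+?")[^<>]*?(?=<\x2Fentry>)')

def curl_entry_quotes(xml_text):
    """Swap straight quotes for curly ones inside <entry> text only."""
    def rebuild(match):
        whole = match.group(0)
        # Offsets of group 1 relative to the start of the whole match.
        start = match.start(1) - match.start(0)
        end = match.end(1) - match.start(0)
        phrase = match.group(1)[1:-1]  # drop the two straight quotes
        # U+201C and U+201D are the left and right curly quotation marks.
        return whole[:start] + "\u201c" + phrase + "\u201d" + whole[end:]
    return ENTRY_QUOTED.sub(rebuild, xml_text)

sample = (
    '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[\n'
    ']>\n'
    '<row>\n'
    '<entry colname="col1" valign="middle" align="left">"Bending method" </entry>\n'
    '<entry colname="col2" valign="middle" align="left">another word</entry>\n'
    '</row>\n'
)

print(curl_entry_quotes(sample))
```

Only the quotes around the phrase in the first `<entry>` change; the attribute values and the DOCTYPE line keep their straight quotes.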

Discussion

I assume that you have one series of words between straight quotation marks and that there are no carriage returns or line feeds inside an <entry> node.
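
If an entry can contain more than one quoted phrase, that assumption no longer holds and the single capture group rewrites only one of them. One possible workaround, not part of the original answer and sketched in Python instead of Powergrep, is to limit the replacement to text that sits between a `>` and the next `<`. Attribute values, processing instructions and the DOCTYPE line all lie inside `<...>` and are left alone. The names `TEXT_NODE`, `STRAIGHT_PAIR` and `curl_text_nodes` are hypothetical, and the sketch assumes no comments, CDATA sections or literal `>` characters inside attribute values.

```python
import re

# Text between a '>' and the next '<', i.e. outside of any tag.
TEXT_NODE = re.compile(r'(?<=>)[^<>]+(?=<)')
# A pair of straight quotes on one line with at least one character between.
STRAIGHT_PAIR = re.compile(r'"([^"\r\n]+)"')

def curl_text_nodes(xml_text):
    """Replace "phrase" with curly-quoted phrase, in text nodes only."""
    def fix_node(match):
        # \1 is the phrase; U+201C / U+201D are the curly quotation marks.
        return STRAIGHT_PAIR.sub("\u201c\\1\u201d", match.group(0))
    return TEXT_NODE.sub(fix_node, xml_text)

sample = (
    '<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"[\n'
    ']>\n'
    '<row>\n'
    '<entry colname="col1" align="left">"oracle" and "I know kung-fu"</entry>\n'
    '</row>\n'
)

print(curl_text_nodes(sample))
```

Both phrases in the entry are converted in one pass, while `colname="col1"` and "concept.dtd" keep their straight quotes.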
