
Sigil editor: Regex to look for a - (hyphen) character in text, but not in html attributes

My problem: I use Sigil to edit xhtml files of an ebook.

When exporting from InDesign to ePub, I tick the option to remove forced line breaks. This removes all the - (hyphen) characters that InDesign generated automatically, but the hyphens I added manually while fine-tuning word breaks remain in the text. The limitation of Sigil's plain search: searching for - matches everything, including css class names.

Question: How do I construct a regex query that finds the - within the text, but not in the html code? Thank you!


What I have already tried: the suggestion from https://www.mobileread.com/forums/showpost.php?p=4099971&postcount=169, quoted here:

Here is a simple example that finds the word "title" when it is not inside a tag itself. It is the simplest regex search I could think of off the top of my head. It assumes there is no bare text in the body tag and that the xhtml is well formed.

I tried it and it appears to work. There are probably better, more exhaustive regexes that can handle even broken xhtml.

Code:

title(?=[^>]*<)

This basically says search for "title" but lookahead to make sure there are no closing tag chars ">" before you find the next opening tag char "<".
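As a rough illustration of the same lookahead idea outside Sigil, here is a minimal Python sketch that swaps "title" for a hyphen. The function name and the sample markup are made up for this example; only the pattern shape comes from the post above, and it carries the same assumptions (well-formed xhtml, no bare text after the last tag).

```python
import re

# A hyphen counts as "text" only if no ">" appears before the next "<".
# Inside a tag, the closing ">" always comes first, so the lookahead fails.
TEXT_HYPHEN = re.compile(r"-(?=[^>]*<)")

def strip_text_hyphens(xhtml: str) -> str:
    """Remove hyphens in text nodes, leaving tags and attributes alone."""
    return TEXT_HYPHEN.sub("", xhtml)

sample = '<p class="body-text">co-operate</p><span id="a-b">re-enter</span>'
print(strip_text_hyphens(sample))
# <p class="body-text">cooperate</p><span id="a-b">reenter</span>
```

Note that a hyphen in text with no tag after it is not matched at all, which is why the approach depends on there being no bare text.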

There are probably lookbehind versions that could work with reverse logic. And there are ways to use regex to find two strings while ignoring any intervening tags.

Give it a try. You could easily add a saved search to do that. But again, it will not handle find and replace of text that crosses element boundaries (that is, spans several nodes in the tree). That is the hard part, unless each matching substring corresponds one to one to a replacement substring, which in general need not be the case.

And of course if you use &lt; and &gt; inside strings to show a "tag" or code snippet, these would be found by mistake so reviewing each find before the replace would be needed.

Sigil uses the PCRE regex engine.

Thus, you can use

<[^<>]*>(*SKIP)(*F)|-

See the regex demo.

Details:

  • <[^<>]*>(*SKIP)(*F) - matches <, zero or more chars other than < and >, and then a >; the (*SKIP)(*F) verbs then discard that match and resume the search from the position where the failure occurred, i.e. just after the tag
  • | - or
  • - - a hyphen.
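For readers who want to try the same logic outside Sigil: Python's built-in re module does not offer the (*SKIP)(*F) verbs, so the sketch below swaps in a different but common technique. It matches "a whole tag OR a hyphen" and uses a replacement callback that puts tags back unchanged. The function name and the sample markup are hypothetical.

```python
import re

# First alternative consumes a whole tag; second matches a lone hyphen.
# Because tags are consumed first, hyphens inside them are never seen alone.
TAG_OR_HYPHEN = re.compile(r"<[^<>]*>|-")

def remove_text_hyphens(xhtml: str) -> str:
    """Delete hyphens outside tags; tags are re-emitted untouched."""
    def keep_tags(match: re.Match) -> str:
        token = match.group(0)
        # A tag match starts with "<": return it as is. Otherwise it is a
        # hyphen in the text, which is replaced with the empty string.
        return token if token.startswith("<") else ""
    return TAG_OR_HYPHEN.sub(keep_tags, xhtml)

sample = '<p class="body-text">co-operate</p><span id="a-b">re-enter</span>'
print(remove_text_hyphens(sample))
# <p class="body-text">cooperate</p><span id="a-b">reenter</span>
```

Unlike a lookahead that requires a following "<", this variant also handles text that is not followed by any tag.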

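NOTE: you might want to match any dash with [\p{Pd}\x{00AD}] in place of the plain - (and replace it with -). \p{Pd} is the Unicode "dash punctuation" category, and \x{00AD} is the soft hyphen, which is not part of that category.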
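To show which characters that class covers, here is a small Python sketch. Python's built-in re module has no \p{Pd}, so this uses unicodedata.category instead; the function name is made up, and the sketch works on plain text only, with no tag skipping.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # category Cf, so it needs its own check

def normalize_dashes(text: str) -> str:
    """Replace every dash-punctuation char (Pd) and soft hyphen with '-'."""
    return "".join(
        "-" if ch == SOFT_HYPHEN or unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )

# En dash (U+2013), soft hyphen (U+00AD) and em dash (U+2014) all become "-".
print(normalize_dashes("a\u2013b\u00adc\u2014d"))
# a-b-c-d
```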

