
Regex Search and Replace Pattern Needed

I converted a PDF to an EPUB file using calibre. When I view the EPUB on my smartphone, I see unnecessary line breaks.

I'd like to use a regex to identify these situations:

<lower_case_character><space_character></p><p class="calibre2"><lower_case_character>

and convert it to:

<lower_case_character><space_character><lower_case_character>

Can someone provide the proper search and replace regular expressions?

Thanks.

I think you want to remove the unnecessary class attributes added by calibre. I don't know whether you are trying to write a script that converts PDF to EPUB or want to edit the EPUB separately. To edit the EPUB and remove the useless classes, extract it first: an EPUB is a ZIP archive, so a tool such as WinRAR can extract its contents to a folder. Edit the generated HTML files there, then re-zip the folder to make it an EPUB again.
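If you would rather script the extract and re-zip steps than do them by hand, here is a minimal Python sketch using only the standard library. The function names unpack_epub and repack_epub are my own, not part of calibre or any EPUB tool. One detail worth knowing: the EPUB container format expects the "mimetype" file to be the first entry in the archive and to be stored uncompressed, so the sketch writes it first.

```python
import zipfile
from pathlib import Path


def unpack_epub(epub_path, out_dir):
    """Extract every file in the EPUB (a ZIP archive) into out_dir."""
    with zipfile.ZipFile(epub_path) as zf:
        zf.extractall(out_dir)


def repack_epub(src_dir, epub_path):
    """Zip src_dir back into an EPUB.

    The 'mimetype' file, if present, is written first and uncompressed;
    everything else is deflated.
    """
    src = Path(src_dir)
    mimetype = src / "mimetype"
    with zipfile.ZipFile(epub_path, "w") as zf:
        if mimetype.is_file():
            zf.write(mimetype, "mimetype", compress_type=zipfile.ZIP_STORED)
        for path in sorted(src.rglob("*")):
            if path.is_file() and path != mimetype:
                zf.write(path, path.relative_to(src).as_posix(),
                         compress_type=zipfile.ZIP_DEFLATED)
```

Typical use would be unpack_epub("book.epub", "book_dir"), then editing the HTML files under book_dir, then repack_epub("book_dir", "book_fixed.epub"). The file and folder names here are placeholders.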

As long as the editor you are using has lookaround capabilities, try this for the "search":

(?<=[a-z])\b</p><p class="calibre\d">(?=[a-z])

In the "replace" simply put a space.

In the pattern above, (?<=[a-z]) is a "positive lookbehind": it requires a lowercase letter immediately before the block of text you want to replace, but does not include that letter in the match, so it is not replaced.

Likewise, (?=[a-z]) is a "positive lookahead": it requires a lowercase letter immediately after the block of text you want to replace, again without replacing it.

The \d after "calibre" (calibre\d) should catch other classes, such as calibre1 or calibre3. Note that \d matches exactly one digit; use \d+ if your file also has classes such as calibre10.

You can try it out here: http://gskinner.com/RegExr/
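To check the behaviour outside an editor, here is a small Python sketch of the same search and replace. Python's built-in re module supports these lookarounds; the sample strings and the function name join_paragraphs are made up for illustration.

```python
import re

# Same pattern as above: the lookarounds check for lowercase letters
# on both sides but leave them out of the match.
PATTERN = re.compile(r'(?<=[a-z])\b</p><p class="calibre\d">(?=[a-z])')


def join_paragraphs(html):
    """Replace a paragraph break between two lowercase letters with a space."""
    return PATTERN.sub(" ", html)


sample = '<p class="calibre2">the quick brown</p><p class="calibre2">fox jumps.</p>'
print(join_paragraphs(sample))
# -> <p class="calibre2">the quick brown fox jumps.</p>

# A real paragraph break (full stop, then a capital) is left alone.
print(join_paragraphs('End.</p><p class="calibre2">Next'))
# -> End.</p><p class="calibre2">Next
```

Note that this pattern does not match when a space sits between the last letter and </p>, because the lookbehind must see a lowercase letter directly before the tag.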

The following is a little more robust: it also matches when the calibre tags have a single whitespace character on either side:

(?<=[a-z])(\b|\s)(</p><p class="calibre\d">)(\b|\s)(?=[a-z])
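A quick way to see what this variant does and does not cover is to run it over a few hand-made strings. This is only a sketch using Python's re module; the test strings and the name join_paragraphs_ws are my own.

```python
import re

# Each (\b|\s) matches either a zero-width word boundary or exactly
# one whitespace character next to the tags.
ROBUST = re.compile(
    r'(?<=[a-z])(\b|\s)(</p><p class="calibre\d">)(\b|\s)(?=[a-z])'
)


def join_paragraphs_ws(html):
    """Replace the break, plus at most one space on each side, with a space."""
    return ROBUST.sub(" ", html)


print(join_paragraphs_ws('brown </p><p class="calibre2">fox'))  # -> brown fox
print(join_paragraphs_ws('brown</p><p class="calibre2"> fox'))  # -> brown fox
print(join_paragraphs_ws('brown</p><p class="calibre2">fox'))   # -> brown fox
```

Two or more whitespace characters in a row on one side are not matched; widening the lookbehind side would need a different construction, since the letter must sit directly before the (\b|\s) group.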

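Try this as the search pattern (it uses free-spacing mode, (?x), so the whitespace and line breaks inside it are ignored):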

(?x) (?<! \. (co|d ) )
(?<C>\b\p{L}+) [-] \s* 
</p> \s*   (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )* <p[^<>]*>
(?<D>[\p{L}]+\b )
|
(?x) (?<! \. (co|d ) )
(?<A>[\p{N}\p{L}–,—] )\s* (?<B>(</(\w+)>)*)?
</p> \s*   (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )* <p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )
|
(?x)(?-i)  (?<! \. (co|d ) )
(?<A>[\d\p{Ll}\p{N}] | \p{Ll}-)\s* (?<B>(</(\w+)>)*)?
</p> \s* (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{Ll}] )  (?i)
|
(?x)(?-i)  (?<! \. (co|d ) )
(?<A>[’] | \p{L}-)\s* (?<B>(</(\w+)>)*)?
</p> \s*  (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )  (?i)

|
(?x)(?i)  (?<! \. (co|d ) )
(?<A>\b (ca|Dr|Mr|Ms|Mrs|St) [.․] )\s* (?<B>(</(\w+)>)*)?
</p> \s*  (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )  (?i)

Replace:

\g<A>\g<B> \g<C>\g<D>
