Regex Search and Replace Pattern Needed

I converted a PDF to an EPUB file using calibre. When I view the EPUB on my smartphone, I see unnecessary line breaks.

I'd like to use regex to identify these situations:

<lower_case_character><space_character></p><p class="calibre2"><lower_case_character>

and convert them to:

<lower_case_character><space_character><lower_case_character>

Can someone provide the proper search and replace regular expressions?

Thanks.

I think you want to remove the unnecessary class attributes added by Calibre. I don't know whether you are trying to write a script that converts PDF to EPUB or you want to edit the EPUB separately. To edit the EPUB and remove the useless classes, you can simply extract the EPUB file: use WinRAR to extract its contents to a folder, edit the generated HTML files, and then zip the folder again to make it an EPUB.
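The extract, edit, and re-zip steps can also be scripted. The following is a minimal sketch, not a tested conversion tool: `rewrite_epub` and `transform` are hypothetical names, the HTML files are assumed to be UTF-8, and the only EPUB-specific care taken is to keep the `mimetype` entry uncompressed and in its original (first) position.

```python
import zipfile

def rewrite_epub(src_path, dst_path, transform):
    """Copy an EPUB, passing every (X)HTML file through transform (str -> str)."""
    with zipfile.ZipFile(src_path) as src, zipfile.ZipFile(dst_path, "w") as dst:
        # Entries are copied in their original order, so "mimetype" stays first.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename.lower().endswith((".html", ".xhtml", ".htm")):
                # Assumption: the content documents are UTF-8 encoded.
                data = transform(data.decode("utf-8")).encode("utf-8")
            # The "mimetype" entry of an EPUB must be stored uncompressed.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            dst.writestr(info.filename, data, compress_type=method)
```

A call such as `rewrite_epub("book.epub", "book-fixed.epub", fix_html)` would write a new file and leave the original untouched; `fix_html` stands for whatever string-to-string cleanup you want to apply.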

As long as the editor you are using supports lookarounds, try this for the "search":

(?<=[a-z])\b</p><p class="calibre\d">(?=[a-z])

In the "replace" field, simply put a space.
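If you would rather run the same search and replace from a script than from an editor, here is a small sketch using Python's `re` module. The pattern is the one above, unchanged; `join_paragraphs` and the sample string are made up for illustration.

```python
import re

# The search pattern from the answer, unchanged.
PATTERN = re.compile(r'(?<=[a-z])\b</p><p class="calibre\d">(?=[a-z])')

def join_paragraphs(html):
    """Replace a paragraph break between two lowercase letters with a space."""
    return PATTERN.sub(" ", html)

sample = '<p class="calibre2">the quick</p><p class="calibre2">brown fox.</p>'
print(join_paragraphs(sample))
# prints: <p class="calibre2">the quick brown fox.</p>
```

A break that follows a full stop or precedes a capital letter is left alone, because the lookarounds only accept lowercase letters on both sides.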

In the code above, (?<=[a-z]) is a "positive lookbehind": it requires a lowercase letter immediately before the block of text you want to replace, but does not replace that letter.

Likewise, (?=[a-z]) is a "positive lookahead": it requires a lowercase letter immediately after the block of text you want to replace, but does not replace that letter.

The \d after "calibre" matches a single digit, so it should catch other classes as well, such as calibre1 or calibre3.
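One thing to check in your own book: `\d` matches exactly one digit, so a two-digit class would be skipped. The sketch below compares `\d` with `\d+`; the class name `calibre12` is only an assumed example, not something taken from the question.

```python
import re

one_digit = re.compile(r'class="calibre\d"')
any_digits = re.compile(r'class="calibre\d+"')

for cls in ("calibre1", "calibre3", "calibre12"):
    attr = f'class="{cls}"'
    print(cls, bool(one_digit.search(attr)), bool(any_digits.search(attr)))
# prints:
# calibre1 True True
# calibre3 True True
# calibre12 False True
```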

You can try it out here: http://gskinner.com/RegExr/

The following is a little more robust and will also match Calibre tags that have extra whitespace on either side:

(?<=[a-z])(\b|\s)(</p><p class="calibre\d">)(\b|\s)(?=[a-z])
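To see what the whitespace handling changes, here is a short sketch that runs this pattern over three made-up samples with Python's `re` module. Note that `(\b|\s)` accepts at most one whitespace character on each side, so runs of several spaces or a space plus a newline would still not match.

```python
import re

# The more robust pattern from the answer, unchanged.
ROBUST = re.compile(r'(?<=[a-z])(\b|\s)(</p><p class="calibre\d">)(\b|\s)(?=[a-z])')

samples = [
    'some text</p><p class="calibre2">continues',   # no whitespace
    'some text </p><p class="calibre2">continues',  # space before </p>
    'some text</p><p class="calibre2"> continues',  # space after the tag
]
results = [ROBUST.sub(" ", s) for s in samples]
for r in results:
    print(r)
# each line prints: some text continues
```

The second sample is the case described in the question (a space before `</p>`); the space is consumed by the match and put back by the replacement.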

Try this:

(?x) (?<! \. (co|d ) )
(?<C>\b\p{L}+) [-] \s* 
</p> \s*   (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )* <p[^<>]*>
(?<D>[\p{L}]+\b )
|
(?x) (?<! \. (co|d ) )
(?<A>[\p{N}\p{L}–,—] )\s* (?<B>(</(\w+)>)*)?
</p> \s*   (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )* <p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )
|
(?x)(?-i)  (?<! \. (co|d ) )
(?<A>[\d\p{Ll}\p{N}] | \p{Ll}-)\s* (?<B>(</(\w+)>)*)?
</p> \s* (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{Ll}] )  (?i)
|
(?x)(?-i)  (?<! \. (co|d ) )
(?<A>[’] | \p{L}-)\s* (?<B>(</(\w+)>)*)?
</p> \s*  (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )  (?i)

|
(?x)(?i)  (?<! \. (co|d ) )
(?<A>\b (ca|Dr|Mr|Ms|Mrs|St) [.․] )\s* (?<B>(</(\w+)>)*)?
</p> \s*  (<(?<XX>div|p)[^<>]*>\s* </\g<XX>>\s* )*<p[^<>]*>
(?<C>(<(\w+)\b[^<>]*>)*)?
 \s*(?<D>[\p{L}] )  (?i)

Replace:

\g<A>\g<B> \g<C>\g<D>
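To illustrate how the replacement string puts the captured pieces back together, here is a deliberately simplified sketch in Python. It is not the pattern above: Python's standard `re` module does not accept the `(?<name>...)` or `\p{L}` syntax, so the sketch uses `(?P<name>...)` and plain `[a-z]`, and it covers only one alternative. The group names A to D are kept so the same replacement string applies; the sample HTML is made up.

```python
import re

# Simplified, hypothetical stand-in for one alternative of the long pattern.
SIMPLE = re.compile(
    r"""
    (?P<A>[a-z,])\s*            # last character of the broken paragraph
    (?P<B>(?:</\w+>)*)          # inline closing tags such as </i>
    </p>\s*<p[^<>]*>            # the paragraph break itself
    (?P<C>(?:<\w+\b[^<>]*>)*)   # inline opening tags such as <i>
    \s*(?P<D>[a-z])             # first letter of the continuation
    """,
    re.VERBOSE,
)

html = ('<p class="calibre2">an <i>italic</i></p>\n'
        '<p class="calibre2"><i>phrase</i> split.</p>')
fixed = SIMPLE.sub(r"\g<A>\g<B> \g<C>\g<D>", html)
print(fixed)
# prints: <p class="calibre2">an <i>italic</i> <i>phrase</i> split.</p>
```

Groups B and C carry any inline tags across the join, so markup such as italics is preserved while the paragraph break between A and D is replaced by a single space.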
