PowerShell - escaping fancy single and double quotes for regex and string replace

I'm working with HTML files created by Acrobat, which doesn't use proper HTML entities to escape Unicode characters. I need to include left and right single and double quotation marks in a regex pattern, but every attempt I've made at escaping these characters has failed in my script, even though the same code works from a regular PowerShell session.

For example, this find/replace does not work:

    $html = $html.Replace("`“", '&ldquo;')
    $html = $html.Replace("`”", '&rdquo;')
    $html = $html.Replace("`‘", '&lsquo;')
    $html = $html.Replace("`’", '&rsquo;')

...but it does work if I break into my script and run one of those replace lines from the debug prompt.

Edit: Here's a snippet of the markup I'm testing with right now:

<p style="padding-left: 5pt;text-indent: 17pt;line-height: 119%;text-align: justify;">To guide its readers the Hermetica makes use of the mystical astrological world-view that we have been discussing. It describes the creation of the world as a series of emanations, starting with the Light, who gave birth to a son called Logos. In the words of Hermes’s guide, Poimandres:</p><p style="padding-left: 24pt;text-indent: 0pt;line-height: 119%;text-align: justify;">“That Light,” he said, “is I, even Mind, the first God, who was before the watery substance which appeared out of the darkness; and the Logos which came forth the Light is son of God.”</p><p style="padding-left: 21pt;text-indent: 1pt;line-height: 119%;text-align: justify;">(Scott, Walter, translator, Hermetica: The Ancient Greek and Latin Writings Which Contain Religious or Philosophical Teachings Ascribed to Hermes Trismegistus, Boston: Shambhala: 1985, p. 117)</p>

If $html equals that string, my attempts to find and replace the characters appear to be futile.

Try using the Unicode code points instead of backtick-escaping the literal characters:

    $html = $html.Replace("`u{201C}", '&ldquo;')
    $html = $html.Replace("`u{201D}", '&rdquo;')
    $html = $html.Replace("`u{2018}", '&lsquo;')
    $html = $html.Replace("`u{2019}", '&rsquo;')
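
One caveat on the snippet above: the `u{...} escape is only understood by PowerShell 6 and later. In Windows PowerShell 5.1 a [char] cast yields the same characters. The following is a minimal sketch of that approach, not the answerer's code; the function name Convert-FancyQuotes and the lookup table are made up for illustration. Because the script contains only ASCII, the result does not depend on how the file itself is encoded.

```powershell
# Hypothetical helper (not from the original posts): replaces curly quotes
# with HTML entities without putting any non-ASCII character in the script.
function Convert-FancyQuotes {
    param([string]$Text)

    # Code point -> HTML entity. [char] casts work in Windows PowerShell 5.1
    # as well as PowerShell 6+, unlike the `u{...} escape (6+ only).
    $map = @{
        0x201C = '&ldquo;'   # left double quotation mark
        0x201D = '&rdquo;'   # right double quotation mark
        0x2018 = '&lsquo;'   # left single quotation mark
        0x2019 = '&rsquo;'   # right single quotation mark
    }

    foreach ($entry in $map.GetEnumerator()) {
        # String.Replace does a literal (non-regex) replacement.
        $Text = $Text.Replace([string][char]$entry.Key, $entry.Value)
    }
    return $Text
}

# Usage: build a sample string from code points, then convert it.
$sample = 'Hermes' + [char]0x2019 + 's guide: ' + [char]0x201C + 'That Light' + [char]0x201D
Convert-FancyQuotes $sample

# Regex alternative: .NET regex understands \uXXXX, so the pattern stays ASCII.
$sample -replace '\u201C', '&ldquo;' -replace '\u201D', '&rdquo;'
```

The hashtable keeps the four replacements in one place; since the entities never contain a curly quote, the order in which the entries are applied does not matter.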

Evidently, PowerShell mishandles non-ASCII characters in script files saved as UTF-8 without a BOM. Setting VS Code to save PowerShell scripts as UTF-8 with BOM allows the String.Replace method to operate as expected.
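
The likely explanation is that Windows PowerShell 5.1 decodes a script file that has no BOM using the system's ANSI code page, so each curly quote in the source (three bytes in UTF-8) is read as several unrelated characters. Text typed at the debug prompt never passes through that file decoding, which would explain why the same line works there. For scripts that already exist, a sketch like the one below re-saves a file with a BOM. The function name Add-Utf8Bom is hypothetical, and the code assumes the file really is UTF-8 to begin with.

```powershell
# Hypothetical helper: rewrites a file as UTF-8 with a byte order mark.
# Assumption: the file is currently valid UTF-8 (with or without a BOM).
function Add-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)

    # .NET file APIs need a full path, not one relative to the PS location.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    # EF BB BF is the UTF-8 BOM; nothing to do if it is already there.
    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and
        $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        return
    }

    # Decode as UTF-8, then write back with an encoder that emits the BOM.
    $text = (New-Object System.Text.UTF8Encoding $false).GetString($bytes)
    $withBom = New-Object System.Text.UTF8Encoding $true
    [System.IO.File]::WriteAllText($fullPath, $text, $withBom)
}

# Usage (the path is a placeholder):
# Add-Utf8Bom -Path .\Convert-Html.ps1
```

In VS Code, the equivalent editor setting should be "files.encoding": "utf8bom", which can be scoped to PowerShell files under a "[powershell]" block in settings.json.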
