Replace non-ASCII characters with SGML entity codes with Emacs
I have an HTML file with a few non-ASCII characters, say encoded in UTF-8 or UTF-16. To save the file in ASCII, I would like to replace them with their (SGML/HTML/XML) entity codes. So, for example, every ë should become &#235; and every ◊ should become &#9674;. How do I do that?

I use Emacs as an editor. I'm sure it has a function to do the replacement, but I cannot find it. What am I missing? Or how do I implement it myself?
I searched high and low, but it seems Emacs (or at least version 24.3.1) doesn't have such a function, nor could I find one elsewhere.
Based on a similar (but different) function I did find, I implemented it myself:
(defun html-nonascii-to-entities (string)
  "Replace any non-ASCII characters in STRING with HTML (actually SGML) entity codes."
  (mapconcat
   (lambda (char)
     ;; Keep characters 8..126 (printable ASCII plus tab, newline, etc.);
     ;; everything else becomes a decimal character reference.
     (if (and (<= 8 char)
              (<= char 126))
         (char-to-string char)
       (format "&#%02d;" char)))
   string
   ""))

(defun html-nonascii-to-entities-region (region-begin region-end)
  "Replace any non-ASCII characters in the region with HTML (actually SGML) entity codes."
  (interactive "r")
  (save-excursion
    (let ((escaped (html-nonascii-to-entities
                    (buffer-substring region-begin region-end))))
      (delete-region region-begin region-end)
      (goto-char region-begin)
      (insert escaped))))
I'm no Elisp guru at all, but this works!
I also found find-next-unsafe-char to be of value.
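For comparison, the same string-level conversion can be sketched with replace-regexp-in-string and the [:ascii:] character class. The name my-html-escape-nonascii is made up for illustration. Unlike the function above, this sketch leaves every ASCII character untouched, including control characters.

```elisp
;; A minimal sketch; `my-html-escape-nonascii' is a hypothetical name.
(defun my-html-escape-nonascii (string)
  "Return STRING with each non-ASCII character replaced by a decimal reference."
  (replace-regexp-in-string
   "[^[:ascii:]]"
   ;; The function receives the matched text (a one-character string).
   (lambda (match) (format "&#%d;" (string-to-char match)))
   string
   t    ; FIXEDCASE: do not adjust the case of the replacement
   t))  ; LITERAL: insert the replacement text verbatim

(my-html-escape-nonascii "\u00eb and \u25ca") ; => "&#235; and &#9674;"
```

Evaluating the last form with C-x C-e shows the result in the echo area.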
Edit: an interactive version!
(defun query-replace-nonascii-with-entities ()
  "Query-replace non-ASCII characters with HTML (actually SGML) entity codes."
  (interactive)
  (perform-replace "[^[:ascii:]]"
                   ;; A (FUNCTION . DATA) cons: FUNCTION is called with DATA
                   ;; and the replacement count to compute each replacement.
                   (cons (lambda (_data _count)
                           (format "&#%02d;" ; Hex: "&#x%x;"
                                   (string-to-char (match-string 0))))
                         nil)
                   t t nil))
I think you are looking for iso-iso2sgml.
There is a character class which includes exactly the ASCII character set. You can use a regexp that matches its complement to find occurrences of non-ASCII characters, and then replace them with their codes using elisp:
M-x replace-regexp RET
[^[:ascii:]] RET
\,(concat "&#" (number-to-string (string-to-char \&)) ";") RET
So when, for example, á is matched: \& is "á", string-to-char converts it to ?á (= the number 225), and number-to-string converts that to "225". Then, concat concatenates "&#", "225" and ";" to get "&#225;", which replaces the original match.
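The chain of conversions described above can be checked outside replace-regexp by wrapping it in a small function. This is only an illustrative sketch, and my-entity-for-match is a hypothetical name standing in for the \,(...) expression.

```elisp
;; Hypothetical helper mirroring the \,(concat ...) replacement expression.
(defun my-entity-for-match (match)
  "Return the decimal character reference for the first character of MATCH."
  (concat "&#"
          (number-to-string (string-to-char match)) ; "á" -> 225 -> "225"
          ";"))

(my-entity-for-match "\u00e1") ; "\u00e1" is á; => "&#225;"
```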
Surround the replace-regexp invocation with C-x ( and C-x ), and apply C-x C-k n and M-x insert-kbd-macro as usual to make a function out of it.
To see the elisp equivalent of calling this function interactively, run the command and then press C-x M-: (Repeat complex command).
A simpler version, which doesn't take into account the active region, could be:
(while (re-search-forward "[^[:ascii:]]" nil t)
  (replace-match (concat "&#"
                         (number-to-string (string-to-char (match-string 0)))
                         ";")))
(This uses the recommended way to do search + replace programmatically.)
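If the active region should be respected, one possible sketch is to narrow the buffer to the region before running the same loop. The command name my-nonascii-to-entities-region is an assumption made for this example, not an existing Emacs command.

```elisp
;; A minimal sketch; `my-nonascii-to-entities-region' is a hypothetical name.
(defun my-nonascii-to-entities-region (beg end)
  "Replace non-ASCII characters between BEG and END with decimal references."
  (interactive "r")
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the search inside the region even as the
      ;; replacements make the text longer.
      (narrow-to-region beg end)
      (goto-char (point-min))
      (while (re-search-forward "[^[:ascii:]]" nil t)
        (replace-match (format "&#%d;" (string-to-char (match-string 0)))
                       t t)))))  ; fixed case, literal replacement text
```

With a region selected, M-x my-nonascii-to-entities-region converts only that region; calling it from Lisp with (point-min) and (point-max) converts the whole buffer.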