Replacing non-ASCII characters with SGML entity codes in Emacs

Question

I have an HTML file with a few non-ASCII characters, say encoded in UTF-8 or UTF-16. To save the file in ASCII, I would like to replace them with their (SGML/HTML/XML) entity codes. So, for example, every ë should become &#235; and every ◊ should become &#9674;. How do I do that?

I use Emacs as an editor. I'm sure it has a function to do the replacement, but I cannot find it. What am I missing? Or how do I implement it myself?

Answer 1

I searched high and low, but it seems Emacs (or at least version 24.3.1) doesn't have such a function, nor can I find one elsewhere.

Based on a similar (but different) function I did find, I implemented it myself:

(defun html-nonascii-to-entities (string)
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (mapconcat
   (lambda (char)
     ;; Characters with codes 8..126 are kept as they are;
     ;; everything else becomes a decimal character reference.
     (if (and (<= 8 char)
              (<= char 126))
         (char-to-string char)
       (format "&#%02d;" char)))
   string
   ""))
(defun html-nonascii-to-entities-region (region-begin region-end)
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive "r")
  (save-excursion
    (let ((escaped (html-nonascii-to-entities (buffer-substring region-begin region-end))))
      (delete-region region-begin region-end)
      (goto-char region-begin)
      (insert escaped))))

I'm no Elisp guru at all, but this works!

I also found find-next-unsafe-char to be of value.

Edit: an interactive version!

(defun query-replace-nonascii-with-entities ()
  "Replace any non-ascii characters with HTML (actually SGML) entity codes."
  (interactive)
  (perform-replace "[^[:ascii:]]"
                   `((lambda (data count)
                       (format "&#%02d;" ; Hex: "&#x%x;"
                               (string-to-char (match-string 0)))))
                     t t nil))

Answer 2

I think you are looking for iso-iso2sgml.
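
As a rough illustration of how that command might be used from Lisp, here is a minimal sketch. It assumes iso-iso2sgml lives in the iso-cvt library and takes the region bounds as its two arguments; the wrapper name my-buffer-latin1-to-sgml is hypothetical. As far as I know the command only covers ISO 8859-1 characters and emits named entities (ë would become &euml;), so a character such as ◊ would presumably be left alone.

```emacs-lisp
(require 'iso-cvt)  ; assumed home of `iso-iso2sgml'

;; Hypothetical wrapper: apply the conversion to the whole buffer
;; rather than to an interactively selected region.
(defun my-buffer-latin1-to-sgml ()
  "Run `iso-iso2sgml' over the entire current buffer."
  (interactive)
  (iso-iso2sgml (point-min) (point-max)))
```

Interactively, selecting a region and running M-x iso-iso2sgml should do the same thing without any wrapper.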

Answer 3

There is a character class which includes exactly the ASCII character set. You can use a regexp that matches its complement to find occurrences of non-ASCII characters, and then replace them with their codes using elisp:

M-x replace-regexp RET
[^[:ascii:]] RET
\,(concat "&#" (number-to-string (string-to-char \&)) ";") RET

So when, for example, á is matched: \& is "á", string-to-char converts it to ?á (= the number 225), and number-to-string converts that to "225". Then, concat concatenates "&#", "225" and ";" to get "&#225;", which replaces the original match.
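
The same chain of conversions can be tried outside a replacement, for instance in the *scratch* buffer. This is only a sketch of the steps just described; the function name my-numeric-entity is made up for the illustration.

```emacs-lisp
;; Hypothetical helper mirroring the replacement expression step by step.
(defun my-numeric-entity (match)
  "Return the decimal character reference for the one-character string MATCH."
  (let* ((code (string-to-char match))      ; "á" -> 225
         (digits (number-to-string code)))  ; 225 -> "225"
    (concat "&#" digits ";")))              ; -> "&#225;"

(my-numeric-entity "á")
```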

Surround these commands with C-x ( and C-x ), and apply C-x C-k n and M-x insert-kbd-macro as usual to make a function out of them.

To see the elisp equivalent of calling this function interactively, run the command and then press C-x M-: (Repeat complex command).

A simpler version, which doesn't take into account the active region, could be:

(while (re-search-forward "[^[:ascii:]]" nil t)
  (replace-match (concat "&#"
                         (number-to-string (string-to-char (match-string 0)))
                         ";")))

(This uses the recommended way to do search + replace programmatically.)
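
If the region should be respected after all, one possible approach is to narrow to it before running the same loop. The sketch below is an assumption about how that could look, not part of the answer above; the command name my-nonascii-to-entities is hypothetical.

```emacs-lisp
;; Hypothetical command: same search-and-replace loop, limited to a region.
(defun my-nonascii-to-entities (beg end)
  "Replace non-ASCII characters between BEG and END with numeric entities."
  (interactive "r")
  (save-excursion
    (save-restriction
      ;; Narrowing keeps the search inside the region even though the
      ;; text grows while replacements are inserted.
      (narrow-to-region beg end)
      (goto-char (point-min))
      (while (re-search-forward "[^[:ascii:]]" nil t)
        ;; FIXEDCASE and LITERAL are both t: insert the text verbatim.
        (replace-match (format "&#%d;" (string-to-char (match-string 0)))
                       t t)))))
```

Select some text and run M-x my-nonascii-to-entities; from Lisp, pass the two buffer positions explicitly.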


Answer 1
4 2013-09-06 08:15:34

Answer 2
2 2013-09-06 09:41:08

Answer 3
2 (accepted) 2013-09-07 10:28:01
