Remove non-ASCII characters from a string using Python / Django

Question

I have a string of HTML stored in a database. Unfortunately it contains characters such as ®. I want to replace these characters with their HTML equivalents, either in the DB itself or with a find-and-replace in my Python / Django code.

Any suggestions on how I can do this?

Answer 1

The ASCII characters are the first 128 code points, so you can get the code point of each character with ord and strip the character if it is out of range.

# -*- coding: utf-8 -*-

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    # ASCII covers code points 0 through 127 inclusive
    stripped = (c for c in string if ord(c) < 128)
    return ''.join(stripped)


test = u'éáé123456tgreáé@€'
print(test)
print(strip_non_ascii(test))

Result

éáé123456tgreáé@€
123456tgre@

Please note that @ is included because, after all, it's an ASCII character. If you want to keep only a particular subset (such as digits and uppercase and lowercase letters), you can narrow the range by consulting an ASCII table.
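As a minimal sketch of narrowing the range, the following keeps only ASCII letters and digits. The names ALLOWED and keep_alphanumeric are hypothetical and not part of the original answer; Python 3 is assumed.

```python
import string

# Characters to keep: ASCII letters and digits only (an illustrative choice).
ALLOWED = set(string.ascii_letters + string.digits)


def keep_alphanumeric(text):
    """Return text with every character outside ALLOWED removed."""
    return ''.join(c for c in text if c in ALLOWED)


print(keep_alphanumeric('éáé123456tgreáé@€'))  # 123456tgre
```

Using an explicit set of allowed characters avoids hard-coding numeric ranges from the ASCII table.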

EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.
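If the goal is to replace the characters in Python code rather than in a template, one possible approach is the standard xmlcharrefreplace error handler, which turns each non-ASCII character into a numeric character reference and leaves existing markup untouched. This is a sketch assuming Python 3; the function name to_ascii_html is hypothetical.

```python
def to_ascii_html(text):
    """Replace each non-ASCII character with a numeric character reference."""
    # Characters that cannot be encoded as ASCII become &#NNN; references.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_ascii_html('Acme® costs 5€'))  # Acme&#174; costs 5&#8364;
```

The result is pure ASCII, so it renders the same regardless of the encoding the page is served in.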

Answer 2

I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.

def unicode_escape(unistr):
    """
    Tidies up unicode characters into HTML-friendly numeric entities

    Takes a unicode string as an argument

    Returns a unicode string
    """
    try:
        from html.entities import codepoint2name  # Python 3
    except ImportError:
        from htmlentitydefs import codepoint2name  # Python 2
    escaped = ""

    for char in unistr:
        # Note: codepoint2name also lists &, <, > and ", so those are escaped too
        if ord(char) in codepoint2name:
            # Emit a numeric character reference, e.g. ® becomes &#174;
            escaped += "&#" + str(ord(char)) + ";"

        else:
            escaped += char

    return escaped

Use it like this

>>> from zack.utilities import unicode_escape
>>> unicode_escape(u'such as ® I want')
u'such as &#174; I want'
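A variation on the same idea, sketched here for Python 3, emits named entities such as &reg; where the standard table has one and falls back to numeric references otherwise. It only touches characters above code point 127, so existing HTML tags are left alone. The function name to_named_entities is hypothetical and not from the original answer.

```python
from html.entities import codepoint2name


def to_named_entities(text):
    """Replace non-ASCII characters with named or numeric HTML entities."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # ASCII, including markup such as < and >, passes through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # A named entity exists, e.g. 174 maps to 'reg'.
            out.append('&' + codepoint2name[cp] + ';')
        else:
            # No named entity: fall back to a numeric character reference.
            out.append('&#%d;' % cp)
    return ''.join(out)


print(to_named_entities('<b>such as ® I want</b>'))  # <b>such as &reg; I want</b>
```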

Answer 3

This code snippet may help you.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

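def removeNonAscii(string):
    # Works on byte strings (str in Python 2, bytes in Python 3):
    # deletes every byte in the range 0x80-0xFF
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)

# Sample input: the UTF-8 bytes of u'café'
string_to_remove_nonascii = b'caf\xc3\xa9'
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)

The encoding declaration on the second line matters when the source file itself contains non-ASCII string literals.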

Answer 4

There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481

To remove non-ASCII characters from a string, s, use:

s = s.encode('ascii',errors='ignore')

Then convert it from bytes back to a string using:

s = s.decode()

This all works in Python 3.6.
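The two steps can be wrapped in a single helper. This is a small sketch for Python 3; the function name strip_non_ascii is an illustrative choice.

```python
def strip_non_ascii(s):
    """Drop every character that cannot be encoded as ASCII."""
    # encode() silently discards unencodable characters; decode() returns a str.
    return s.encode('ascii', errors='ignore').decode('ascii')


print(strip_non_ascii('éáé123456tgreáé@€'))  # 123456tgre@
```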

Answer 5

To escape the special XML/HTML characters '<', '>', and '&', you can use cgi.escape (Python 2; cgi.escape was removed in Python 3.8 in favor of html.escape):

import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)

Will return:

'1 &lt; 4 &amp; 4 &gt; 1'

This is probably the bare minimum you need to avoid problems. For more, you have to know the encoding of your string. If it matches the encoding of your HTML document, you don't have to do anything more. If not, you have to convert it to the correct encoding.

test = test.decode("cp1252").encode("utf8")

This assumes that your string is cp1252 and that your HTML document is UTF-8.
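For Python 3, a rough equivalent of both steps might look like the sketch below. It uses html.escape from the standard library and assumes the stored data is available as cp1252 bytes; the sample values and variable names are made up for illustration.

```python
import html

# Escape the markup-significant characters.
test = "1 < 4 & 4 > 1"
escaped = html.escape(test)
print(escaped)  # 1 &lt; 4 &amp; 4 &gt; 1

# Convert cp1252 bytes to UTF-8 bytes (sample data, assumed to be cp1252).
raw = b'caf\xe9 \xae'
text = raw.decode('cp1252')        # a str containing é and ®
utf8_bytes = text.encode('utf-8')  # the same text as UTF-8 bytes
print(utf8_bytes)
```

Note that html.escape also escapes double and single quotes by default; pass quote=False to leave them alone.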

Answer 6

You shouldn't have to do anything, as Django will automatically escape characters:

See: http://docs.djangoproject.com/en/dev/topics/templates/#id2

6 answers

Answer 1
20 2010-04-30 08:16:56

Answer 2
3 2010-04-30 08:41:52

Answer 3
2 2017-07-19 07:58:47

Answer 4
2 2017-10-12 07:54:54

Answer 5
1 2010-04-30 08:54:29

Answer 6
0 2010-04-30 08:42:48