Remove non-ASCII characters from a string using python / django

I have a string of HTML stored in a database. Unfortunately, it contains characters such as ®. I want to replace these characters with their HTML equivalents, either in the DB itself or with a find-and-replace in my Python / Django code.

Any suggestions on how I can do this?

You can use the fact that the ASCII characters are the first 128 code points: get the code point of each character with ord and strip the character if it is out of range.

# -*- coding: utf-8 -*-

def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    # ASCII covers code points 0 through 127
    stripped = (c for c in string if ord(c) < 128)
    return ''.join(stripped)


test = u'éáé123456tgreáé@€'
print(test)
print(strip_non_ascii(test))

Result

éáé123456tgreáé@€
123456tgre@

Note that @ is kept because, after all, it is an ASCII character. If you want to keep only a particular subset (such as digits and uppercase and lowercase letters), you can narrow the range by consulting an ASCII table.
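
As a minimal sketch of narrowing the range, the variant below keeps only ASCII letters and digits. The name keep_ascii_alnum is hypothetical, and the sketch uses the constants from the string module instead of hand-written ord ranges.

```python
import string

# Hypothetical helper: the set of characters we want to keep.
ALLOWED = set(string.ascii_letters + string.digits)


def keep_ascii_alnum(text):
    """Return text with everything except ASCII letters and digits removed."""
    return ''.join(c for c in text if c in ALLOWED)


print(keep_ascii_alnum(u'éáé123456tgreáé@€'))  # prints 123456tgre
```

Unlike the range check above, this also drops @, since it is neither a letter nor a digit.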

EDITED: After reading your question again, maybe you need to escape your HTML code so that all those characters appear correctly once rendered. You can use the escape filter in your templates.

I found this a while ago, so this isn't in any way my work. I can't find the source, but here's the snippet from my code.

try:
    from html.entities import codepoint2name  # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def unicode_escape(unistr):
    """
    Tidies up unicode characters into HTML friendly numeric entities

    Takes a unicode string as an argument

    Returns a unicode string
    """
    escaped = u""

    for char in unistr:
        # Only characters that have a named HTML entity are escaped.
        # Note that this includes &, <, > and " as well as ®.
        if ord(char) in codepoint2name:
            escaped += u"&#" + str(ord(char)) + u";"
        else:
            escaped += char

    return escaped

Use it like this

>>> from zack.utilities import unicode_escape
>>> print(unicode_escape(u'such as ® I want'))
such as &#174; I want
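
If the goal is only to turn non-ASCII characters into entities while leaving the HTML markup itself untouched, Python's built-in xmlcharrefreplace error handler may be enough. This is a different technique from the snippet above, sketched here for Python 3; the function name to_numeric_entities is made up for illustration.

```python
def to_numeric_entities(text):
    """Replace every non-ASCII character with a numeric character reference."""
    # Encoding to ASCII fails for characters such as ®; the
    # 'xmlcharrefreplace' handler substitutes &#NNN; instead of raising.
    return text.encode('ascii', 'xmlcharrefreplace').decode('ascii')


print(to_numeric_entities('such as ® I want'))      # prints such as &#174; I want
print(to_numeric_entities('<b>Brand®</b> & more'))  # prints <b>Brand&#174;</b> & more
```

Because <, > and & are plain ASCII, existing tags and entities in the stored HTML pass through unchanged.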

This code snippet may help you.

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

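def removeNonAscii(string):
    # Expects a byte string; deletes every byte in the range 0x80-0xFF
    nonascii = bytearray(range(0x80, 0x100))
    return string.translate(None, nonascii)

# string_to_remove_nonascii is the byte string you want to clean
nonascii_removed_string = removeNonAscii(string_to_remove_nonascii)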

The encoding declaration on the second line is very important here: it tells Python how to decode any non-ASCII literals in the source file.
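
For reference, on Python 3 the two-argument form of translate exists on bytes objects rather than on str. A sketch under that assumption, with a made-up function name and sample input:

```python
# Every byte value outside the ASCII range (0x80-0xFF).
NON_ASCII_BYTES = bytes(range(0x80, 0x100))


def remove_non_ascii_bytes(data):
    """Delete all non-ASCII bytes from a bytes object."""
    # A table of None means no byte is remapped; the second
    # argument lists the byte values to delete.
    return data.translate(None, NON_ASCII_BYTES)


sample = 'café ®'.encode('utf-8')      # b'caf\xc3\xa9 \xc2\xae'
print(remove_non_ascii_bytes(sample))  # prints b'caf '
```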

There's a much simpler answer to this at https://stackoverflow.com/a/18430817/5100481

To remove non-ASCII characters from a string, s, use:

s = s.encode('ascii',errors='ignore')

Then convert it from bytes back to a string using:

s = s.decode()

All of this was done using Python 3.6.
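
Put together, the two steps can be wrapped in a small helper. The name strip_non_ascii is illustrative, and the sketch assumes Python 3, where s is a str.

```python
def strip_non_ascii(s):
    """Drop every character that cannot be encoded as ASCII."""
    # errors='ignore' silently discards unencodable characters;
    # decoding the resulting bytes as ASCII yields a str again.
    return s.encode('ascii', errors='ignore').decode('ascii')


print(strip_non_ascii('éáé123456tgreáé@€'))  # prints 123456tgre@
```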

To escape the special XML/HTML characters '<', '>' and '&', you can use cgi.escape (removed in Python 3.8 in favor of html.escape):

import cgi
test = "1 < 4 & 4 > 1"
cgi.escape(test)

Will return:

'1 &lt; 4 &amp; 4 &gt; 1'

This is probably the bare minimum you need to avoid problems. Beyond that, you have to know the encoding of your string. If it matches the encoding of your HTML document, you don't have to do anything more. If not, you have to convert it to the correct encoding.

test = test.decode("cp1252").encode("utf8")

This assumes that your string is cp1252 and that your HTML document is utf8.
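
A rough Python 3 equivalent of the two steps above, using html.escape and an explicit bytes-to-str round trip. The sample byte string is an assumption for illustration.

```python
import html

# html.escape also escapes quote characters unless quote=False is passed.
print(html.escape("1 < 4 & 4 > 1", quote=False))  # prints 1 &lt; 4 &amp; 4 &gt; 1

# Assumed input: text ending in the registered sign (®), stored as cp1252 bytes.
raw = b'Brand\xae'
text = raw.decode('cp1252')       # bytes -> str
utf8_bytes = text.encode('utf8')  # str -> UTF-8 bytes
print(utf8_bytes)                 # prints b'Brand\xc2\xae'
```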

You shouldn't have to do anything, as Django templates automatically escape HTML special characters in variable output:

see: http://docs.djangoproject.com/en/dev/topics/templates/#id2
