Remove punctuation from Unicode formatted strings

Question

I have a function that removes punctuation from a list of strings:

import re

def strip_punctuation(input):
    # Removes every character that is not an ASCII letter, digit, or space.
    x = 0
    for word in input:
        input[x] = re.sub(r'[^A-Za-z0-9 ]', "", input[x])
        x += 1
    return input

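I recently modified my script to use Unicode strings so that I could handle other, non-Western characters. The function breaks when it encounters these special characters and just returns empty Unicode strings. How can I reliably remove punctuation from Unicode formatted strings?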
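To make the failure concrete, here is a minimal sketch in Python 3 that applies the same pattern to a few strings. The sample strings and the name `strip_ascii_only` are made up for illustration. Because the pattern keeps only ASCII letters, digits, and spaces, every non-Latin letter is deleted along with the punctuation.

```python
import re

def strip_ascii_only(words):
    # Same pattern as in the question: keep only ASCII letters, digits, spaces.
    return [re.sub(r'[^A-Za-z0-9 ]', '', w) for w in words]

# Made-up sample data: English, Russian, and Chinese text.
samples = ['hello, world!', 'Привет, мир!', '你好，世界！']
result = strip_ascii_only(samples)
print(result)
```

The non-Latin inputs come back as a lone space or an empty string, which matches the behaviour described above.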

Answer 1

You could use the unicode.translate() method:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)

You could also use r'\p{P}', which is supported by the regex module:

import regex as re

def remove_punctuation(text):
    return re.sub(ur"\p{P}+", "", text)

Answer 2

If you want to use J.F. Sebastian's solution in Python 3:

import unicodedata
import sys

tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))
def remove_punctuation(text):
    return text.translate(tbl)
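As a usage sketch for the Python 3 version above (the sample sentence is made up): the table maps every code point whose general category starts with 'P' to None, and str.translate() deletes characters that map to None.

```python
import sys
import unicodedata

# Map each punctuation code point (category P*) to None so that
# str.translate() removes it.
tbl = dict.fromkeys(i for i in range(sys.maxunicode)
                    if unicodedata.category(chr(i)).startswith('P'))

def remove_punctuation(text):
    return text.translate(tbl)

# Made-up sample mixing Spanish and Chinese punctuation.
cleaned = remove_punctuation('¿Qué tal? «Bien», dijo… 你好，世界！')
print(cleaned)
```

Building the table scans the whole Unicode range once, so it is probably best created at module level and reused rather than rebuilt on every call.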

Answer 3

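You could iterate through the string and use the unicodedata module's category function to determine whether each character is punctuation.

For the possible outputs of category, see unicode.org's documentation on General Category Values.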

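from unicodedata import category as cat

def strip_punctuation(word):
    # Keep only the characters whose category is not punctuation (P*).
    return "".join(char for char in word if not cat(char).startswith('P'))

# `input` is the list of strings from the question.
filtered = [strip_punctuation(word) for word in input]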

Additionally, make sure that you are handling encodings and types correctly. This presentation is a good place to start: http://bit.ly/unipain
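To see what category returns in practice, a small sketch with hand-picked sample characters can help. It also shows a possible surprise: characters such as '$' and '+' are classified as symbols (S*), not punctuation (P*), so a filter based on 'P' leaves them in place.

```python
import unicodedata

# Hand-picked sample characters; the two-letter codes are general categories.
samples = ['a', '7', ' ', ',', '-', '(', '_', '$', '+', '。']
cats = {ch: unicodedata.category(ch) for ch in samples}

for ch, code in cats.items():
    print(repr(ch), code)
```

If symbols should be removed as well, the check could be widened to categories starting with 'S', though that is a different requirement from the one in the question.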

Answer 4

A slightly shorter version based on Daenyth's answer:

import unicodedata

def strip_punctuation(text):
    """
    >>> strip_punctuation(u'something')
    u'something'

    >>> strip_punctuation(u'something.,:else really')
    u'somethingelse really'
    """
    punctuation_cats = set(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])
    return ''.join(x for x in text
                   if unicodedata.category(x) not in punctuation_cats)

input_data = [u'something', u'something, else', u'nothing.']
without_punctuation = map(strip_punctuation, input_data)
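The snippet above is written for Python 2. A minimal Python 3 sketch of the same idea follows; the constant name `PUNCTUATION_CATS` is introduced here for illustration. In Python 3, map() returns a lazy iterator, so the result is wrapped in list() to get an actual list.

```python
import unicodedata

# The seven punctuation general categories, built once at import time.
PUNCTUATION_CATS = frozenset(['Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'])

def strip_punctuation(text):
    # Keep every character whose category is not a punctuation category.
    return ''.join(ch for ch in text
                   if unicodedata.category(ch) not in PUNCTUATION_CATS)

input_data = ['something', 'something, else', 'nothing.']
without_punctuation = list(map(strip_punctuation, input_data))
print(without_punctuation)
```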

Answer 1: score 73, accepted, 2012-06-16 20:11:54
Answer 2: score 19, 2014-02-07 19:14:58
Answer 3: score 8, 2012-06-16 19:34:19
Answer 4: score 7, 2012-06-16 19:55:19
