How to find Chinese or Japanese characters in a string in Python?

Question

For example:

str = 'sdf344asfasf天地方益3権sdfsdf'

Add () around the Chinese and Japanese characters:

strAfterConvert = 'sdf344asfasf(天地方益)3(権)sdfsdf'

Answer 1

As a start, you can check whether the character is in one of the following Unicode blocks:

Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F

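After that, all you need to do is iterate through the string, check whether each character is Chinese, Japanese or Korean (CJK), and append accordingly: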

# -*- coding:utf-8 -*-
ranges = [
  {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # compatibility ideographs
  {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # compatibility ideographs
  {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # compatibility ideographs
  {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # compatibility ideographs
  {'from': ord(u'\u3040'), 'to': ord(u'\u309f')},         # Japanese Hiragana
  {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
  {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # cjk radicals supplement
  {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},         # CJK Unified Ideographs
  {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},         # Extension A
  {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")}, # Extension B
  {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")}, # Extension C
  {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")}, # Extension D
  {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # Extension E, included as of Unicode 8.0
]

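def is_cjk(char):
  # True if the code point falls inside any of the ranges above
  return any(r["from"] <= ord(char) <= r["to"] for r in ranges)

def cjk_substrings(string):
  i = 0
  while i < len(string):
    if is_cjk(string[i]):
      start = i
      # the bounds check avoids an IndexError when the string ends with a CJK character
      while i < len(string) and is_cjk(string[i]): i += 1
      yield string[start:i]
    i += 1

# a unicode literal works on both Python 2 and Python 3
string = u"sdf344asfasf天地方益3権sdfsdf"
for sub in cjk_substrings(string):
  string = string.replace(sub, "(" + sub + ")")
print(string)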

The above prints:

sdf344asfasf(天地方益)3(権)sdfsdf
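
One caveat with the replace loop above: str.replace substitutes every occurrence of a run, so if the same CJK run appears more than once in the input (for example "天a天"), each occurrence may end up wrapped more than once. A minimal single-pass sketch using itertools.groupby avoids this. The reduced RANGES table and the name wrap_cjk are illustrative assumptions, not part of the original answer.

```python
from itertools import groupby

# Reduced, illustrative range table (assumption): Hiragana, Katakana,
# CJK Unified Ideographs Extension A, and CJK Unified Ideographs.
RANGES = [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3400, 0x4DBF), (0x4E00, 0x9FFF)]

def is_cjk(char):
    cp = ord(char)
    return any(lo <= cp <= hi for lo, hi in RANGES)

def wrap_cjk(text):
    # groupby splits the text into maximal runs of CJK / non-CJK characters,
    # so every run is visited exactly once and is never re-wrapped.
    parts = []
    for cjk, run in groupby(text, key=is_cjk):
        chunk = "".join(run)
        parts.append("(" + chunk + ")" if cjk else chunk)
    return "".join(parts)

print(wrap_cjk("sdf344asfasf天地方益3権sdfsdf"))  # sdf344asfasf(天地方益)3(権)sdfsdf
print(wrap_cjk("天a天"))                          # (天)a(天)
```

To use the full table from the answer, swap RANGES for the complete list of blocks.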

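To be future-proof, you might want to keep an eye out for CJK Unified Ideographs Extension E. It will ship with Unicode 8.0, which is scheduled for release in June 2015. I have added it to the ranges, but you should not include it until Unicode 8.0 is released.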
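
To see which Unicode version a given Python interpreter's character database follows, the standard library exposes unicodedata.unidata_version. The sketch below only adds the Extension E range when that version is at least 8.0. The helper name extension_e_available and the shortened ranges list are illustrative assumptions.

```python
import unicodedata

# CJK Unified Ideographs Extension E, as listed in the ranges above
EXTENSION_E = (0x2B820, 0x2CEAF)

def extension_e_available():
    # unidata_version is a string such as "8.0.0"; compare it numerically
    major, minor = (int(p) for p in unicodedata.unidata_version.split(".")[:2])
    return (major, minor) >= (8, 0)

ranges = [(0x4E00, 0x9FFF)]  # only the basic block, for brevity
if extension_e_available():
    ranges.append(EXTENSION_E)

print(unicodedata.unidata_version, ranges)
```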

[EDIT]

Added CJK compatibility ideographs, Japanese Kana and CJK radicals.

Answer 2

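You can do the edit with the regex package, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package: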

import regex as re

pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)

input = u'sdf344asfasf天地方益3権sdfsdf'
output = pattern.sub(r'(\1)', input)
print(output)  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf

You should adapt the \p{Is...} sequence to the character scripts/blocks that you consider to be "Chinese or Japanese".

Answer 3

From one of the bleeding-edge branches of NLTK, inspired by the Moses machine translation toolkit:

def is_cjk(character):
    """
    Checks whether character is CJK.

        >>> is_cjk(u'\u33fe')
        True
        >>> is_cjk(u'\uFE5F')
        False

    :param character: The character that needs to be checked.
    :type character: char
    :return: bool
    """
    return any([start <= ord(character) <= end for start, end in 
                [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), 
                 (63744, 64255), (65072, 65103), (65381, 65500), 
                 (131072, 196607)]
                ])

For details on the ord() numbers:

class CJKChars(object):
    """
    An object that enumerates the code points of the CJK characters as listed on
    http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane

    This is a Python port of the CJK code point enumerations of Moses tokenizer:
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
    """
    # Hangul Jamo (1100–11FF)
    Hangul_Jamo = (4352, 4607) # (ord(u"\u1100"), ord(u"\u11ff"))

    # CJK Radicals Supplement (2E80–2EFF)
    # Kangxi Radicals (2F00–2FDF)
    # Ideographic Description Characters (2FF0–2FFF)
    # CJK Symbols and Punctuation (3000–303F)
    # Hiragana (3040–309F)
    # Katakana (30A0–30FF)
    # Bopomofo (3100–312F)
    # Hangul Compatibility Jamo (3130–318F)
    # Kanbun (3190–319F)
    # Bopomofo Extended (31A0–31BF)
    # CJK Strokes (31C0–31EF)
    # Katakana Phonetic Extensions (31F0–31FF)
    # Enclosed CJK Letters and Months (3200–32FF)
    # CJK Compatibility (3300–33FF)
    # CJK Unified Ideographs Extension A (3400–4DBF)
    # Yijing Hexagram Symbols (4DC0–4DFF)
    # CJK Unified Ideographs (4E00–9FFF)
    # Yi Syllables (A000–A48F)
    # Yi Radicals (A490–A4CF)
    CJK_Radicals = (11904, 42191) # (ord(u"\u2e80"), ord(u"\ua4cf"))

    # Phags-pa (A840–A87F)
    Phags_Pa = (43072, 43135) # (ord(u"\ua840"), ord(u"\ua87f"))

    # Hangul Syllables (AC00–D7AF)
    Hangul_Syllables = (44032, 55215) # (ord(u"\uAC00"), ord(u"\uD7AF"))

    # CJK Compatibility Ideographs (F900–FAFF)
    CJK_Compatibility_Ideographs = (63744, 64255) # (ord(u"\uF900"), ord(u"\uFAFF"))

    # CJK Compatibility Forms (FE30–FE4F)
    CJK_Compatibility_Forms = (65072, 65103) # (ord(u"\uFE30"), ord(u"\uFE4F"))

    # Range U+FF65–FFDC encodes halfwidth forms, of Katakana and Hangul characters
    Katakana_Hangul_Halfwidth = (65381, 65500) # (ord(u"\uFF65"), ord(u"\uFFDC"))

    # Supplementary Ideographic Plane 20000–2FFFF
    Supplementary_Ideographic_Plane = (131072, 196607) # (ord(u"\U00020000"), ord(u"\U0002FFFF"))

    ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables, 
              CJK_Compatibility_Ideographs, CJK_Compatibility_Forms, 
              Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]
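
The decimal bounds can be hard to relate to the block names in the comments. As a quick sanity check, the sketch below prints a few of the same tuples in the usual U+XXXX notation. The RANGES list is copied from the class above, and the helper name fmt is illustrative.

```python
# A few of the (start, end) tuples from CJKChars above
RANGES = [(4352, 4607), (11904, 42191), (44032, 55215), (131072, 196607)]

def fmt(code_point):
    # Format an integer code point in the conventional U+XXXX notation
    return "U+{:04X}".format(code_point)

for start, end in RANGES:
    print(fmt(start), "-", fmt(end))
```

Note that the second tuple spans everything from U+2E80 to U+A4CF, so it also covers the punctuation, Yijing and Yi blocks listed in the comments, which may be broader than "Chinese or Japanese" for some uses.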

Combining the is_cjk() from this answer with @EvenLisle's substring answer:

>>> from nltk.tokenize.util import is_cjk
>>> text = u'sdf344asfasf天地方益3権sdfsdf'
>>> [1 if is_cjk(ch) else 0 for ch in text]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
>>> def cjk_substrings(string):
...     i = 0
...     while i<len(string):
...         if is_cjk(string[i]):
...             start = i
...             while i < len(string) and is_cjk(string[i]): i += 1
...             yield string[start:i]
...         i += 1
... 
>>> string = u"sdf344asfasf天地方益3権sdfsdf"
>>> for sub in cjk_substrings(string):
...     string = string.replace(sub, "(" + sub + ")")
... 
>>> string
'sdf344asfasf(天地方益)3(権)sdfsdf'
>>> print(string)
sdf344asfasf(天地方益)3(権)sdfsdf

Answer 4

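If you can't use the regex module, which provides access to the IsKatakana and IsHan properties as shown in @一二三's answer, you could use the character ranges from @EvenLisle's answer with the stdlib's re module: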

>>> import re
>>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
sdf344asfasf(天地方益)3(権)sdfsdf

Beware of known issues.
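
The long literal character class in the re.sub call above is easy to get wrong by hand; for instance, it leaves out the Hiragana range (U+3040 to U+309F) that the range table it borrows from includes. A minimal sketch, assuming Python 3, that builds the class from (start, end) pairs instead. The RANGES subset and the name build_cjk_pattern are illustrative.

```python
import re

# (start, end) code point pairs: an illustrative subset of the ranges above,
# with Hiragana (U+3040-U+309F) included.
RANGES = [(0x3040, 0x309F), (0x30A0, 0x30FF), (0x3400, 0x4DBF),
          (0x4E00, 0x9FFF), (0x20000, 0x2A6DF)]

def build_cjk_pattern(ranges):
    # chr() turns each bound into a character; "a-b" inside [...] is a range
    char_class = "".join("{}-{}".format(chr(lo), chr(hi)) for lo, hi in ranges)
    return re.compile("([" + char_class + "]+)")

CJK_RUN = build_cjk_pattern(RANGES)
print(CJK_RUN.sub(r"(\1)", "sdf344asfasf天地方益3権sdfsdf"))
```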

You could also check the Unicode category:

>>> import unicodedata
>>> unicodedata.category(u'天')
'Lo'
>>> unicodedata.category(u's')
'Ll'
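
Category alone is coarse: 'Lo' also covers letters from many other scripts (Hebrew, for example), so it cannot identify CJK text by itself. One stdlib-only refinement, sketched here as an assumption rather than something the answer proposes, is to look at the character name returned by unicodedata.name. The PREFIXES tuple and the name looks_cjk are illustrative choices.

```python
import unicodedata

# Name prefixes treated as "Chinese or Japanese" here (an illustrative choice)
PREFIXES = ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH",
            "HIRAGANA", "KATAKANA")

def looks_cjk(char):
    # the second argument is the default returned for unnamed code points
    return unicodedata.name(char, "").startswith(PREFIXES)

print(unicodedata.category("\u05d0"))  # Hebrew letter Alef, also category 'Lo'
print([ch for ch in "sdf天あア3権" if looks_cjk(ch)])
```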

Answer 1
21 2015-05-06 07:51:14

Answer 2
12 2015-05-07 12:18:11

Answer 3
5 2016-05-18 22:29:16

Answer 4
2 2015-05-08 05:01:22