Python regex issue with Unicode (Japanese) characters

Question

I want to remove part of the string below (the first 加護亜依, which precedes "Kago Ai"). The string is stored in the variable oldString:

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

I'm using the following regex in Python:

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)

When I output newString, nothing has been removed.

Answer 1

You can use the following snippet to solve the issue:

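#!/usr/bin/python
# -*- coding: utf-8 -*-
import re

# Python 2: unicode literals keep both the text and the pattern as Unicode.
text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# One or more Japanese characters plus the following space, but only when
# Latin letters/spaces and then an en dash come next.
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Encode explicitly before printing.
print match.encode("UTF-8")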

See IDEONE demo

Besides the # -*- coding: utf-8 -*- declaration, I have added @nhahtdh's character class to detect Japanese symbols.
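A likely reason the original pattern removes nothing (an inference, not something stated in the thread): with re.UNICODE, kanji count as word characters, so [\W]+ cannot match them. The sketch below checks this in Python 3, where str patterns are Unicode-aware by default; the variable names are illustrative only.

```python
import re

kanji = "加護亜依"

# Python 3 str patterns are Unicode-aware by default (comparable to
# re.UNICODE in Python 2), so kanji are word characters here.
unicode_word = re.fullmatch(r"\w+", kanji) is not None
unicode_nonword = re.search(r"\W", kanji) is not None

# With re.ASCII only [a-zA-Z0-9_] are word characters, so kanji match \W.
ascii_nonword = re.fullmatch(r"\W+", kanji, re.ASCII) is not None

print(unicode_word, unicode_nonword, ascii_nonword)  # True False True
```

This is why an explicit character class for the Japanese ranges is used instead of \W.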

Note that the result has to be encoded as a UTF-8 string "manually" before printing, since Python 2 needs to be "reminded" all the time that we are working with Unicode.
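For readers on Python 3, here is a minimal sketch of the same substitution. It assumes Python 3, where str is already Unicode, so neither the u prefix nor the manual encode step is needed. The names JAPANESE, pattern, old_string and new_string are illustrative and not part of the original answer.

```python
import re

# Same character ranges as in the answer above: CJK punctuation, hiragana,
# katakana, full-width/half-width forms, and CJK ideographs.
JAPANESE = "\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf"

# Japanese run plus one space, only if Latin letters/spaces and an en dash follow.
pattern = re.compile("[" + JAPANESE + "]+ (?=[A-Za-z ]+–)")

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"
new_string = pattern.sub("", old_string)

print(new_string)  # [DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY
```

The second 加護亜依 is left alone because the text after it ("vs. FRIDAY") does not satisfy the lookahead.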

Answer 2

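I think you should use a regular expression like this one (note that Python's built-in re module does not support \p{...} property classes; the third-party regex module does):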

([\p{Hiragana}\p{Katakana}\p{Han}]+)

Please also refer to this documentation.

EDIT: I also tested it here.
