Removing parentheses and everything in them with Regex

I'm having a bit of trouble with some code I'm working through. I have transcripts (txt files) for a few Japanese anime, from which I want to remove everything except the spoken lines (Japanese sentences) so that I can run some NLP experiments.

I've managed a good bit of the cleaning, but I'm stuck on parentheses. Most of the elements in my list start with a character's name inside parentheses, e.g. (Armin). I want to remove these, but none of the regex code I've found online seems to work.

Here's a snippet of the list I'm working with:

['（アルミン）その日', '人類は思い出した', '（アルミン）奴らに', '支配されていた恐怖を', '（アルミン）鳥籠の中に', 'とらわれていた―', '屈辱を', '（キース）総員', '戦闘用意!', '目標は1体だ', '必ず仕留め―', 'ここを', '我々', '人類', '最初の壁外拠点とする!', '（エルヴィン）あっ…', '目標接近!', '（キース）訓練どおり5つに分かれろ!', '囮は我々が引き受ける!', '全攻撃班', '立体機動に移れ!', '（エルヴィン）全方向から', '同時に叩くぞ!', '（モーゼス）やあーっ!']

I've tried the following code (it's as close as I could get):

import re

no_parentheses = []

for line in mylist:
    if '(' in line:
        line = re.sub(r'\(.*\)', '', line)
        no_parentheses.append(line)
    else:
        no_parentheses.append(line)

But when I view the results, those pesky parentheses remain in my list, mockingly.

Could anyone offer suggestions to resolve this issue?

Thanks again!

The brackets used in the text are full-width brackets. Specifically, U+FF08 FULLWIDTH LEFT PARENTHESIS, and U+FF09 FULLWIDTH RIGHT PARENTHESIS.
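
One way to confirm which characters a line really contains is to print each character's code point and Unicode name. This is only a small sketch: the describe helper and the sample string are made up for illustration and are not part of the original code.

```python
import unicodedata

def describe(ch):
    # Hypothetical helper: returns the code point and official Unicode name.
    return f'U+{ord(ch):04X} {unicodedata.name(ch)}'

sample = '（アルミン）その日'  # assumed sample line with full-width parentheses

print(describe(sample[0]))  # opening bracket of the sample
print(describe(sample[5]))  # closing bracket of the sample
print(describe('('))        # the ASCII parenthesis, for comparison
```

If the first two names start with FULLWIDTH, a pattern written with ASCII \( and \) will never match them.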

Your regex should use full-width brackets as well.

line = re.sub('（.*）', '', line)
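
If the transcripts might mix ASCII and full-width parentheses, a slightly more defensive variant could accept both. The following is a sketch under that assumption, not the only way to do it; PAREN_BLOCK and strip_parens are illustrative names, and the three-item list is a shortened stand-in for the real data.

```python
import re

# Matches an opening parenthesis (full-width or ASCII), then any run of
# characters that are not a closing parenthesis, then a closing parenthesis.
# The negated class stops at the first closing bracket, so the match cannot
# run on to a later bracket pair the way a greedy .* can.
PAREN_BLOCK = re.compile(r'[（(][^）)]*[）)]')

def strip_parens(line):
    # Hypothetical helper: removes every parenthesised block from the line.
    return PAREN_BLOCK.sub('', line)

mylist = ['（アルミン）その日', '人類は思い出した', '(キース)総員']
no_parentheses = [strip_parens(line) for line in mylist]
print(no_parentheses)  # ['その日', '人類は思い出した', '総員']
```

Lines without parentheses pass through unchanged, so the if/else check around re.sub is no longer needed.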

