
Get a part of a token after a specific character

I want to extract part of each token in a text file. So far I have written the code below:

from collections import Counter
import re

freq_dist = set()
words = re.findall(r'[\w+]+', open('output.txt').read())
freq_dist = Counter(words).most_common(10)

print(freq_dist)

My output.txt is as follows:

Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun imha+Noun silah+Noun+A3pl+P3sg+Gen küresel+Adj düzey+Noun+Loc olus+Verb+Caus+PastPart+P3sg tehdit+Noun+Gen boyut+Noun+P3sg karsi+Adj+P3sg+Loc ,+Punc tüm+Det ülke+Noun+A3pl+Gen yay+Verb+Pass+Inf2+Gen önle+Verb+Pass+Inf2+P3sg hedef+Noun+A3pl+P3sg+Acc paylas+Verb+PastPart+P3pl ,+Punc daha+Noun güven+Noun+With ve+Conj istikrar+Noun+With bir+Num dünya+Noun düzen+Noun+P3sg için+PostpPCGen birlik+Noun+Loc çaba+Noun göster+Verb+PastPart+P3pl bir+Num asama+Noun+Dat gel+Verb+Pass+Inf2+P3sg+Acc samimi+Adj ol+Verb+ByDoingSo arzula+Verb+Prog2+Cop .+Punc 
Ab+Noun ile+PostpPCNom gümrük+Noun Alan+Noun+P3sg+Loc+Rel kurumsal+Adj iliski+Noun+A3pl 
club+Noun toplanti+Noun+A3pl+P3sg 
Türkiye+Noun+Gen -+Punc At+Noun gümrük+Noun isbirlik+Noun+P3sg komite+Noun+P3sg ,+Punc Ankara+Noun Anlasma+Noun+P3sg+Gen 6+Num madde+Noun+P3sg uyar+Verb+When ortaklik+Noun rejim+Noun+P3sg+Gen uygula+Verb+Pass+Inf2+P3sg+Acc ve+Conj gelis+Verb+Inf2+P3sg+Acc sagla+Verb+Inf1 üzere+PostpPCNom ortaklik+Noun Konsey+Noun+P3sg+Gen 2+Num /+Punc 69+Num sayili+Adj karar+Noun+P3sg ile+Conj teknik+Noun komite+Noun mahiyet+Noun+P3sg+Loc kur+Verb+Pass+Narr+Cop .+Punc 
nispi+Adj 
nisbi+Adj 
görece+Adj+With 
izafi+Adj 
obur+Adj 

I want to get the part after the first + sign of each token and save these parts in a list. For example, from Türkiye+Noun I want Noun, from terörizm+Noun+Gen I want Noun+Gen, and from isbirlik+Noun+P3sg I want Noun+P3sg. After that I want to list them by count in descending order, i.e. how many times Noun or Noun+Gen appeared in the text.

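How about splitting your input on whitespace?

from collections import Counter

# split() with no argument splits on any run of whitespace, newlines included
with open('output.txt') as f:
    tokens = f.read().split()

# keep what follows the first '+'; skip any token without one
words = [token.split('+', 1)[1] for token in tokens if '+' in token]
freq_dist = Counter(words).most_common(10)

print(freq_dist)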

This would give you:

[('Noun', 16), ('Punc', 8), ('Adj', 8), ('Noun+P3sg', 6), ('Num', 5), ('Conj', 4), ('Noun+Gen', 3), ('Noun+P3sg+Gen', 3), ('Noun+Loc', 2), ('Verb+PastPart+P3pl', 2)]
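If you want every tag combination rather than only the top 10, calling most_common() with no argument returns all entries sorted by count, highest first. Here is a minimal sketch of that idea; the function name tag_counts and the short sample string are made up for illustration and are not part of your file:

```python
from collections import Counter


def tag_counts(text):
    """Count the part after the first '+' of each whitespace-separated token."""
    # Tokens without a '+' are skipped so the [1] index can never fail.
    tags = [token.split('+', 1)[1] for token in text.split() if '+' in token]
    # most_common() with no argument returns every entry, highest count first.
    return Counter(tags).most_common()


# Hypothetical sample in the same format as output.txt
sample = "Türkiye+Noun ,+Punc terörizm+Noun+Gen ve+Conj kitle+Noun\nnispi+Adj \n"
print(tag_counts(sample))
```

In this sample Noun occurs twice and every other tag once, so Noun comes first in the result. To run it on the real data, pass the contents of output.txt to tag_counts instead of the sample string.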


 