How to exclude or remove specific parts in Python
I'd like to analyze the chat log below to find the most frequently used words. The only parts I need are the messages that come after the [time] stamp, such as [01:25]. How should I change my code?
+++
John, Max, Tracey with SuperChats
Date Saved : 2019-11-22 19:29:46
--------------- Tuesday, 9 July 2019 ---------------
[John] [00:27] Hi
[Max] [01:25] No
[Tracey] [02:31] Anybody has some bananas?
[Max] [04:39] No
[John] [20:58] Oh my goodness
--------------- Wednesday, 10 July 2019 ---------------
[Tracey] [14:33] Anybody has a mug?
[Max] [14:45] No
[John] [14:45] Oh my buddha
+++
from collections import Counter
import re
wordDict = Counter()
with open(r'C:chatlog.txt', 'r', encoding='utf-8') as f:
    chatline = f.readlines()
    chatline = [x.strip() for x in chatline]
    chatline = [x for x in chatline if x]
    for count in range(len(chatline)):
        if count < 2:
            continue
        elif '---------------' in chatline:
            continue
        re.split(r"\[\d{2}[:]\d{2}\]", x for x in chatline) #Maybe need to modify this part
print('Word', 'Frequency')
for word, freq in wordDict.most_common(50):
    print('{0:10s} : {1:3d}'.format(word, freq))
You can use the pattern ^.*?\[\d\d:\d\d\]\s*(.+)$ to capture the text that follows the timestamp on the relevant lines. I'd work line by line instead of slurping the whole file with f.readlines(), which isn't memory-friendly. There should be no need to handle anything else specially, since the timestamp is quite distinctive and the header and date-separator lines do not contain one. If you wish, it wouldn't hurt to also test for the brackets around the username at the beginning of the line.
import re
from collections import Counter
words = []
with open("chatlog.txt", "r", encoding="utf-8") as f:
    for line in f:
        m = re.search(r"^.*?\[\d\d:\d\d\]\s*(.+)$", line)
        if m:
            words.extend(re.split(r"\s+", m.group(1)))
for word, freq in Counter(words).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))
Output:
No         :   3
Anybody    :   2
has        :   2
Oh         :   2
my         :   2
Hi         :   1
some       :   1
bananas?   :   1
goodness   :   1
a          :   1
mug?       :   1
buddha     :   1
As can be seen from tokens such as "bananas?" and "mug?", stripping punctuation might also be worth doing. You could use something like:
# ...
        if m:
            no_punc = re.split(r"\W+", m.group(1))
            words.extend([x for x in no_punc if x])
# ...
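For reference, here is a minimal self-contained sketch that puts these pieces together: a stricter pattern that also requires the bracketed username at the start of the line, plus the punctuation-stripping split. The names sample_lines, MESSAGE_RE and count_words are made up for illustration, and the sample list simply stands in for the lines of chatlog.txt so the snippet runs without a file.

```python
import re
from collections import Counter

# Hypothetical in-memory stand-in for the lines of chatlog.txt.
sample_lines = [
    "John, Max, Tracey with SuperChats",
    "Date Saved : 2019-11-22 19:29:46",
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[John] [00:27] Hi",
    "[Max] [01:25] No",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[Max] [04:39] No",
    "[John] [20:58] Oh my goodness",
    "--------------- Wednesday, 10 July 2019 ---------------",
    "[Tracey] [14:33] Anybody has a mug?",
    "[Max] [14:45] No",
    "[John] [14:45] Oh my buddha",
]

# Require "[username] [hh:mm]" at the start of the line, then capture the message.
MESSAGE_RE = re.compile(r"^\[[^\]]+\]\s+\[\d\d:\d\d\]\s*(.+)$")


def count_words(lines):
    """Count words in chat messages, skipping header and separator lines."""
    counts = Counter()
    for line in lines:
        m = MESSAGE_RE.match(line.strip())
        if m:
            # \W+ splits on runs of non-word characters; drop empty strings.
            counts.update(w for w in re.split(r"\W+", m.group(1)) if w)
    return counts


word_counts = count_words(sample_lines)
for word, freq in word_counts.most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))
```

With a real file, the same function should accept the open file object directly, since iterating over it yields one line at a time. Note that \W+ also splits contractions such as "don't" into two tokens, which may or may not be what you want.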
Try using split like this:
lines = ["[Tracey] [02:31] Anybody has some bananas?", "[John] [20:58] Oh my goodness"]
for i in lines:
    print(i.split(' ')[2:])
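A rough sketch of how this split-based idea could feed a word count is shown here. It assumes usernames contain no spaces and that a single space separates the three fields; the header and date-separator lines have to be filtered out explicitly, because slicing them with [2:] would otherwise return fragments of the date. The variable split_counts is a name chosen for this example.

```python
from collections import Counter

lines = [
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[John] [20:58] Oh my goodness",
]

split_counts = Counter()
for line in lines:
    # Skip anything that is not a "[user] [time] message" line.
    if not line.startswith("["):
        continue
    # maxsplit=2 yields at most three parts: user, time, message.
    parts = line.split(" ", 2)
    if len(parts) == 3:
        split_counts.update(parts[2].split())

print(split_counts.most_common())
```

Unlike the regular-expression version, this keeps punctuation attached to words ("bananas?") and depends on the exact field layout, so it is best suited to logs whose format is known to be regular.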