
How to exclude or remove the specific parts in Python

I'd like to analyze the chat log below to find the most frequently used words. The only parts I need are therefore the message text that follows each [time] stamp, such as [01:25]. How should I change my code?

+++

John, Max, Tracey with SuperChats

Date Saved : 2019-11-22 19:29:46

--------------- Tuesday, 9 July 2019 ---------------

[John] [00:27] Hi

[Max] [01:25] No

[Tracey] [02:31] Anybody has some bananas?

[Max] [04:39] No

[John] [20:58] Oh my goodness

--------------- Wednesday, 10 July 2019 ---------------

[Tracey] [14:33] Anybody has a mug?

[Max] [14:45] No

[John] [14:45] Oh my buddha

+++
from collections import Counter
import re

wordDict = Counter()
with open(r'C:chatlog.txt', 'r', encoding='utf-8') as f:
    chatline = f.readlines()
    chatline = [x.strip() for x in chatline]
    chatline = [x for x in chatline if x]

    for count in range(len(chatline)):
        if count < 2:
            continue
        elif '---------------' in chatline:
            continue

        re.split(r"\[\d{2}[:]\d{2}\]", x for x in chatline) #Maybe need to modify this part

print('Word', 'Frequency')
for word, freq in wordDict.most_common(50):
    print('{0:10s} : {1:3d}'.format(word, freq))

You can use the pattern /^.*?\[\d\d:\d\d\]\s*(.+)$/ to capture the text that follows the timestamp on the relevant lines. I'd work line by line instead of slurping the file with f.readlines(), which isn't memory-friendly. There should be no need to handle anything else specially, since the timestamp format is distinctive enough that header and date-separator lines won't match. That said, it wouldn't hurt to also test for the brackets around the username at the beginning of the line if you wish.

import re
from collections import Counter

words = []

with open("chatlog.txt", "r", encoding="utf-8") as f:
    for line in f:
        m = re.search(r"^.*?\[\d\d:\d\d\]\s*(.+)$", line)

        if m:
            words.extend(re.split(r"\s+", m.group(1)))

for word, freq in Counter(words).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))

Output:

No         :   3
Anybody    :   2
has        :   2
Oh         :   2
my         :   2
Hi         :   1
some       :   1
bananas?   :   1
goodness   :   1
a          :   1
mug?       :   1
buddha     :   1

As the output shows (bananas? and mug? keep their question marks), stripping punctuation might also be worth doing. You could use something like:

# ...
if m:
    no_punc = re.split(r"\W+", m.group(1))
    words.extend([x for x in no_punc if x])
# ...
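
Putting the two suggestions together, here is a minimal sketch rather than a definitive implementation. It requires the bracketed username at the start of the line before the timestamp, and it strips punctuation before counting. The helper name count_words, the constant MESSAGE_RE, and the inline sample lines are hypothetical, introduced only for illustration. It also assumes that usernames never contain a closing bracket.

```python
import re
from collections import Counter

# Require "[name] [hh:mm]" at the start of the line and capture the message.
# Assumption: usernames never contain a closing bracket.
MESSAGE_RE = re.compile(r"^\[[^\]]+\]\s+\[\d\d:\d\d\]\s*(.+)$")


def count_words(lines):
    """Count words in the message part of each chat line (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        m = MESSAGE_RE.search(line.strip())
        if m:
            # \W+ splits on runs of non-word characters, which drops punctuation;
            # the filter removes empty strings left at the edges of the split.
            counts.update(w for w in re.split(r"\W+", m.group(1)) if w)
    return counts


# Sample lines taken from the chat log in the question.
sample = [
    "John, Max, Tracey with SuperChats",
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[Max] [04:39] No",
    "[Tracey] [14:33] Anybody has a mug?",
]

for word, freq in count_words(sample).most_common(3):
    print("{0:10s} : {1:3d}".format(word, freq))
```

Because count_words accepts any iterable of strings, an open file object can be passed in directly, which keeps the line-by-line reading recommended above.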

Try using split like this:

lines = ["[Tracey] [02:31] Anybody has some bananas?","[John] [20:58] Oh my goodness"]
for i in lines:
    print(i.split(' ')[2:])
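
If you go the split route, passing a maxsplit of 2 keeps the message in one piece, which may be easier to feed into a Counter. This is only a sketch under two assumptions: usernames contain no spaces (a name such as [John Smith] would shift the fields), and a bracketed second field is a good enough sign that the line is a chat message. The variable names are illustrative.

```python
from collections import Counter

lines = [
    "John, Max, Tracey with SuperChats",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[John] [20:58] Oh my goodness",
    "[John] [14:45] Oh my buddha",
]

counts = Counter()
for line in lines:
    # Split at most twice: "[name]", "[hh:mm]", and the rest of the message.
    parts = line.split(" ", 2)
    # Skip header and separator lines, whose second field is not bracketed.
    if len(parts) == 3 and parts[1].startswith("[") and parts[1].endswith("]"):
        counts.update(parts[2].split())

print(counts.most_common(2))
```

Unlike the regular-expression approach, this version leaves punctuation attached to the words, so bananas? would still be counted with its question mark.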
