Selecting text using multiple splits

I've started to learn Python and am stuck on an assignment about manipulating text data. Here is an example of the text lines I need to manipulate:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008

I need to extract the hour from each line (in this case 09) and then find the most common hours at which the emails were sent.

Basically, what I need to do is build a for loop that splits each line by colon:

split(':')

and then splits by whitespace:

split()

I've tried for hours, but can't seem to figure it out. This is what my code looks like so far:

name = raw_input("Enter file:")
if len(name) < 1 : name = "mbox-short.txt"
handle = open(name)
counts = dict()
lst = list()
temp = list()
for line in handle:
    if not "From " in line: continue
    words = line.split(':')  
    for word in words:
        counts[word] = counts.get(word,0) + 1

for key, val in counts.items():
    lst.append( (val, key) )
lst.sort(reverse = True)

for val, key in lst:
    print key, val

The code above only does one split, but I've kept trying different ways to split the text a second time. I keep getting an attribute error saying "list object has no attribute split". I would appreciate any help with this. Thanks again.

First,

import re

Then replace

words = line.split(':')  
for word in words:
    counts[word] = counts.get(word,0) + 1

with

line = re.search("[0-9]{2}:[0-9]{2}:[0-9]{2}", line).group(0)
words = line.split(':')
hour = words[0]
counts[hour] = counts.get(hour, 0) + 1

Input:

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 15:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 13:14:16 2008
From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008

Output:

09 4
12 3
15 1
13 1
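Putting the pieces together, here is a minimal Python 3 sketch of the regex approach (the question's code is Python 2). The function name `count_hours`, the compiled pattern `TIME_RE`, and the inline sample lines are illustrative assumptions, not part of the original answer. It also differs from the question in two small ways: it tests `line.startswith("From ")` rather than `"From " in line`, and it skips lines without a timestamp instead of failing on them.

```python
import re

# Matches an HH:MM:SS timestamp and captures the hour part.
TIME_RE = re.compile(r"(\d{2}):\d{2}:\d{2}")

def count_hours(lines):
    """Count 'From ' lines per hour (hypothetical helper)."""
    counts = {}
    for line in lines:
        if not line.startswith("From "):
            continue
        match = TIME_RE.search(line)
        if match is None:  # no timestamp on this line, so skip it
            continue
        hour = match.group(1)
        counts[hour] = counts.get(hour, 0) + 1
    return counts

# Sample lines stand in for the file; an open file handle works too.
sample = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
    "From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008",
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008",
]
hour_counts = count_hours(sample)

# Print hours sorted by count, highest first.
for hour, n in sorted(hour_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(hour, n)
```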

Using the same test file as Marcel Jacques Machado (saved here under the name input):

>>> from collections import Counter
>>> Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).items()
[('12', 3), ('09', 4), ('15', 1), ('13', 1)]

This shows that 09 occurs 4 times while 13 occurs only once.
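The chained expression `line.split(' ')[-2].split(':')[0]` may be easier to follow one step at a time. The sketch below (Python 3, with variable names chosen only for illustration) also shows where the "list object has no attribute split" error from the question probably comes from: `split` returns a list, so a second `split` has to be called on one element of that list, not on the list itself.

```python
line = "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"

fields = line.split(' ')        # split returns a list of strings
timestamp = fields[-2]          # pick one string: the HH:MM:SS field
hour = timestamp.split(':')[0]  # split that string again, keep the hour

print(hour)

# Calling split on the list itself is what raises the error.
try:
    fields.split(':')
except AttributeError as exc:
    print("AttributeError:", exc)
```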

If we want prettier output, we can do some formatting. This shows the hours and their counts, sorted from most common to least common:

>>> print('\n'.join('{} {}'.format(hh, n) for hh,n in Counter(line.split(' ')[-2].split(':')[0] for line in open('input')).most_common()))
09 4
12 3
15 1
13 1
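If the one-liner is hard to read, the same idea can be spread over two small functions. This is only a sketch in Python 3: `hour_of` and `most_common_hours` are hypothetical names, and the sample list stands in for the file so that the snippet runs on its own.

```python
from collections import Counter

def hour_of(line):
    """Return the hour field of a 'From ' line (hypothetical helper)."""
    # The timestamp is the second-to-last space-separated field.
    return line.split(' ')[-2].split(':')[0]

def most_common_hours(lines):
    """Return (hour, count) pairs, most frequent first."""
    return Counter(hour_of(line) for line in lines).most_common()

# Any iterable of lines works, including an open file handle.
sample = [
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
    "From stephen.marquard@uct.ac.za Sat Jan  5 12:14:16 2008\n",
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n",
]
for hh, n in most_common_hours(sample):
    print('{} {}'.format(hh, n))
```

To read from a file instead, pass the handle inside a `with open(...)` block so that the file is closed afterwards.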
