Reformatting text in Python
I am new to Python and am having great difficulty parsing a log file. Can you please help me understand how I can accomplish the tasks below in the most Pythonic way?
----- Log Entry 5 -----
Time : 2016-07-12 09:00:00
Animal : Brown Bear
Bird : White Owl
Fish : Salmon
----- Log Entry 6 -----
Time : 2016-07-12 09:00:00
Animal : Brown Bear
Bird : Parrot
Fish : Tuna
----- Log Entry 7 -----
Time : 2016-07-12 09:00:00
Animal : Lion
Bird : White Owl
Fish : Sword Fish
----- Log Entry 8 -----
Time : 2016-07-12 09:15:00
Animal : Lion
Bird : White Owl
Fish : Sword Fish
Desired Output 1: I would like to reformat the log to look like the below:
Time: 2016-07-12 09:00:00 Animal: Brown Bear Bird: White Owl Fish : Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear Bird: Parrot Fish : Tuna
Time: 2016-07-12 09:00:00 Animal: Lion Bird: White Owl Fish : Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion Bird: White Owl Fish : Sword Fish
Desired Output 2: Then I would like to be able to query a time stamp and get a summary of counts:
Time: 2016-07-12 09:00:00
Name: Count:
Brown Bear 2
Lion 1
White Owl 2
Parrot 1
Salmon 1
Tuna 1
Sword Fish 1
Time: 2016-07-12 09:15:00
Name: Count:
Lion 1
White Owl 1
Sword Fish 1
My Code So Far:
import os, sys, time, re, collections, subprocess

show_cmd = 'cat question | egrep -v \'^$|=|Log\' | awk \'ORS=NR%4?FS:RS\' | grep Time'
log = (subprocess.check_output(show_cmd, shell=True).decode('utf-8'))

def time_field():
    logRegex = re.compile(r'Time\s*:.*\d\d\d-\d\d-\d\d\s\d\d:\d\d')
    log_parsed = (logRegex.findall(log))
    a = (str(log_parsed).replace(' ', ''))
    a = ((' ' + a[1:-1]).split(','))
    for i in a:
        print(i)

time_field()
There are a lot of ways to do this. Personally, I would avoid regex here because it probably won't be more efficient and the expression becomes cumbersome and inflexible. Here is something I came up with:
class Entry:
    def __init__(self):
        self.time = None
        self.animal = None
        self.bird = None
        self.fish = None

    def __repr__(self):
        # Build the template first, then fill in the values and the width.
        fmt = "{0} {1} {2} {3}".format(
            "Time: {time: <{width}}",
            "Animal: {animal: <{width}}",
            "Bird: {bird: <{width}}",
            "Fish: {fish: <{width}}")
        return fmt.format(
            time=self.time, animal=self.animal,
            bird=self.bird, fish=self.fish,
            width=12)

    def __radd__(self, other):
        return self.__add__(other)

    def __add__(self, other):
        if type(other) == dict:
            # Count this entry's names in the given dict (modified in place).
            for i in [self.animal, self.bird, self.fish]:
                if i in other: other[i] += 1
                else: other[i] = 1
            return other
        elif type(other) == Entry:
            return self.__add__({}) + other
        else:
            return self.__add__({})

def parse_log(path):
    def extract(line):
        # Everything after the first colon, without surrounding whitespace.
        start = line.find(':') + 1
        return line[start:].strip()

    entries = []
    entry = None
    with open(path, 'r') as f:
        for line in f.readlines():
            if line.startswith('-----'):
                # A separator line starts a new entry.
                if entry: entries.append(entry)
                entry = Entry()
            elif line.startswith('Time'):
                entry.time = extract(line)
            elif line.startswith('Animal'):
                entry.animal = extract(line)
            elif line.startswith('Bird'):
                entry.bird = extract(line)
            elif line.startswith('Fish'):
                entry.fish = extract(line)
    # Do not forget the last entry in the file.
    if entry: entries.append(entry)
    return entries

def print_output_1(entries):
    for entry in entries:
        print(entry)

def print_output_2(entries, time):
    animals = sum([e for e in entries if e.time == time])
    print("Time: {0}".format(time))
    print("Name: Count:")
    for animal, count in animals.items():
        print("{animal: <{width}} {count}".format(
            animal=animal, count=count, width=12))

logPath = 'log.log'
time = '2016-07-12 09:15:00'
entries = parse_log(logPath)
print_output_1(entries)
print("")
print_output_2(entries, time)
The output (given that log.log matches the input you gave) is:
Time: 2016-07-12 09:00:00 Animal: Brown Bear Bird: White Owl Fish: Salmon
Time: 2016-07-12 09:00:00 Animal: Brown Bear Bird: Parrot Fish: Tuna
Time: 2016-07-12 09:00:00 Animal: Lion Bird: White Owl Fish: Sword Fish
Time: 2016-07-12 09:15:00 Animal: Lion Bird: White Owl Fish: Sword Fish
Time: 2016-07-12 09:15:00
Name: Count:
White Owl 1
Sword Fish 1
Lion 1
This code uses object-oriented programming to simplify the three tasks at hand: storing log entries, representing log entries in a specific format, and combining log entries according to a specific property.
First, note that the Entry object and its properties (self.time, self.animal, self.bird, self.fish) represent an entry in the log. Assuming that the information stored in its properties is correct, a method can be created to represent that information as a formatted string. The method __repr__() is called when Python wants the string representation of an object (print and str() fall back to it when no __str__() is defined), so it seemed like a good place to put this code. There is heavy use of the format function in this method, but it should be clear how it works after browsing the Python documentation on format.
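The nested replacement field used in __repr__() can be tried on its own. The following is only a minimal sketch; pad_field and its parameter names are made up for illustration and are not part of the code above.

```python
def pad_field(label, value, width=12):
    """Return 'label: value' with value left-aligned and space-padded."""
    # {width} inside the format spec is substituted first, so
    # "{value: <{width}}" behaves like "{value: <12}" when width is 12.
    return "{label}: {value: <{width}}".format(
        label=label, value=value, width=width)

line = " ".join([
    pad_field("Animal", "Lion"),
    pad_field("Bird", "White Owl"),
])
# repr() makes the trailing padding visible.
print(repr(line))
```

Note that the width is only a minimum: a value longer than 12 characters, such as the time stamp, is printed in full rather than truncated.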
A method for combining these entry objects is needed in order to get the second output you specified. This can be done in many ways, and the way I did it is not necessarily the best. I used the __radd__() and __add__() methods, which are called when the + operator is used on an object. By doing this, the code entry1 + entry2 or sum([entry1, entry2]) can be used to get the sum of the animals in both entries. The Entry class cannot be used to store the result of the sum, however, because it only has fixed fields and cannot hold an arbitrary number of names. Instead, I chose a dict object as the result of summing two Entry objects. In order to sum more than two Entry objects, Entry must also be able to sum with a dict object, because Entry + Entry + Entry results in dict + Entry.
The __add__() function first checks whether the object it is being added to is a dict. If so, it checks whether each of the names in the entry (animal, bird, and fish) already exists in the dict. If not, it adds the name as a key with a count of 1. Otherwise, it increments the value of that key.
__radd__() is similar to __add__(), except that Python calls it when the Entry is the right-hand operand and the left-hand operand does not know how to add an Entry. That is the case for dict + Entry and for 0 + Entry, which is how sum() starts. See the Python documentation for more information.
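To see when each method fires, here is a stripped-down sketch of the same idea. Tally is a hypothetical stand-in for Entry, and unlike the code above it copies the dict instead of modifying it in place.

```python
class Tally:
    """Hypothetical stand-in for Entry: holds a few names to be counted."""

    def __init__(self, *names):
        self.names = names

    def __add__(self, other):
        if isinstance(other, Tally):
            # Tally + Tally: count ourselves first, then let the resulting
            # dict meet the other Tally through its __radd__().
            return self.__add__({}) + other
        # Start from a copy of the dict, or from empty for non-dicts like 0.
        counts = dict(other) if isinstance(other, dict) else {}
        for name in self.names:
            counts[name] = counts.get(name, 0) + 1
        return counts

    def __radd__(self, other):
        # Reached for 0 + Tally and dict + Tally, where the left operand
        # does not know how to add a Tally.
        return self.__add__(other)


a = Tally("Lion", "White Owl")
b = Tally("Lion", "Parrot")
total = sum([a, b])  # evaluated as (0 + a) + b
print(total)
```

Here sum() first evaluates 0 + a, which goes through __radd__() and yields a dict, and then dict + b, which goes through __radd__() again.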
For the case where the other object is an Entry, code could have been written to gather all of the animals from each Entry object and create a dict from that information. But since there is already code to add an Entry to a dict, it is easier to first add one Entry to an empty dict and then add the resulting dict to the other Entry object.
For all other objects, the Entry simply returns the dict description of itself, that is, itself added to an empty dict.
Now all of the tools exist to accomplish the goals listed earlier. To get a string representation of an Entry that matches desired output 1, all that is needed is print(entry) or strrepr = str(entry). Desired output 2 takes a little more work, but it is simply a matter of summing all entries that have the same self.time property and then displaying the resulting dict.
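As noted, overloading + is not the only way to get the counts. A different technique, shown here only as a sketch, is collections.Counter from the standard library. The tuple layout and the count_names helper are assumptions made for this example; the sample data is copied from the log in the question.

```python
from collections import Counter

# Entries as plain (time, animal, bird, fish) tuples; this layout is an
# assumption for the sketch, not the Entry class used above.
entries = [
    ("2016-07-12 09:00:00", "Brown Bear", "White Owl", "Salmon"),
    ("2016-07-12 09:00:00", "Brown Bear", "Parrot", "Tuna"),
    ("2016-07-12 09:00:00", "Lion", "White Owl", "Sword Fish"),
    ("2016-07-12 09:15:00", "Lion", "White Owl", "Sword Fish"),
]

def count_names(entries, time):
    """Count every animal, bird and fish name logged at the given time."""
    counts = Counter()
    for entry in entries:
        if entry[0] == time:
            # update() adds one to the count of each name in the iterable.
            counts.update(entry[1:])
    return counts

counts = count_names(entries, "2016-07-12 09:00:00")
print("Time: 2016-07-12 09:00:00")
print("Name: Count:")
for name, count in counts.items():
    print("{0: <12} {1}".format(name, count))
```

A time stamp with no matching entries gives an empty Counter, so the loop simply prints nothing.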
The last part of the code not yet covered is the parsing of the log to create a list of Entry objects. The code simply walks through the log line by line and populates an Entry with the information. I think this is pretty straightforward, but feel free to ask questions if it does not make sense.
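One detail of the parsing worth seeing in isolation: the value must be cut at the first colon only, because the time stamp itself contains colons. A small sketch using str.partition; split_field is a made-up helper, not part of the code above.

```python
def split_field(line):
    """Split 'Key : value' at the first colon only."""
    key, sep, value = line.partition(":")
    if not sep:
        # No colon at all: a separator line or a blank line.
        return None
    return key.strip(), value.strip()

print(split_field("Time : 2016-07-12 09:00:00"))
print(split_field("----- Log Entry 5 -----"))
```

Returning the key as well would let the chain of startswith checks be replaced by a lookup on the key, if more fields are added to the log later.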