Parse text file line-by-line and organize data into nested dictionaries?

I have a file that runs a CMD trace route command whenever I get a ping timeout, and I print the output to a file. The output file is formatted as follows:

Sun 02/17/2019 13:20:44.27 PING ERROR 1

Tracing route to _____________ [IP_REDACTED]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  2    <1 ms    <1 ms    <1 ms  [IP_REDACTED]
  3     1 ms    <1 ms     1 ms  [IP_REDACTED]
  4     *        *        *     Request timed out.
  5     7 ms    10 ms     6 ms  [IP_REDACTED]
  6     8 ms     4 ms     6 ms  [IP_REDACTED]
  7     5 ms     7 ms     6 ms  [IP_REDACTED]

Trace complete.

Sun 02/17/2019 13:45:59.27 PING ERROR 2

Tracing route to _____________ [IP_REDACTED]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  2    <1 ms    <1 ms    <1 ms  [IP_REDACTED]
  3     1 ms    <1 ms     1 ms  [IP_REDACTED]
  4    23 ms     *        *     [IP_REDACTED]
  5     7 ms    10 ms     6 ms  [IP_REDACTED]
  6     8 ms     4 ms     6 ms  [IP_REDACTED]
  7     5 ms     7 ms     6 ms  [IP_REDACTED]

Trace complete.

Sun 02/17/2019 15:45:59.27 PING ERROR 3

Tracing route to _____________ [IP_REDACTED]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  2    <1 ms    <1 ms    <1 ms  [IP_REDACTED]
  3     1 ms    <1 ms     1 ms  [IP_REDACTED]
  4    23 ms    12 ms    11 ms  [IP_REDACTED]
  5     7 ms    10 ms     6 ms  [IP_REDACTED]
  6     8 ms     *        6 ms  [IP_REDACTED]
  7     5 ms     7 ms     6 ms  [IP_REDACTED]

Trace complete.

The first line has the timestamp of the trace route command. I want to use Python to graph the number of times hop #4 loses a packet (signaled by a '*' character) over time. Before I cross that bridge, I need to organize the data. I figured nested dictionaries are the way to go with Python.

I am new to Python and the syntax is confusing.

The code below shows my attempt. This is the basic flow I was going for:

  1. Look at a line in the file.

  2. If the line has the word "ERROR", save that line.

  3. Look at the other lines. If a line starts with "4", parse the data from step #2.

  4. Grab the month, day, hour, and minute and put those into separate variables.

  5. Create a nested dictionary with this data.

  6. Repeat these steps for all the errors in the file, adding to the dictionary from step #5.

  7. At the end, be able to print the data from any scope (e.g. number of errors in a day, or number of errors in a specific hour of a day).

For example, a dictionary might look like this:

day{ 6 : hour{ 2 : min{ 15 : 2, 30 : 1, 59 : 1 }, 9 : min{ 10: 1 }}}

There were 4 errors in hour 2 of day 6. These errors happened in minutes 15, 30, and 59.

day_d = {}

with open("2019-02-17_11-54-AM.log", "r") as fo:

    for line in fo:
        list = line.strip() # Expected: each index in list is a word
        if list.count('ERROR'):
            # Save the line to parse if trace route reports
            # bad data on hop 4
            lineToParse = line

        if "4" in list[0]:
            # We found the line that starts with "4"
            if "*" in list[1] or "*" in list[2] or "*" in list[3]:
                # We should parse the data in lineToParse

                # Expected: lineToParse[1] = "02/17/2019"
                word  = lineToParse[1].split("/")
                month = word[0] # I don't care about month
                day   = word[1]
                year  = word[2] # I don't care about year

                # Expected: lineToParse[2] == "13:20:44.27"
                word = lineToParse[2].split(":")
                hour = word[0]
                min  = word[1]
                sec  = word[2] # I don't care about seconds


                # Keep track of number occurances in min
                if day in day_d:
                    if hour in day_d[day]:
                        if min in day_d[day[hour]]
                            day_d[day[hour[min]]] += 1
                        else:
                            day_d[day[hour[min]]] = 1
                    else:
                        min_d = { min : 1 }
                        day_d[day[hour]] = min_d
                else:
                    min_d = { min : 1 }
                    hour_d = { hour : min_d }
                    day_d[day] = hour_d


#Print number of occurances in hour "12" of day "01"
hourCounter = 0;
if "01" in day_d:
    if "12" in day:
        day["12"] = hour_d
        for min in hour_d:
            hourCounter += int(hour_d[min], 10) # Convert string to base 10 int
print(hourCounter)

EDIT: After reviewing Gnudiff's reply, I was able to accomplish what I wanted to do. My code is as follows:

from matplotlib import pyplot as plt
from matplotlib import style

style.use('ggplot')

from datetime import datetime as DT

ping_errors = dict()
data = dict()

with open("2019-02-17_02-41-PM.log", "r") as fo:
    for line in fo:
        if 'ERROR' in line: # A tracert printout will follow
            pingtime = DT.strptime(line[:23],'%a %m/%d/%Y %H:%M:%S') # fixed format datetime format allows us just to cut the string precisely
        words = line.strip().split()
        if len(words) > 0:
            if words[0] == '4':
                if '*' in line:
                    # Found packet timeout in hop # 4
                    ping_errors[pingtime] = 1


# Create key-value pairs. Keys are the hours of the day (0 to 23)
# and values are the number of drops in each hour.
for i in range(24):
    data[i] = 0
    for x in ping_errors.keys():
        if x.time().hour == i:
            data[i] += 1


# Prepare the chart         
x_axis = list(data.keys())
y_axis = list(data.values())

fig, ax = plt.subplots()

ax.bar(x_axis, y_axis, align='center')

ax.set_title('10-second drops from ___ to ____')
ax.set_ylabel('Number of drops')
ax.set_xlabel('Hour')

ax.set_xticks(x_axis)
ax.set_yticks(y_axis)

plt.show()
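
The nested loop that tallies drops per hour can also be written with collections.Counter. The following is only a sketch: the timestamps are made up and stand in for the keys of ping_errors, and the name ping_times is not part of the code above.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for the keys of ping_errors.
ping_times = [
    datetime(2019, 2, 17, 0, 5, 10),
    datetime(2019, 2, 17, 13, 20, 44),
    datetime(2019, 2, 17, 13, 45, 59),
]

# Tally how many timestamps fall into each hour of the day.
drops = Counter(t.hour for t in ping_times)

# A Counter returns 0 for missing keys, so every hour 0..23 gets a value.
data = {hour: drops[hour] for hour in range(24)}

print(data[0], data[13], data[14])
```

The resulting data dict has the same shape as the one passed to the chart, so list(data.keys()) and list(data.values()) can be used for the axes unchanged.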

Nested dictionaries do not really look like the right tool for the job, because the syntax and the storage become convoluted, as you yourself have found out.

What you have is already much more tabular in format, straight from the output of ping, and it will be much easier to deal with if you keep it that way.

So, you want to store ping errors and be able to find out how many happened at what time.

If this were a large project, I would probably store the data in some external database and query it there. But let's see how that could work with Python.

Here are some things which won't work and need to be changed:

1) As Laurent mentioned in the comments, you shouldn't use the names of built-ins as variable names. "list" is not strictly a reserved word, but assigning to it hides the built-in list type, so it should be renamed to something else.

2) line is a string, and the result of line.strip() is still a string, not a list. If you want to split the line by whitespace, you should use something like: linewords=line.split() #and use this variable instead of your list variable

3) For date and time manipulation it is generally very helpful to use the appropriate modules, in this case datetime.datetime.
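
Points 2 and 3 can be illustrated with a short sketch. The sample line is taken from the log in the question; the variable names are only for illustration.

```python
from datetime import datetime

line = "Sun 02/17/2019 13:20:44.27 PING ERROR 1\n"

stripped = line.strip()    # still a str, only surrounding whitespace is removed
linewords = line.split()   # a list of whitespace-separated words

print(type(stripped).__name__)
print(linewords[1])

# The first 23 characters hold the weekday, date and time up to whole seconds.
pingtime = datetime.strptime(line[:23], "%a %m/%d/%Y %H:%M:%S")
print(pingtime.day, pingtime.hour, pingtime.minute)
```

Once the timestamp is a datetime object, the day, hour and minute are available as attributes, so there is no need to split the date and time strings by hand.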

So, the start of your loop could go something like this:

from datetime import datetime as DT

ping_errors = dict()

with open("2019-02-17_11-54-AM.log", "r") as fo:
    firstline = fo.readline()
    if 'ERROR' in firstline:  # this file has an error, so we will process it
        # the fixed-width timestamp lets us cut the string precisely
        pingtime = DT.strptime(firstline[:23], '%a %m/%d/%Y %H:%M:%S')
        ping_errors[pingtime] = list()
        for line in fo:
            words = line.strip().split()
            # skip blank lines; hop lines start with the hop number
            # and '*' marks a probe that timed out
            if len(words) > 1 and words[0].isdigit() and '*' in words[1:]:
                # this is a hop with an error; record its hop number
                ping_errors[pingtime].append(int(words[0]))
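
The sample log in the question holds several traces in one file, while the loop above only looks at the first line for a timestamp. A possible extension is sketched below, assuming every trace starts with a line containing "ERROR" in the format shown in the question. The function name parse_trace_log and the shortened sample text are made up for this illustration.

```python
from datetime import datetime

def parse_trace_log(lines):
    """Map each trace's timestamp to the list of hop numbers that lost a packet."""
    ping_errors = {}
    pingtime = None
    for line in lines:
        if "ERROR" in line:
            # A new trace starts here; cut the fixed-width timestamp.
            pingtime = datetime.strptime(line[:23], "%a %m/%d/%Y %H:%M:%S")
            ping_errors[pingtime] = []
            continue
        words = line.split()
        # Hop lines start with the hop number; '*' marks a lost probe.
        if pingtime is not None and words and words[0].isdigit() and "*" in words:
            ping_errors[pingtime].append(int(words[0]))
    return ping_errors

# Shortened, made-up sample in the same layout as the log in the question.
sample = """\
Sun 02/17/2019 13:20:44.27 PING ERROR 1

Tracing route to example [IP_REDACTED]
over a maximum of 30 hops:

  1    <1 ms    <1 ms    <1 ms  192.168.1.1
  4     *        *        *     Request timed out.

Trace complete.

Sun 02/17/2019 15:45:59.27 PING ERROR 3

  4    23 ms    12 ms    11 ms  [IP_REDACTED]
  6     8 ms     *        6 ms  [IP_REDACTED]

Trace complete.
"""

errors = parse_trace_log(sample.splitlines())
for when, hops in errors.items():
    print(when, hops)
```

Because the function only iterates over lines, an open file object can be passed in directly instead of sample.splitlines().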

After this you have a nice unnested dict ping_errors, which is keyed on datetime with precision to the second (yes, dict keys need not be strings, although strings are the more common choice), so you can filter the dict to query the times you are interested in.

The dict will look like this:

{datetime.datetime(2019, 2, 17, 13, 20, 44): [4], datetime.datetime(2019, 2, 17, 13, 33, 11): [7, 8]}

which means that on 17 Feb 2019 at 13:20:44 we had 1 error, in hop 4, and at 13:33:11 we had 2 errors, in hops 7 and 8 respectively.

Then, for example, a query to count how many hops had errors at 13 o'clock (any minute and second) would be (for the above fictional dict):

sum([len(ping_errors[x]) for x in ping_errors.keys() if x.time().hour==13])

Which hops were affected during this time?

[ping_errors[x] for x in ping_errors.keys() if x.time().hour==13]

What if we are only interested in seconds between 30 and 45?

sum([len(ping_errors[x]) for x in ping_errors.keys() if x.time().second >= 30 and x.time().second <= 45 ])
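
The queries above all follow the same pattern, so they can be wrapped in a small helper. This is only a sketch: the function name count_errors and its start/end parameters are made up for illustration, and the data is the fictional dict shown earlier.

```python
from datetime import datetime, time

# Fictional data in the same shape as ping_errors above.
ping_errors = {
    datetime(2019, 2, 17, 13, 20, 44): [4],
    datetime(2019, 2, 17, 13, 33, 11): [7, 8],
}

def count_errors(errors, start, end):
    """Count failed hops in traces whose time of day lies in [start, end]."""
    return sum(len(hops) for when, hops in errors.items()
               if start <= when.time() <= end)

# Traces between 13:00:00 and 13:30:00 (inclusive).
print(count_errors(ping_errors, time(13, 0), time(13, 30)))

# The whole of hour 13.
print(count_errors(ping_errors, time(13, 0), time(13, 59, 59)))
```

Comparing datetime.time objects directly avoids spelling out separate conditions on hour, minute and second.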
