Python script to search and export results to .csv file

I'm trying to do the following in Python, also using some bash scripting, unless there is an easier way in pure Python.

I have a log file with data that looks like the following:

16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ  
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL  
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL

I need to extract the last field from each line, then take each unique value, search for every line it appears in, and export those lines to a .csv file.

I've used the following bash script to get each unique entry:

cat LogFile_`date +%Y%m%d`.msg.log | awk '{print $14}' | sort | uniq

Based on the above data in the log file, the bash script would return the following results:

ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ

Now I would like to search (or grep) for each of those results in the same log file and return the top ten results. I have another bash script to do this, but how do I do this using a for loop? So, for x, where x = each entry above,

grep x LogFile_`date +%Y%m%d`.msg.log | awk '{print $7}' | sort -nr | uniq | head -10

Then return the results into a .csv file. The results would look like this (each field in a separate column):

Column-A  Column-B  Column-C  Column-D
ABCD      2sxrb6ab  Cancel    46ms
ABCD      2sxrb6af  Cancel    45ms
ABCD      2sxrb6i2  Cancel    63ms
ABCD      2sxrb6i3  Cancel    103ms
EFGH      2sxrb6i4  Cancel    60ms
EFGH      2sxrb6i7  Cancel    60ms
IJKL      2sxrb6ie  Ack       74ms
IJKL      2sxrb6if  Ack       74ms
IJKL      2sxrb76s  Cancel    46ms
MNOP      vcxrqrs5  Cancel    7651ms

I'm a beginner in Python and haven't done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.

Say you've opened your file. What you want to do is record how many times each individual entry appears in there; that is, each entry will end up with one or more timings:

from collections import defaultdict

entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)
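
Here parse is just a placeholder. A minimal sketch of one way to implement it, assuming the " - " separated line format shown in the question (the helper name parse and the field positions are assumptions):

def parse(line):
    # Split on the " - " separators in lines like:
    # "16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD"
    fields = line.strip().split(' - ')
    venue = fields[-1].split(': ')[1]   # e.g. 'ABCD'
    timing = fields[2].split(': ')[1]   # e.g. '100ms'
    return venue, timing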

When you're done, you have a dictionary like so:

{ 'ABCD': ['30ms', '25ms', '12ms'],
  'EFGH': ['12ms'],
  'IJKL': ['2ms', '14ms'] }

What you'll want to do now is transform this dictionary into another data structure ordered by the len of its values (which are lists). Example:

In [15]: sorted(((k, v) for k, v in entries.items()), 
                key=lambda i: len(i[1]), reverse=True)
Out[15]: 
[('ABCD', ['30ms', '25ms', '12ms']),
 ('IJKL', ['2ms', '14ms']),
 ('EFGH', ['12ms'])]

Of course this is only illustrative and you might want to collect some more data in the original for loop.
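
If you then want to write these out as a .csv, the csv module handles the columns for you. A minimal sketch, assuming entries is the dictionary built above (the output file name latencies.csv is made up):

import csv

# One row per (venue, timing) pair, busiest venues first; 'latencies.csv' is a hypothetical name
with open('latencies.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for venue, timings in sorted(entries.items(), key=lambda i: len(i[1]), reverse=True):
        for timing in timings:
            writer.writerow([venue, timing])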

Maybe not as concise as you might think... but I think this can solve your problem. I added some try...except to better handle real data.

import re
import os
import csv
import collections

# Get all log files under the current directory; of course this pattern can be
# more sophisticated, but that's not our focus here.
log_pattern = re.compile(r"LogFile_[0-9]{8}\.msg\.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]

# top n
nhead = 10
# used to parse the useful fields (both Ack and Cancel lines)
extract_pattern = re.compile(
    r'.*(Ack|Cancel) Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')
# container for the final results, keyed by venue
res = collections.defaultdict(list)

# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of a blank line or a line without these fields
                ltype, latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, ltype, latency))

# write to csv, one field per column
with open('res.csv', 'w', newline='') as resf:
    resc = csv.writer(resf)
    for venue in sorted(res):  # sort by Venue
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for orderid, ltype, latency in entries[:nhead]:  # at most nhead rows per venue
            resc.writerow([venue, orderid, ltype, latency])
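
Note that this takes the first nhead entries per venue after sorting by OrderId. If you want the ten slowest entries per venue instead (the sort -nr | head -10 from the question), you could sort on the numeric latency, for example entries.sort(key=lambda e: int(e[2].rstrip('ms')), reverse=True).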
