Why is my code stopping?

Hey, I've encountered an issue where my program stops iterating through the file at record 57802, for a reason I cannot figure out. I put a heartbeat counter in so I could see which line it is on, and that helped, but now I am stuck on why it stops there. I thought it was a memory issue, but I just ran it on my computer with 6GB of memory and it still stopped.

Is there a better way to do anything I am doing below? My goal is to read the file (a 15MB text log; I can send it if you need it), find the lines that match the regular expression, and print each matching line. There is more to come, but that's as far as I have gotten. I am using Python 2.6.

Any ideas would help, and code comments too! I am a Python noob and am still learning.

import sys, os, os.path, operator
import re, time, fileinput

infile = os.path.join("C:\\","Python26","Scripts","stdout.log")

start = time.clock()

filename  = open(infile,"r")

match = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +\[([\w.]+)\] ((\w+).?)+:\d+ - (\w+)_SEARCH:(.+)')

count = 0
heartbeat = 0
for line in filename:
    heartbeat = heartbeat + 1
    print heartbeat
    lookup = match.search(line)
    if lookup:
        count = count + 1
        print line
end = time.clock()
elapsed = end-start
print "Finished processing at:",elapsed,"secs. Count of records =",count,"."

filename.close()

This is line 57802, where it fails:

2010-08-06 08:15:15,390 DEBUG [ah_admin] com.thg.struts2.SecurityInterceptor.intercept:46 - Action not SecurityAware; skipping privilege check.

This is a matching line:

2010-08-06 09:27:29,545 INFO  [patrick.phelan] com.thg.sam.actions.marketmaterial.MarketMaterialAction.result:223 - MARKET_MATERIAL_SEARCH:{"_appInfo":{"_appId":21,"_companyDivisionId":42,"_environment":"PRODUCTION"},"_description":"symlin","_createdBy":"","_fieldType":"GEO","_geoIds":["Illinois"],"_brandIds":[2883],"_archived":"ACTIVE","_expired":"UNEXPIRED","_customized":"CUSTOMIZED","_webVisible":"VISIBLE_ONLY"}

Sample data, just the first 5 lines:

2010-08-06 00:00:00,035 DEBUG [] com.thg.sam.jobs.PlanFormularyLoadJob.executeInternal:67 - Entered into PlanFormularyLoadJob: executeInternal
2010-08-06 00:00:00,039 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/hibbert@tccfp01.hibbertnet.com:21
2010-08-06 00:00:00,040 DEBUG [] com.thg.sam.email.EmailUtils.sendEmail:206 - org.apache.commons.mail.MultiPartEmail@446e79
2010-08-06 00:00:00,045 DEBUG [] com.thg.sam.services.OrderService.getOrdersWithStatus:121 - Orders list size=13
2010-08-06 00:00:00,045 DEBUG [] com.thg.ftpComponent.service.JScapeFtpService.open:153 - Opening FTP connection to sdrive/hibbert@tccfp01.hibbertnet.com:21

What does the input line that gives you trouble look like? I'd try printing that out. I suspect your CPU is pegged while this is running.

Nested regexps like the one you have can perform VERY badly when they don't match quickly.

((\w+).?)+:

Imagine a string that doesn't have the : in it but is fairly long. You'll end up in a world of backtracking as the regexp tries EVERY combination of ways to divide the word characters between \w+ and . and THEN tries to group them in every way possible. If you can be more specific in your pattern, it'll pay off big time.
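To see the effect on a small scale, here is a minimal sketch that times the nested group against short runs of word characters with no colon. The helper name time_failed_search and the chosen lengths are illustrative assumptions, and the actual timings depend on the machine and Python version; the point is only that the time should climb steeply as the input grows by a couple of characters.

```python
import re
import time

# The nested group from the question: a quantified group that itself
# contains a quantifier, followed by a colon the input never supplies.
nested = re.compile(r'((\w+).?)+:')


def time_failed_search(n):
    """Search n word characters with no colon; return (result, seconds)."""
    text = 'a' * n
    started = time.time()
    result = nested.search(text)
    return result, time.time() - started


# Lengths are kept small on purpose: the work grows roughly
# exponentially, so much longer inputs may appear to hang.
for n in (10, 12, 14, 16):
    result, seconds = time_failed_search(n)
    print("%2d chars: match=%r, %.4f s" % (n, result, seconds))
```

Every search ends with no match, yet each one has to exhaust all the ways of splitting the run before giving up.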

Your problem is definitely the part @paulrubel pointed out:

((\w+).?)+:\d+

Now that you've added sample data, it's obvious that the . is supposed to match a literal dot, which means you should have escaped it ( \. ). Also, you don't need the inner set of parentheses, and the outer set should be non-capturing, but it's the basic structure that's killing you; there are too many arrangements of word characters and dots it has to try before giving up. The other lines all fail before that part of the regex is attempted, which is why you don't have any problem with them.

When I try it in RegexBuddy, your regex matches the good line in 186 steps, and gives up on line 57802 after 1,000,000 steps. When I escape the dot, the good line takes only 90 steps to match, but it still times out on line 57802. But now I know that part of the regex can only match word characters and dots. Once it has consumed all of those it can, the next bit has to match :\d+ ; if it doesn't, I know there's no point trying other arrangements. I can use an atomic group to tell it not to bother:

(?>(?:\w+\.?)+):\d+

With that change, the good line matches in 83 steps, and line 57802 takes only 66 steps to report failure. But it's not always feasible to use atomic groups, so you should try to make your regex conform to the actual structure of the text it's matching. In this case you're matching what looks like a Java class name (some word characters, followed by zero or more instances of a dot and some more word characters), followed by a colon and a line number:

\w+(?:\.\w+)*:\d+

When I plug that into the regex, it matches the good line in 80 steps, and rejects line 57802 in 67 steps; the atomic group isn't even needed.
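As a rough sketch of what "plugging that in" could look like in Python, the pattern below swaps the restructured class-name part into the question's regex and runs it against the two lines quoted in the question. Two things here are assumptions rather than part of the original: the class name is wrapped in a single capturing group (so the group numbers differ from the question's pattern), and the JSON payload of the matching line is shortened for brevity. Python's re module only gained atomic groups in version 3.11, so on Python 2.6 the restructured pattern is the practical option anyway.

```python
import re

# The question's pattern with the nested group replaced by
# \w+(?:\.\w+)* . Capturing the whole class name as group 3 is an
# assumption made for illustration.
fixed = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} +\w+ +\[([\w.]+)\] '
    r'(\w+(?:\.\w+)*):\d+ - (\w+)_SEARCH:(.+)'
)

# Line 57802 from the question.
failing_line = ('2010-08-06 08:15:15,390 DEBUG [ah_admin] '
                'com.thg.struts2.SecurityInterceptor.intercept:46 - '
                'Action not SecurityAware; skipping privilege check.')

# The matching line from the question, with the JSON payload shortened.
matching_line = ('2010-08-06 09:27:29,545 INFO  [patrick.phelan] '
                 'com.thg.sam.actions.marketmaterial.'
                 'MarketMaterialAction.result:223 - '
                 'MARKET_MATERIAL_SEARCH:{"_description":"symlin"}')

# The non-matching line is now rejected quickly instead of hanging.
print("failing line: %r" % (fixed.search(failing_line),))

hit = fixed.search(matching_line)
print("timestamp=%s user=%s kind=%s" % (hit.group(1), hit.group(2), hit.group(4)))
```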

You compiled your regex but never use it?

lookup = re.search(match,line)

should be

lookup = match.search(line)

and you should use os.path.join()

infile = os.path.join("C:\\","Python26","Scripts","stdout.log")

Update:

Your regular expression can be simpler. Just check for the date and time stamp. Or else, don't use a regular expression at all. Say your date and time start at the beginning of the line:

for line in open("stdout.log"):
    s = line.split()
    D,T=s[0],s[1]
    # use the time module and strptime to check valid date/time
    # or you can split "-" on D and T and do manual check using > or < and math
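The strptime check mentioned in the comments could look roughly like this. The function name parse_timestamp is hypothetical, and the format string assumes the "YYYY-MM-DD HH:MM:SS,mmm" layout seen in the sample data, with the milliseconds simply dropped.

```python
import time


def parse_timestamp(line):
    """Return a time.struct_time for the leading timestamp, or None."""
    fields = line.split()
    if len(fields) < 2:
        return None
    date_part = fields[0]
    time_part = fields[1].split(',')[0]  # drop the ",035" milliseconds
    try:
        return time.strptime(date_part + ' ' + time_part,
                             '%Y-%m-%d %H:%M:%S')
    except ValueError:
        # Not a valid date/time, so this is not a regular log line.
        return None


# First line of the sample data from the question.
sample = ('2010-08-06 00:00:00,035 DEBUG [] '
          'com.thg.sam.jobs.PlanFormularyLoadJob.executeInternal:67 - '
          'Entered into PlanFormularyLoadJob: executeInternal')

stamp = parse_timestamp(sample)
print("%d-%02d-%02d" % (stamp.tm_year, stamp.tm_mon, stamp.tm_mday))
```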

Your pattern contains the fixed string _SEARCH: and a bunch of complicated expressions (including captures) that are really going to hammer the regex engine, but you don't do anything with the captured text, so all you want to know is "does it match?"

It may be simpler and quicker to just search for the fixed string on each line.

if '_SEARCH:' in line:
    print line
    count += 1
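If the captured text is needed later, one possible middle ground is to use the substring test as a cheap filter and run a small regex only on the lines that pass. This is a sketch under assumptions: the function name find_searches and the reduced pattern are illustrative, not taken from the question.

```python
import re

# A much smaller pattern, applied only to lines already known to
# contain the fixed string. Group 1 is the prefix, group 2 the payload.
detail = re.compile(r'(\w+)_SEARCH:(.+)')


def find_searches(lines):
    """Return (prefix, payload) pairs for lines containing _SEARCH:."""
    hits = []
    for line in lines:
        if '_SEARCH:' not in line:
            continue  # cheap rejection, no regex involved
        found = detail.search(line)
        if found:
            hits.append((found.group(1), found.group(2)))
    return hits


# Two shortened example lines modelled on the question's data.
example_lines = [
    'result:223 - MARKET_MATERIAL_SEARCH:{"_description":"symlin"}',
    'intercept:46 - Action not SecurityAware; skipping privilege check.',
]
print(find_searches(example_lines))
```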

Try using pdb. If you put pdb.set_trace() next to your heartbeat so that it fires shortly before the program stops, you can look at the specific line it's stopping on and see what each of your lines of code does with that line.

Edit: An example of pdb use:

import pdb
for i in range(50):
    print i
    if i == 12:
        pdb.set_trace()

Run that script, and you'll get something like the following:

0
1
2
3
4
5
6
7
8
9
10
11
12
> <stdin>(1)<module>()
(Pdb)

Now you can evaluate Python expressions from the context of i=12.

(Pdb) print i
12

Use that, but put the pdb.set_trace() in your loop right after you increment heartbeat, guarded by if heartbeat == 57802. Then you can print the line with p line, the result of your regex search with p match.search(line), and so on.
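Applied to the loop from the question, that could look something like the sketch below. Wrapping the loop in a function named scan with a break_at parameter is an assumption made so the breakpoint can be switched on and off; the debugger is entered only when break_at is given.

```python
import pdb
import re


def scan(lines, pattern, break_at=None):
    """Count matching lines; stop in pdb when the heartbeat hits break_at."""
    count = 0
    heartbeat = 0
    for line in lines:
        heartbeat += 1
        if heartbeat == break_at:
            # At the (Pdb) prompt: "p line", "p heartbeat", "c" to continue.
            pdb.set_trace()
        if pattern.search(line):
            count += 1
    return count
```

With the names from the question, a call such as scan(open(infile), match, break_at=57802) would pause just before the problem line is searched.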

It might be a memory issue anyway. With huge files it's probably better to use the fileinput module instead, like this:

import fileinput
for line in fileinput.input([infile]):
    lookup = match.search(line)
    # etc.
