简体   繁体   中英

Iterate through file but ignore certain line break characters?

I know that I can read the entire file into memory and simply replace the offending character in memory then iterate through the stored file, but I don't want to do that because these are MASSIVE text files (often exceeding 4GB).

With that said, I want to iterate line by line through a file (which has been properly encoded as utf-8 using codecs) but I don't want line breaks to occur on the \\x0b ( \\v ) character. Unfortunately, there is some binary data that shows up in my file that has the \\x0b character. Naturally, this causes a line break which ends up splitting up some lines that I need to keep intact. I'd like to ignore this character when determining where line breaks should occur while iterating through the file.

Is there a parameter or approach that will enable me to do this? I'm ok with writing my own generator to iterate line by line through the file by specifying my own valid line break characters, but I'm not sure if there isn't a simpler approach, and I'm not sure how to do this since I'm using the codecs library to handle encoding.

Here are some (sanitized) sample data:

Record#|EventID|Date| Time-UTC|Level|computer name|param_01|param_02|param_03|param_04|param_05|param_06|source name|event log
84491|682|03/19/2015| 21:59:16.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0xF38058)|RDP-Tcp#12|RogueApp|10.3.98.6|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90582|682|04/03/2015| 14:42:14.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#5|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90613|682|04/03/2015| 16:26:03.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºà¨€A਀Aì°†éªá… ê±ºà¬€A଀Aé¶é«á… Ö Î„|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
90626|682|04/03/2015| 16:57:35.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x35BDF)|RDP-Tcp#11|RogueApp|10.3.98.14|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91018|682|04/04/2015| 13:56:13.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#33|Anonymous|10.3.58.13|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91038|682|04/04/2015| 14:09:19.000|a-pass|WKS-WINXP32BIT|sample_user|SampleGroup|(0x0,0x100513C)|RDP-Tcp#39|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ¸€x渀xì°†éªá… ê±ºæ¬€x欀xé¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91064|682|04/04/2015| 15:25:33.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x11FA916)|RDP-Tcp#43|CONTROLLER|10.3.58.4|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91163|682|04/04/2015| 16:40:19.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#2|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºá´€æ®–ᴀ殖찆éªá… ê±ºã¬€æ®–㬀殖é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91204|682|04/04/2015| 18:10:55.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#5|Anonymous's Mac|192.168.1.18ì°†éªá…°ê±ºæ˜€æ˜€ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91545|682|04/05/2015| 13:41:58.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#7|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºìˆ€ìˆ€ì°†éªá… ê±ºëŒ€ëŒ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
91567|682|04/05/2015| 14:42:21.000|a-pass|WKS-WINXP32BIT|Anonymous|SampleGroup|(0x0,0x37D49)|RDP-Tcp#9|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºæ €æ €ì°†éªá… ê±ºæ„€æ„€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt
92120|682|04/06/2015| 19:06:43.000|a-pass|WKS-WINXP32BIT|ACN-Helpdesk|WKS-WINXP32BIT|(0x0,0x3D6DB)|RDP-Tcp#2|Anonymous's Mac|192.168.1.14ì°†éªá…°ê±ºç„€ç„€ì°†éªá… ê±ºçœ€çœ€é¶é«á… Ð€Ì€|Security|C:\Users\sampleuser\EventLogs\problem-child\SecEvent.Evt

It parses everything fine except for the very last row. Yes I know there shouldn't be binary data in a CSV file, but there is. And I have no choice in that matter.

>>> with open("out.test","wb") as f:
...    f.write("a\va\nb\rq")
...
>>> for line in open("out.test","rb"):
...    print line.decode("utf8")
...
a♂a

q

seems fine in python 2.7 ... what kind of encoding is this file that this wont work?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM