
Remove specific lines from a large text file in Python

I have several large text files that all have the same structure. I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't want to read the entire dataset and then modify it, as each file is over 100MB with over 4 million records.

Range   150.0dB -64.9dBm
Mobile unit 1   Base    -17.19968    145.40369  999.8
Fixed unit  2   Mobile  -17.20180    145.29514  533.0
Latitude    Longitude   Rx(dB)  Best unit
-17.06694    145.23158  -050.5  2
-17.06695    145.23297  -044.1  2

So lines 1, 2 and 3 should be deleted, and in line 4 "Rx(dB)" should become just "Rx" and "Best unit" should become "Best_Unit". Then I can use my other scripts to geocode the data.

I can't use command-line programs like grep (as in this question), as the first 3 lines are not the same in every file: the numbers (such as 150.0dB, -64*) change from file to file. So the whole of lines 1-3 has to be deleted first, and then grep or similar can do the search-and-replace on line 4.

Thanks guys,

=== EDIT === I tried the new Pythonic way to handle larger files from @heltonbiker, but it raises an error.

import os, re
##infile = arcpy.GetParameter(0)
##chunk_size = arcpy.GetParameter(1) # number of records in each dataset

infile='trc_emerald.txt'
fc= open(infile)
Name = infile[:infile.rfind('.')]
outfile = Name+'_db.txt'

line4 = fc.readlines(100)[3]
line4 = re.sub('\([^\)].*?\)', '', line4)
line4 = re.sub('Best(\s.*?)', 'Best_', line4)
newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
fc.close()
newfile = open(outfile, 'w')
newfile.write(newfilestring)
newfile.close()

del lines
del outfile
del Name
#return chunk_size, fl
#arcpy.SetParameterAsText(2, fl)
print "Completed"

Traceback (most recent call last):
  File "P:\\2012\\Job_044_DM_Radio_Propogation\\Working\\FinalPropogation\\TRC_Emerald\\working\\clean_file_1c.py", line 13, in <module>
    newfilestring = ''.join(line4 + [line for line in fc.readlines[4:]])
TypeError: 'builtin_function_or_method' object is unsubscriptable

As wim said in the comments, sed is the right tool for this. The following command should do what you want:

sed -i -e '4 s/(dB)//' -e '4 s/Best [Uu]nit/Best_Unit/' -e '1,3 d' yourfile.whatever

To explain the command a little:

-i edits the file in place, that is, it writes the output back into the input file

-e adds a command to execute

'4 s/(dB)//' on line 4, substitute '' for '(dB)'

'4 s/Best [Uu]nit/Best_Unit/' same as above, except with different find and replace strings; [Uu] matches either capitalisation of "unit"

'1,3 d' from line 1 to line 3 (inclusive), delete the entire line

sed is a really powerful tool that can do much more than just this, and it is well worth learning.
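For readers who would rather stay in Python, the same three edits can be applied while streaming the file one line at a time, so memory use stays flat however large the file is. This is only a minimal sketch and not part of the original answer: the function name `clean_file`, the separate output path and the case-insensitive match on "Best unit" are assumptions made for illustration.

```python
import re


def clean_file(src_path, dst_path, skip=3):
    """Copy src_path to dst_path, dropping the first `skip` lines and
    tidying the header line that follows them (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for number, line in enumerate(src, start=1):
            if number <= skip:
                continue  # same effect as sed's '1,3 d'
            if number == skip + 1:
                # same effect as the two '4 s/.../.../' commands
                line = line.replace("(dB)", "")
                line = re.sub(r"Best unit", "Best_Unit", line, flags=re.IGNORECASE)
            dst.write(line)
```

Unlike `sed -i`, this writes to a second file and leaves the input untouched, which makes it easier to check the result before replacing anything.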

Just try it for each file. 100 MB per file is not that big, and as you can see, the code for a first attempt does not take long to write.

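with open('file.txt') as f:
    lines = f.readlines()
lines = lines[3:]  # drop the first three lines
# The header is now the first remaining line; match the case used in the sample.
lines[0] = lines[0].replace('Rx(dB)', 'Rx')
lines[0] = lines[0].replace('Best unit', 'Best_Unit')
with open('output.txt', 'w') as f:
    # readlines() keeps each trailing newline, so write the lines back as they are
    f.writelines(lines)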
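Since the question mentions several files with the same structure, the read-everything approach can be wrapped in a loop. The sketch below is an illustration under assumptions: the helper names `clean_lines` and `clean_all` and the glob pattern are hypothetical, while the `_db` suffix for the output name is borrowed from the script in the question.

```python
import glob
import os


def clean_lines(lines):
    """Return the lines with the first three dropped and the header tidied."""
    kept = lines[3:]
    if kept:
        kept[0] = kept[0].replace("Rx(dB)", "Rx").replace("Best unit", "Best_Unit")
    return kept


def clean_all(pattern):
    """Write a cleaned '<name>_db<ext>' next to every file matching pattern."""
    written = []
    for path in sorted(glob.glob(pattern)):
        root, ext = os.path.splitext(path)
        if root.endswith("_db"):
            continue  # skip output left over from an earlier run
        with open(path) as f:
            lines = f.readlines()  # whole file in memory, as in the answer
        out_path = root + "_db" + ext
        with open(out_path, "w") as f:
            f.writelines(clean_lines(lines))
        written.append(out_path)
    return written
```

A call such as `clean_all('trc_*.txt')` would then process every matching file in the current directory and return the list of files it wrote.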

You can use file.readlines() with an additional argument in order to read just the first few lines of the file. From the docs:

f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
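A small experiment makes the effect of the size hint easier to see. The sketch below is illustrative only: the helper `read_head` and the throwaway test file are assumptions, and the exact cut-off point is an implementation detail, so treat the hint as approximate rather than as a line count.

```python
import os
import tempfile


def read_head(path, sizehint):
    """Return the lines readlines(sizehint) yields from the start of path,
    together with the next unread line (hypothetical helper)."""
    with open(path) as f:
        head = f.readlines(sizehint)
        following = f.readline()
    return head, following


# Build a throwaway file of ten lines, each 10 characters long.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("line %04d\n" % i for i in range(10))

head, following = read_head(tmp.name, 25)
print(len(head), repr(following))
os.remove(tmp.name)
```

The point to take away is that only complete lines come back and the rest of the file is still waiting to be read, so the hint has to be large enough to cover every line you intend to index.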

Then, the most robust way to manipulate generic strings is regular expressions. In Python, this means the re module with, for example, the re.sub() function.
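As a quick illustration of re.sub() on the header line from the question, here is a minimal sketch. The function name `clean_header` and the case-insensitive match are assumptions chosen for the example, not something the question specifies.

```python
import re


def clean_header(header):
    """Strip the '(dB)' suffix and join 'Best unit' with an underscore."""
    header = re.sub(r"\(dB\)", "", header)
    header = re.sub(r"Best\s+unit", "Best_Unit", header, flags=re.IGNORECASE)
    return header


print(clean_header("Latitude    Longitude   Rx(dB)  Best unit"))
```

Lines that contain neither pattern pass through unchanged, so applying the function to a data row by mistake does no harm.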

My suggestion, which should be adapted to suit your needs:

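import re

f = open('somefile.txt')
newfile = open('someotherfile.txt', 'w')
# Read only the start of the file. The sizehint of 1000 bytes is an assumption:
# it must be large enough to cover the first four lines.
head = f.readlines(1000)
line4 = head[3]
line4 = re.sub(r'\([^\)].*?\)', '', line4)      # "Rx(dB)" -> "Rx"
line4 = re.sub(r'Best(\s.*?)', 'Best_', line4)  # "Best unit" -> "Best_unit"
newfile.write(line4)
newfile.writelines(head[4:])  # data lines that were read along with the header
for line in f:                # stream the rest instead of loading it all
    newfile.write(line)
f.close()
newfile.close()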
