[英]Python, remove specific columns from file
I have a file in the following format ; 我有一个以下格式的文件;
string1 string2 ........ stringN
value1,1 value1,2 ........ value1,N
. . ........ .
. . ........ .
. . ........ .
valueM,1 valueM,2 ........ valueM,N
M is on the scale of 10000 N is on the scale of 100 M在10000的尺度上N在100的尺度上
Which I need to; 我需要
from this file respectively. 从该文件分别。
it gets very tricky with numpy since there are strings (titles of each column) in this data as well. numpy变得非常棘手,因为此数据中也包含字符串(每列的标题)。 I would appreciate any guidance. 我将不胜感激。
You have a custom ASCII-table-like format with fixed-with columns: 您有一种自定义的类似于ASCII表的格式,带有固定列:
*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
* Row * Instance * test_string * test_string * test_string * test_string * test_string * test_string * test_string * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str *
*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
* 0 * 0 * 0 * 50331648 * test_string * 2 * 1 * 13 * 5.76460e+18 * 0 * 50331648 * string__722 * 2 * 1 * 606 * 5.83666e+18 * 0 * 50331648 * string__720 * 2 * 1 * 575 * 5.83666e+18 * 0 * 50331648 * HCAL_SlowDa * 2 * 1 * 36 * 5.76460e+18 * 0 * 50331648 * string__718 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__719 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__723 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__721 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * 212135 * 15080 * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 1 * 0 * 50331648 * * 2 * 1 * 13 * 0 * 0 * 50331648 * * 2 * 1 * 606 * 53440 * 0 * 50331648 * * 2 * 1 * 575 * 53440 * 0 * 50331648 * * 2 * 1 * 36 * 0 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 2 * 0 * 50331648 * * 2 * 1 * 13 * 4294970636 * 0 * 50331648 * * 2 * 1 * 606 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 575 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 36 * 2.70217e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 3 * 0 * 50331648 * * 2 * 1 * 13 * 352321545 * 0 * 50331648 * * 2 * 1 * 606 * 2.30610e+18 * 0 * 50331648 * * 2 * 1 * 575 * 2.30610e+18 * 0 * 50331648 * * 2 * 1 * 36 * 7.30102e+18 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 4 * 0 * 50331648 * * 2 * 1 * 13 * 0 * 0 * 50331648 * * 2 * 1 * 606 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 575 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 36 * 2.82590e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
If we assume that none of the actual data fields contain asterisks themselves, the easiest way to read each row is to use a regular expression to split out the lines. 如果我们假设实际的数据字段本身都不包含星号,则读取每一行的最简单方法是使用正则表达式将行分开。
To output, I'd still use the csv
module , because that would make future processing that much easier: 要输出,我仍将使用csv
模块 ,因为这将使以后的处理变得更加容易:
import csv
import re
from itertools import islice
row_split = re.compile('\s*\*\s*')
with open(someinputfile, 'rb') as infile, open(outputfile, 'wb') as outfile:
writer = csv.writer(outfile, delimiter='\t')
next(islice(infile, 3, 3), None) # skip the first 3 lines in the input file
for line in infile:
row = row_split.split(line)[1:-1]
if not row: continue
writer.writerow(row[8::7])
This skips empty rows, and writes only every 7th column (counting from number nine) and skips the rest. 这将跳过空的行,并仅写入每第7列(从第9列开始计数),并跳过其余的列。
The first row thus is: 因此,第一行是:
['5.76460e+18', '5.83666e+18', '5.83666e+18', '5.76460e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '3340']
This is removing empty lines: 这将删除空行:
filtered = filter(lambda x: not re.match(r'^\s*$', x), original)
To remove a specific column (I assume your data is stored in a text file): 要删除特定的列(我假设您的数据存储在文本文件中):
f = open("textfile.txt","r")
lines = f.readlines()
f.close()
f = open("newfile.txt","w")
Write your lines back, except the lines you want to delete: 写回您的行,但要删除的行除外:
list = [0, 1, 6, 13, 20] # remove first,second as well as 7th, 14th and 21th line
for i,line in enumerate(lines):
if i not in list:
f.write(line)
At the end, close the file again. 最后,再次关闭文件。
f.close()
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.