Python, remove specific columns from file

Question

I have a file in the following format ;

string1     string2         ........    stringN

value1,1    value1,2    ........    value1,N
   .            .       ........        .
   .            .       ........        .
   .            .       ........        .
valueM,1    valueM,2    ........    valueM,N

M is on the scale of 10000 N is on the scale of 100

Which I need to;

remove empty lines
remove first two columns
keep 7th,14th,21th ... columns and delete the rest

from this file respectively.

it gets very tricky with numpy since there are strings (titles of each column) in this data as well. I would appreciate any guidance.

Answer 1

You have a custom ASCII-table-like format with fixed-with columns:

*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
*    Row   * Instance * test_string * test_string * test_string * test_string * test_string * test_string * test_string * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str *
*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
*        0 *        0 *           0 *    50331648 * test_string *           2 *           1 *          13 * 5.76460e+18 *           0 *    50331648 * string__722 *           2 *           1 *         606 * 5.83666e+18 *           0 *    50331648 * string__720 *           2 *           1 *         575 * 5.83666e+18 *           0 *    50331648 * HCAL_SlowDa *           2 *           1 *          36 * 5.76460e+18 *           0 *    50331648 * string__718 *           2 *           1 *         529 * 5.83666e+18 *           0 *    50331648 * string__719 *           2 *           1 *         529 * 5.83666e+18 *           0 *    50331648 * string__723 *           2 *           1 *         529 * 5.83666e+18 *           0 *    50331648 * string__721 *           2 *           1 *         529 * 5.83666e+18 *           0 *    50331648 *      212135 *       15080 *           1 *           1 *        3340 *        1057 * 1.399999976 *
*        0 *        1 *           0 *    50331648 *             *           2 *           1 *          13 *           0 *           0 *    50331648 *             *           2 *           1 *         606 *       53440 *           0 *    50331648 *             *           2 *           1 *         575 *       53440 *           0 *    50331648 *             *           2 *           1 *          36 *           0 *           0 *    50331648 *             *           2 *           1 *         529 *       53440 *           0 *    50331648 *             *           2 *           1 *         529 *       53440 *           0 *    50331648 *             *           2 *           1 *         529 *       53440 *           0 *    50331648 *             *           2 *           1 *         529 *       53440 *           0 *    50331648 *      212135 *             *           1 *           1 *        3340 *        1057 * 1.399999976 *
*        0 *        2 *           0 *    50331648 *             *           2 *           1 *          13 *  4294970636 *           0 *    50331648 *             *           2 *           1 *         606 * 1.09780e+16 *           0 *    50331648 *             *           2 *           1 *         575 * 1.09780e+16 *           0 *    50331648 *             *           2 *           1 *          36 * 2.70217e+16 *           0 *    50331648 *             *           2 *           1 *         529 * 1.09780e+16 *           0 *    50331648 *             *           2 *           1 *         529 * 1.09780e+16 *           0 *    50331648 *             *           2 *           1 *         529 * 1.09780e+16 *           0 *    50331648 *             *           2 *           1 *         529 * 1.09780e+16 *           0 *    50331648 *      212135 *             *           1 *           1 *        3340 *        1057 * 1.399999976 *
*        0 *        3 *           0 *    50331648 *             *           2 *           1 *          13 *   352321545 *           0 *    50331648 *             *           2 *           1 *         606 * 2.30610e+18 *           0 *    50331648 *             *           2 *           1 *         575 * 2.30610e+18 *           0 *    50331648 *             *           2 *           1 *          36 * 7.30102e+18 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *      212135 *             *           1 *           1 *        3340 *        1057 * 1.399999976 *
*        0 *        4 *           0 *    50331648 *             *           2 *           1 *          13 *           0 *           0 *    50331648 *             *           2 *           1 *         606 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         575 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *          36 * 2.82590e+16 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *             *           2 *           1 *         529 * 1.15294e+19 *           0 *    50331648 *      212135 *             *           1 *           1 *        3340 *        1057 * 1.399999976 *

If we assume that none of the actual data fields contain asterisks themselves, the easiest way to read each row is to use a regular expression to split out the lines.

To output, I'd still use the csv module , because that would make future processing that much easier:

import csv
import re
from itertools import islice

row_split = re.compile('\s*\*\s*')

with open(someinputfile, 'rb') as infile, open(outputfile, 'wb') as outfile:
    writer = csv.writer(outfile, delimiter='\t')

    next(islice(infile, 3, 3), None) # skip the first 3 lines in the input file

    for line in infile:
        row = row_split.split(line)[1:-1]
        if not row: continue
        writer.writerow(row[8::7])

This skips empty rows, and writes only every 7th column (counting from number nine) and skips the rest.

The first row thus is:

['5.76460e+18', '5.83666e+18', '5.83666e+18', '5.76460e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '3340']

Answer 2

This is removing empty lines:

filtered = filter(lambda x: not re.match(r'^\s*$', x), original)

To remove a specific column (I assume your data is stored in a text file):

f = open("textfile.txt","r")
lines = f.readlines()
f.close()
f = open("newfile.txt","w")

Write your lines back, except the lines you want to delete:

list = [0, 1, 6, 13, 20] # remove first,second as well as 7th, 14th and 21th line
for i,line in enumerate(lines):
  if i not in list:
    f.write(line)

At the end, close the file again.

f.close()

Python, remove specific columns from file

Question

2 answers

solution1
3 ACCPTED 2013-04-20 18:39:08

solution2
0 2013-04-20 18:38:13

Python, remove specific columns from file

Question

2 answers

solution1 3 ACCPTED 2013-04-20 18:39:08

solution2 0 2013-04-20 18:38:13

solution1
3 ACCPTED 2013-04-20 18:39:08

solution2
0 2013-04-20 18:38:13