I have a file in the following format:
string1 string2 ........ stringN
value1,1 value1,2 ........ value1,N
. . ........ .
. . ........ .
. . ........ .
valueM,1 valueM,2 ........ valueM,N
M is on the order of 10,000 and N is on the order of 100.
Which I need to;
from this file respectively.
It gets very tricky with numpy, since the data also contains strings (the title of each column). I would appreciate any guidance.
You have a custom ASCII-table-like format with fixed-width columns:
*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
* Row * Instance * test_string * test_string * test_string * test_string * test_string * test_string * test_string * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__722 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * string__720 * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * HCAL_SlowDa * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__718 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__719 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__723 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * string__721 * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str * another_str *
*********************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
* 0 * 0 * 0 * 50331648 * test_string * 2 * 1 * 13 * 5.76460e+18 * 0 * 50331648 * string__722 * 2 * 1 * 606 * 5.83666e+18 * 0 * 50331648 * string__720 * 2 * 1 * 575 * 5.83666e+18 * 0 * 50331648 * HCAL_SlowDa * 2 * 1 * 36 * 5.76460e+18 * 0 * 50331648 * string__718 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__719 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__723 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * string__721 * 2 * 1 * 529 * 5.83666e+18 * 0 * 50331648 * 212135 * 15080 * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 1 * 0 * 50331648 * * 2 * 1 * 13 * 0 * 0 * 50331648 * * 2 * 1 * 606 * 53440 * 0 * 50331648 * * 2 * 1 * 575 * 53440 * 0 * 50331648 * * 2 * 1 * 36 * 0 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * * 2 * 1 * 529 * 53440 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 2 * 0 * 50331648 * * 2 * 1 * 13 * 4294970636 * 0 * 50331648 * * 2 * 1 * 606 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 575 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 36 * 2.70217e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.09780e+16 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 3 * 0 * 50331648 * * 2 * 1 * 13 * 352321545 * 0 * 50331648 * * 2 * 1 * 606 * 2.30610e+18 * 0 * 50331648 * * 2 * 1 * 575 * 2.30610e+18 * 0 * 50331648 * * 2 * 1 * 36 * 7.30102e+18 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
* 0 * 4 * 0 * 50331648 * * 2 * 1 * 13 * 0 * 0 * 50331648 * * 2 * 1 * 606 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 575 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 36 * 2.82590e+16 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * * 2 * 1 * 529 * 1.15294e+19 * 0 * 50331648 * 212135 * * 1 * 1 * 3340 * 1057 * 1.399999976 *
If we assume that none of the actual data fields contain asterisks themselves, the easiest way to read each row is to use a regular expression to split out the lines.
To write the output, I'd still use the csv module, because that would make future processing that much easier:
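import csv
import re
from itertools import islice

# A field separator is an asterisk with optional whitespace on either side.
row_split = re.compile(r'\s*\*\s*')

# On Python 3 the csv module expects text-mode files opened with newline=''.
with open(someinputfile, 'r') as infile, open(outputfile, 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='\t')
    next(islice(infile, 3, 3), None)  # skip the first 3 lines in the input file
    for line in infile:
        # [1:-1] drops the empty strings outside the leading and trailing asterisks
        row = row_split.split(line)[1:-1]
        if not row:
            continue
        writer.writerow(row[8::7])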
This skips empty rows and writes only every 7th column, starting from the ninth (index 8); all other columns are dropped.
The first row thus is:
['5.76460e+18', '5.83666e+18', '5.83666e+18', '5.76460e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '5.83666e+18', '3340']
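To see how the split and the slice behave on a single line, here is a minimal sketch. The helper name parse_row and the shortened sample lines are made up for illustration and are not taken from the real file; the pattern is the same one used above.

```python
import re

# An asterisk with optional whitespace on either side separates the fields.
row_split = re.compile(r'\s*\*\s*')

def parse_row(line):
    """Split one '*'-delimited table line into its fields (hypothetical helper)."""
    # The outer asterisks produce an empty string at each end; [1:-1] drops them.
    return row_split.split(line)[1:-1]

# Hypothetical, shortened sample lines.
sample = "*   0 *   1 *  50331648 *   test_string * 5.76460e+18 *\n"
fields = parse_row(sample)
print(fields)        # all five fields
print(fields[1::2])  # every 2nd field, starting from the second

# An empty cell between two asterisks is kept as an empty string.
print(parse_row("* 0 * * 2 *\n"))
```

Because empty cells survive as empty strings, the column positions stay aligned across rows, which is what makes a fixed slice such as row[8::7] safe to use.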
This removes empty lines (assuming original is an iterable of the lines read from the file):
import re
filtered = filter(lambda x: not re.match(r'^\s*$', x), original)
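As a possible alternative, the same filtering can be sketched without a regular expression, since a line that is empty or whitespace-only becomes an empty string after strip(). The function name drop_blank_lines is hypothetical.

```python
def drop_blank_lines(lines):
    """Return only the lines that contain something other than whitespace."""
    # line.strip() is '' (falsy) exactly when the line is empty or all whitespace.
    return [line for line in lines if line.strip()]

# Hypothetical input: two real lines separated by blank and whitespace-only lines.
lines = ["value1\n", "\n", "   \t\n", "value2\n"]
print(drop_blank_lines(lines))
```

Returning a list rather than a lazy filter object means the result can be indexed or iterated more than once, at the cost of holding all kept lines in memory.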
To remove specific lines (I assume your data is stored in a text file):
f = open("textfile.txt","r")
lines = f.readlines()
f.close()
f = open("newfile.txt","w")
Write your lines back, except the lines you want to delete:
skip = [0, 1, 6, 13, 20]  # remove the 1st and 2nd as well as the 7th, 14th and 21st line
for i, line in enumerate(lines):
    if i not in skip:
        f.write(line)
At the end, close the file again.
f.close()
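The same steps can also be sketched with context managers, so both files are closed even if an error occurs, and without reading the whole file into memory first. The names drop_lines and drop_lines_in_file are hypothetical, and the indices are zero-based as in the code above.

```python
def drop_lines(lines, skip):
    """Yield every line whose zero-based index is not in `skip`."""
    skip = set(skip)  # set membership tests are fast even for many indices
    for i, line in enumerate(lines):
        if i not in skip:
            yield line

def drop_lines_in_file(src_path, dst_path, skip):
    """Copy src_path to dst_path, leaving out the lines listed in `skip`."""
    # Both files are closed automatically when the with block ends.
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(drop_lines(src, skip))

# Hypothetical in-memory example.
print(list(drop_lines(["a\n", "b\n", "c\n", "d\n"], [0, 2])))
```

With the file names used above, a call would look like drop_lines_in_file("textfile.txt", "newfile.txt", [0, 1, 6, 13, 20]).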