I'm working on an engineering project in which I'm using machine performance data from archives. The machine produces one data set approximately every 5 s, and this data is available date-wise across a number of .txt files, each containing data in the following format. The data shown below is from the 2013_04_17.txt file, which holds all the performance data for that particular date.
2013-04-27 00:00:05.011
V_1 100 V_2 26695 V_3 33197 V_4 c681 V_5 29532
V_6 4600 V_7 4606 V_8 4f55 V_9 5a V_10 8063 V_11 4300 V_12 4700
V_13 4504 V_14 4400 V_15 4202 V_16 255 V_17 4300 V_18 91 V_19 6f
V_20 300 V_21 14784
V_22 5.085 V_23 7.840 V_24 -8.061 V_25 36.961
2013-04-27 00:00:10.163
V_1 100 V_2 26695 V_3 33199 V_4 c681 V_5 29872
V_6 4600 V_7 4606 V_8 4f55 V_9 5a V_10 8063 V_11 4300 V_12 4700
V_13 4504 V_14 4400 V_15 4202 V_16 255 V_17 4300 V_18 91 V_19 6f
V_20 300 V_21 14790
V_22 5.085 V_23 7.840 V_24 -8.061 V_25 37.961
..........
I need to view this data in tabular format or as a CSV in order to produce performance plots and detect any anomalies. However, I don't have enough experience with Python to be able to parse this text file.
I've looked into pandas and regular expressions for ideas, but I've been failing to achieve the desired result. I'm hoping to end up with the data in tabular form or in a CSV file with the header Date, Time, V_1, V_2, V_3, etc., and the subsequent rows holding all the values obtained every 5 s.
Edit: you can achieve the same result without regex as follows. Note that we assume the file format is the same every time, so we expect the date and time at the beginning of the file.
# reading data from a file, for example log.txt
with open('log.txt', 'r') as f:
    data = f.read()

data = data.split()
v_readings = dict()
v_readings['date'] = data.pop(0)
v_readings['time'] = data.pop(0)

i = 0
while i < len(data):
    v_readings[data[i]] = data[i + 1]
    i += 2
Exporting to a CSV file:
csv = ''
csv += ','.join(v_readings.keys())
csv += '\n'
csv += ','.join(v_readings.values())
print(csv)

with open('out.csv', 'w') as f:
    f.write(csv)
output:
date,time,V_1,V_2,V_3,V_4,V_5,V_6,V_7,V_8,V_9,V_10,V_11,V_12,V_13,V_14,V_15,V_16,V_17,V_18,V_19,V_20,V_21,V_22,V_23,V_24,V_25
2013-04-27,00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
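The snippet above only captures the first record in the file. A minimal sketch of the same token-stream idea extended to every record, assuming each record begins with a token shaped like YYYY-MM-DD (the sample text here is a hypothetical shortened stand-in for the real file):

```python
import csv
import io

def parse_records(text):
    """Split the whitespace token stream into one dict per timestamp."""
    tokens = text.split()
    records = []
    current = None
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if len(tok) == 10 and tok[4] == '-' and tok[7] == '-':
            # a date-shaped token starts a new record; the next token is the time
            current = {'date': tok, 'time': tokens[i + 1]}
            records.append(current)
        else:
            # otherwise tokens come in "V_n value" pairs
            current[tok] = tokens[i + 1]
        i += 2
    return records

sample = """2013-04-27 00:00:05.011
V_1 100 V_2 26695
2013-04-27 00:00:10.163
V_1 100 V_2 26695"""

records = parse_records(sample)
header = list(records[0].keys())

# csv.DictWriter lines the values up under the header automatically
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=header)
writer.writeheader()
writer.writerows(records)
print(out.getvalue())
```

Swapping the `io.StringIO` buffer for `open('out.csv', 'w', newline='')` writes the same rows straight to disk.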
With regex: this is how you extract these data into variables and a dictionary using regex in Python.
This is a starting point, and you can do whatever you like with them afterwards.
import re
string = """
2013-04-27 00:00:05.011 V_1 100 V_2 26695 V_3 33197 V_4 c681 V_5 29532 V_6 4600 V_7 4606 V_8 4f55 V_9 5a V_10 8063 V_11 4300 V_12 4700 V_13 4504 V_14 4400 V_15 4202 V_16 255 V_17 4300 V_18 91 V_19 6f V_20 300 V_21 14784 V_22 5.085 V_23 7.840 V_24 -8.061 V_25 36.961
"""
# extract date
match = re.search(r'\d{4}-\d\d-\d\d', string)
my_date = match.group()
# extract time
match = re.search(r'\d\d:\d\d:\d\d\.\d+', string)
my_time = match.group()
# getting V's into a dictionary
# \S+ (rather than \d+) also matches hex values like c681 and
# negative decimals like -8.061 without truncating them
match = re.findall(r'V_\d+ \S+', string)
v_readings = dict()
for item in match:
    k, v = item.split()
    v_readings[k] = v

# print output
print(my_date)
print(my_time)
print(v_readings)
output:
2013-04-27
00:00:05.011
{'V_1': '100', 'V_2': '26695', 'V_3': '33197', 'V_4': 'c681', 'V_5': '29532', 'V_6': '4600', 'V_7': '4606', 'V_8': '4f55', 'V_9': '5a', 'V_10': '8063', 'V_11': '4300', 'V_12': '4700', 'V_13': '4504', 'V_14': '4400', 'V_15': '4202', 'V_16': '255', 'V_17': '4300', 'V_18': '91', 'V_19': '6f', 'V_20': '300', 'V_21': '14784', 'V_22': '5.085', 'V_23': '7.840', 'V_24': '-8.061', 'V_25': '36.961'}
You can start by reading the tokens one at a time from the file:
with open('2013_04_17.txt') as infile:
    for line in infile:
        for token in line.split():
            print(token)
After that you just need to create a state machine to remember which section you're in, and process each section when you find its end:
def process_record(timestamp, values):
    """print CSV format"""
    print(','.join([timestamp] + values))

with open('t.txt') as infile:
    timestamp = None
    values = []
    for line in infile:
        line = line.strip()
        if timestamp is None:
            timestamp = line
        elif not line:  # blank line is separator
            process_record(timestamp, values)
            timestamp = None
            values = []
        else:
            values.extend(line.split()[1::2])
    if timestamp is not None:  # process last record, no separator after it
        process_record(timestamp, values)
That gives you CSV output:
2013-04-27 00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
2013-04-27 00:00:10.163,100,26695,33199,c681,29872,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14790,5.085,7.840,-8.061,37.961
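If you'd rather write a file than print, Python's csv module handles quoting and escaping for you instead of a bare ','.join(). A minimal sketch, where the rows are hypothetical stand-ins for the parsed records:

```python
import csv

# hypothetical parsed rows: timestamp followed by the V values
rows = [
    ['2013-04-27 00:00:05.011', '100', '26695'],
    ['2013-04-27 00:00:10.163', '100', '26695'],
]
header = ['timestamp', 'V_1', 'V_2']

# newline='' stops csv from emitting blank lines on Windows
with open('records.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```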
There's a much easier way. Assuming this data appears in columns in the .txt file (i.e. the data is in a fixed-width format), you can use the pandas function pandas.read_fwf() and pass in tuples containing the extents of the fixed-width fields of each line.
import pandas
colspecs = [(0,10), (11, 23), (28,31), (37, 42), (48, 54), (59, 63), (70, 75), ...]
data = pandas.read_fwf(TXT_PATH, colspecs = colspecs, header=None)
data.columns = ['date', 'time', 'V_1', 'V_2', 'V_3', 'V_4', 'V_5', ...]
print(data)
date time V_1 V_2 V_3 V_4 V_5
0 2013-04-27 00:00:05.011 100 26695 33197 c681 29532
1 2013-04-27 00:00:10.163 100 26695 33199 c681 29872
And from there, you can save that formatted data to file with the command
data.to_csv('filename.csv', index=False)
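Once the data is in a DataFrame like this, the performance plots the question asks for are close to a one-liner. A sketch, using a small hypothetical frame in place of the parsed file:

```python
import pandas as pd

# hypothetical small frame standing in for the parsed file
data = pd.DataFrame({
    'date': ['2013-04-27', '2013-04-27'],
    'time': ['00:00:05.011', '00:00:10.163'],
    'V_22': [5.085, 5.085],
    'V_25': [36.961, 37.961],
})

# combine the two text columns into a real datetime index,
# so plots get a proper time axis
data['timestamp'] = pd.to_datetime(data['date'] + ' ' + data['time'])
data = data.set_index('timestamp')

# data[['V_22', 'V_25']].plot()  # uncomment with matplotlib installed
print(data[['V_22', 'V_25']].describe())
```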
In R (and this would be very specific to your case), you can try tossing all the .txt files into a new folder, for example called date_data. Assuming all the files are in the same format, try running this.
library(purrr)
library(tidyverse)

setwd("./date_data")

odd_file_reader <- function(x){
  as.data.frame(matrix(scan(x, what = "character", sep = NULL), ncol = 52, byrow = TRUE)[, -seq(3, 51, 2)])
}

binded_data <- tibble(filenames = list.files()) %>%
  mutate(yearly_sat = map(filenames, odd_file_reader)) %>%
  unnest()
Try my simple code; I used pandas:
import numpy as np
import pandas as pd

with open('2013_04_17.txt', 'r') as f:
    large_list = [word for line in f for word in line.split() if 'V_' not in word]
print(large_list)

# 27 columns per record: date, time, and V_1 .. V_25
col_titles = ['date', 'time'] + ['V_%d' % i for i in range(1, 26)]
data = np.array(large_list).reshape((len(large_list) // 27, 27))
pd.DataFrame(data, columns=col_titles).to_csv("output3.csv", index=False)