
Extracting text data into a meaningful table for analysis using Python (or R)

I'm working on an engineering project in which I'm using machine performance data from archives. The machine produces one data set approximately every 5 s, and this data is then archived date-wise in a number of .txt files, each containing data in the following format. The data shown below is from the 2013_04_17.txt file, which holds all the performance data for that particular date.

2013-04-27 00:00:05.011
V_1 100  V_2 26695  V_3 33197  V_4 c681  V_5  29532
V_6 4600  V_7 4606  V_8 4f55  V_9 5a  V_10  8063  V_11  4300  V_12  4700
V_13 4504  V_14 4400  V_15 4202  V_16 255  V_17  4300  V_18  91  V_19  6f
V_20 300  V_21 14784 
V_22 5.085  V_23 7.840  V_24 -8.061  V_25 36.961

2013-04-27 00:00:10.163
V_1 100  V_2 26695  V_3 33199  V_4 c681  V_5  29872
V_6 4600  V_7 4606  V_8 4f55  V_9 5a  V_10  8063  V_11  4300  V_12  4700
V_13 4504  V_14 4400  V_15 4202  V_16 255  V_17  4300  V_18  91  V_19  6f
V_20 300  V_21 14790 
V_22 5.085  V_23 7.840  V_24 -8.061  V_25 37.961

..........

I need to view this data in a tabular format or as a CSV so that I can produce performance plots and detect any anomalies. However, I do not have enough experience with Python to parse this text file.

I've looked into pandas and regular expressions for ideas but have failed to achieve the desired result. I'm hoping to get the data into tabular form or a CSV file, with the header as the variables Date, Time, V_1, V_2, V_3, etc., and the subsequent rows holding all the values obtained every 5 s.

Edit: you can achieve the same result without regex, as follows. Note that we assume the file format is always the same, so we expect the date and time at the beginning of the file.

# reading data from a file, for example log.txt
with open('log.txt', 'r') as f:
    data = f.read()

# split the whole file into whitespace-separated tokens
data = data.split()
v_readings = dict()
v_readings['date'] = data.pop(0)
v_readings['time'] = data.pop(0)

# remaining tokens alternate between variable name and value
# (note: this pairing assumes a single record; see the sketch
# after the sample output below for handling the whole file)
i = 0
while i < len(data):
    v_readings[data[i]] = data[i + 1]
    i += 2

Exporting to a CSV file:

# build the CSV text: header row, then one row of values
csv = ''
csv += ','.join(v_readings.keys())
csv += '\n'
csv += ','.join(v_readings.values())

print(csv)
with open('out.csv', 'w') as f:
    f.write(csv)

output:

date,time,V_1,V_2,V_3,V_4,V_5,V_6,V_7,V_8,V_9,V_10,V_11,V_12,V_13,V_14,V_15,V_16,V_17,V_18,V_19,V_20,V_21,V_22,V_23,V_24,V_25
2013-04-27,00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
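
The snippet above only consumes the first timestamped block. A minimal sketch extending the same name/value pairing to every record, assuming (as in the sample data) that records are separated by blank lines:

# Sketch: apply the same pairing to every blank-line-separated record.
with open('log.txt', 'r') as f:
    blocks = f.read().strip().split('\n\n')

rows = []
for block in blocks:
    tokens = block.split()
    reading = {'date': tokens[0], 'time': tokens[1]}
    # remaining tokens alternate: variable name, value, name, value, ...
    for name, value in zip(tokens[2::2], tokens[3::2]):
        reading[name] = value
    rows.append(reading)

with open('out.csv', 'w') as f:
    f.write(','.join(rows[0].keys()) + '\n')
    for row in rows:
        f.write(','.join(row.values()) + '\n')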

With regex: this is how you extract the same data into variables and a dictionary using regex in Python.

This is a starting point; you can then do whatever you like with the values afterwards.

import re 

string = """
2013-04-27 00:00:05.011 V_1 100 V_2 26695 V_3 33197 V_4 c681 V_5 29532 V_6 4600 V_7 4606 V_8 4f55 V_9 5a V_10 8063 V_11 4300 V_12 4700 V_13 4504 V_14 4400 V_15 4202 V_16 255 V_17 4300 V_18 91 V_19 6f V_20 300 V_21 14784 V_22 5.085 V_23 7.840 V_24 -8.061 V_25 36.961
"""
# extract date 
match = re.search(r'\d{4}-\d\d-\d\d', string)
my_date = match.group()

# extract time
match = re.search(r'\d\d:\d\d:\d\d\.\d+', string)
my_time = match.group()

# getting V's into a dictionary; \S+ also matches hex values
# like c681 and decimals like -8.061
match = re.findall(r'(V_\d+)\s+(\S+)', string)
v_readings = dict()
for k, v in match:
    v_readings[k] = v

# print output
print(my_date)
print(my_time)
print(v_readings)

output:

2013-04-27
00:00:05.011
{'V_1': '100', 'V_2': '26695', 'V_3': '33197', 'V_4': 'c681', 'V_5': '29532', 'V_6': '4600', 'V_7': '4606', 'V_8': '4f55', 'V_9': '5a', 'V_10': '8063', 'V_11': '4300', 'V_12': '4700', 'V_13': '4504', 'V_14': '4400', 'V_15': '4202', 'V_16': '255', 'V_17': '4300', 'V_18': '91', 'V_19': '6f', 'V_20': '300', 'V_21': '14784', 'V_22': '5.085', 'V_23': '7.840', 'V_24': '-8.061', 'V_25': '36.961'}
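
The same searches can then be applied record by record across the whole file. A sketch, again assuming records are separated by blank lines as in the sample data:

import re

rows = []
with open('2013_04_17.txt') as f:
    for block in f.read().strip().split('\n\n'):
        date = re.search(r'\d{4}-\d\d-\d\d', block).group()
        time = re.search(r'\d\d:\d\d:\d\d\.\d+', block).group()
        # (name, value) pairs; \S+ also matches hex and decimal values
        readings = dict(re.findall(r'(V_\d+)\s+(\S+)', block))
        rows.append({'date': date, 'time': time, **readings})

print(rows[0]['V_1'])  # '100'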

You can start by reading the tokens one at a time from the file:

with open('2013_04_17.txt') as infile:
    for line in infile:
        for token in line.split():
            print(token)

After that you just need to create a state machine to remember which section you're in, and process each section when you find its end:

def process_record(timestamp, values):
    """print CSV format"""
    print(','.join([timestamp] + values))

with open('2013_04_17.txt') as infile:
    timestamp = None
    values = []
    for line in infile:
        line = line.strip()
        if timestamp is None:
            timestamp = line
        elif not line: # blank line is separator
            process_record(timestamp, values)
            timestamp = None
            values = []
        else:
            values.extend(line.split()[1::2])
    if timestamp is not None: # process last record, no separator after it
        process_record(timestamp, values)

That gives you CSV output:

2013-04-27 00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
2013-04-27 00:00:10.163,100,26695,33199,c681,29872,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14790,5.085,7.840,-8.061,37.961
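
If you want a header row and a file on disk rather than printed lines, the same state machine can feed Python's csv module directly. A minimal sketch, assuming 25 variables per record as in the sample:

import csv

with open('2013_04_17.txt') as infile, open('out.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    # header: timestamp plus the 25 variable names
    writer.writerow(['timestamp'] + ['V_%d' % i for i in range(1, 26)])
    timestamp, values = None, []
    for line in infile:
        line = line.strip()
        if timestamp is None:
            timestamp = line
        elif not line:  # blank line ends a record
            writer.writerow([timestamp] + values)
            timestamp, values = None, []
        else:
            values.extend(line.split()[1::2])
    if timestamp is not None:  # last record has no trailing separator
        writer.writerow([timestamp] + values)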

There's a much easier way. Assuming this data appears in columns in the .txt file (i.e. the data is in a fixed-width format), you can use the pandas function pandas.read_fwf() and pass in tuples containing the extents of the fixed-width fields of each line.

import pandas

# complete the colspecs and column names through V_25 for your file
colspecs = [(0, 10), (11, 23), (28, 31), (37, 42), (48, 54), (59, 63), (70, 75), ...]
data = pandas.read_fwf(TXT_PATH, colspecs=colspecs, header=None)
data.columns = ['date', 'time', 'V_1', 'V_2', 'V_3', 'V_4', 'V_5', ...]
print(data)

         date          time  V_1    V_2    V_3   V_4    V_5
0  2013-04-27  00:00:05.011  100  26695  33197  c681  29532
1  2013-04-27  00:00:10.163  100  26695  33199  c681  29872

And from there, you can save that formatted data to file with the command

data.to_csv('filename.csv', index=False)
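
Since the end goal is performance plots, a small follow-up sketch (assuming the column names assigned above, and that matplotlib is available) that turns the date and time columns into a proper datetime index:

# combine the two columns into a single datetime index for plotting
data['timestamp'] = pandas.to_datetime(data['date'] + ' ' + data['time'])
data = data.set_index('timestamp')
data['V_2'].plot()  # e.g. one variable over time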

In R (and this would be very specific to your case), you can try tossing all the .txt files into a new folder, for example called date_data. Assuming all the files are in this same format, try running this.

library(purrr)
library(tidyverse)

setwd("./date_data")

# each record is 52 tokens: date, time, then 25 name/value pairs;
# -seq(3, 51, 2) drops the V_n name columns, keeping only the values
odd_file_reader <- function(x){
  as.data.frame(matrix(scan(x, what = "character", sep = NULL), ncol = 52, byrow = TRUE)[, -seq(3, 51, 2)])
}

binded_data <- tibble(filenames = list.files()) %>%
  mutate(yearly_sat = map(filenames, odd_file_reader)) %>%
  unnest()

Try my simple code; I used pandas:

import numpy as np
import pandas as pd

with open('2013_04_17.txt', 'r') as f:
    # keep only dates, times and values, dropping the V_n name tokens
    large_list = [word for line in f for word in line.split() if 'V_' not in word]
    print(large_list)
    # 27 columns per record: date, time, V_1 ... V_25
    col_titles = ['date', 'time'] + ['V_%d' % i for i in range(1, 26)]
    data = np.array(large_list).reshape((len(large_list) // 27, 27))
    pd.DataFrame(data, columns=col_titles).to_csv("output3.csv", index=False)
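
One sanity check worth adding before the reshape (my addition, not part of the original answer): the token count must be an exact multiple of 27, otherwise a record is incomplete and the columns would shift.

# guard against truncated records before reshaping
assert len(large_list) % 27 == 0, 'incomplete record in input file'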
