
Extracting text data into a meaningful table for analysis using Python (or R)

I'm working on an engineering project that uses machine performance data from archives. The machine produces one data set approximately every 5 s, and the data is stored by date in a number of .txt files, each containing data in the following format. The data shown below is from the 2013_04_17.txt file, which holds all the performance data for that date.

2013-04-27 00:00:05.011
V_1 100  V_2 26695  V_3 33197  V_4 c681  V_5  29532
V_6 4600  V_7 4606  V_8 4f55  V_9 5a  V_10  8063  V_11  4300  V_12  4700
V_13 4504  V_14 4400  V_15 4202  V_16 255  V_17  4300  V_18  91  V_19  6f
V_20 300  V_21 14784 
V_22 5.085  V_23 7.840  V_24 -8.061  V_25 36.961

2013-04-27 00:00:10.163
V_1 100  V_2 26695  V_3 33199  V_4 c681  V_5  29872
V_6 4600  V_7 4606  V_8 4f55  V_9 5a  V_10  8063  V_11  4300  V_12  4700
V_13 4504  V_14 4400  V_15 4202  V_16 255  V_17  4300  V_18  91  V_19  6f
V_20 300  V_21 14790 
V_22 5.085  V_23 7.840  V_24 -8.061  V_25 37.961

.......... ..........

I need this data in tabular form or as a CSV so that I can produce performance plots and detect anomalies. However, I don't have enough Python programming experience to parse this text file.

I've looked into pandas and regular expressions for ideas but haven't achieved the desired result. I'm hoping for a table or CSV file whose header is the variables Date, Time, V_1, V_2, V_3, etc., with each subsequent row holding the values recorded every 5 s.

Edit: you can achieve the same result without regex as follows. Note that we assume the file format is always the same, so we expect the date and time at the beginning of the file.

# reading data from a file for example log.txt
with open('log.txt', 'r') as f:
    data = f.read()

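# split on any whitespace: date, time, then alternating name/value tokens
# note: this handles one record; a second timestamp would be read as a name/value pair
data = data.split()
v_readings = dict()
v_readings['date'] = data.pop(0)
v_readings['time'] = data.pop(0)

# walk the remaining tokens two at a time: name, value
i = 0
while i < len(data):
    v_readings[data[i]] = data[i+1]
    i += 2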

Exporting to a CSV file:

# header row, then one row of values (no leading blank line)
csv = ','.join(v_readings.keys())
csv += '\n'
csv += ','.join(v_readings.values())
csv += '\n'

print(csv)
with open('out.csv', 'w') as f:
    f.write(csv)

Output:

date,time,V_1,V_2,V_3,V_4,V_5,V_6,V_7,V_8,V_9,V_10,V_11,V_12,V_13,V_14,V_15,V_16,V_17,V_18,V_19,V_20,V_21,V_22,V_23,V_24,V_25
2013-04-27,00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
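The snippet above reads a single record. As a rough sketch, the same token-pair idea can be extended to a whole file, assuming records are separated by blank lines and every record starts with a date and a time. The function name `parse_records` and the shortened sample text are made up for illustration.

```python
import re


def parse_records(text):
    """Return one dict per blank-line-separated record."""
    records = []
    # split on blank lines (tolerating stray spaces on the blank line)
    for block in re.split(r'\n\s*\n', text.strip()):
        tokens = block.split()
        if len(tokens) < 2:
            continue  # nothing usable in this block
        record = {'date': tokens[0], 'time': tokens[1]}
        # remaining tokens alternate: name, value, name, value, ...
        record.update(zip(tokens[2::2], tokens[3::2]))
        records.append(record)
    return records


# shortened sample in the same layout as the question (assumption: real
# records carry V_1 .. V_25)
sample = """2013-04-27 00:00:05.011
V_1 100  V_2 26695
V_22 5.085  V_24 -8.061

2013-04-27 00:00:10.163
V_1 100  V_2 26695
V_22 5.085  V_24 -8.061
"""

records = parse_records(sample)
print(records[0])
```

Each dict keeps the values as strings, so entries such as c681 or 5a pass through unchanged.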

With regex: this is how you can extract the data into variables and a dictionary in Python using regular expressions.

This is a starting point; you can do whatever you like with the values afterwards.

import re 

string = """
2013-04-27 00:00:05.011 V_1 100 V_2 26695 V_3 33197 V_4 c681 V_5 29532 V_6 4600 V_7 4606 V_8 4f55 V_9 5a V_10 8063 V_11 4300 V_12 4700 V_13 4504 V_14 4400 V_15 4202 V_16 255 V_17 4300 V_18 91 V_19 6f V_20 300 V_21 14784 V_22 5.085 V_23 7.840 V_24 -8.061 V_25 36.961
"""
# extract date 
match = re.search(r'\d{4}-\d\d-\d\d', string)
my_date = match.group()

# extract time
match = re.search(r'\d\d:\d\d:\d\d\.\d+', string)
my_time = match.group()

# getting V's into a dictionary
# \S+ captures the whole value, including hex-like (c681), decimal (5.085)
# and negative (-8.061) readings, which \d+ would truncate or miss
match = re.findall(r'(V_\d+)\s+(\S+)', string)
v_readings = dict()
for k, v in match:
    v_readings[k] = v

# print output
print(my_date)
print(my_time)
print(v_readings)

Output:

2013-04-27
00:00:05.011
{'V_1': '100', 'V_2': '26695', 'V_3': '33197', 'V_4': 'c681', 'V_5': '29532', 'V_6': '4600', 'V_7': '4606', 'V_8': '4f55', 'V_9': '5a', 'V_10': '8063', 'V_11': '4300', 'V_12': '4700', 'V_13': '4504', 'V_14': '4400', 'V_15': '4202', 'V_16': '255', 'V_17': '4300', 'V_18': '91', 'V_19': '6f', 'V_20': '300', 'V_21': '14784', 'V_22': '5.085', 'V_23': '7.840', 'V_24': '-8.061', 'V_25': '36.961'}
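The regex snippet above works on one record held in a string. One possible way to apply the same idea to a whole file is to use the timestamps as record boundaries, as in the sketch below. The names `TIMESTAMP`, `READING` and `extract_records` are hypothetical, and the sample text is a shortened stand-in for a real file.

```python
import re

# a date, whitespace, then a time with fractional seconds
TIMESTAMP = re.compile(r'(\d{4}-\d\d-\d\d)\s+(\d\d:\d\d:\d\d\.\d+)')
# a V_n name followed by any non-whitespace value
READING = re.compile(r'(V_\d+)\s+(\S+)')


def extract_records(text):
    """Return one dict per timestamped record found in text."""
    stamps = list(TIMESTAMP.finditer(text))
    records = []
    for i, stamp in enumerate(stamps):
        # a record's readings run from the end of its timestamp to the
        # start of the next timestamp (or to the end of the text)
        end = stamps[i + 1].start() if i + 1 < len(stamps) else len(text)
        body = text[stamp.end():end]
        record = {'date': stamp.group(1), 'time': stamp.group(2)}
        record.update(READING.findall(body))  # list of (name, value) pairs
        records.append(record)
    return records


sample_text = (
    "2013-04-27 00:00:05.011\n"
    "V_1 100  V_4 c681\n"
    "V_24 -8.061\n"
    "\n"
    "2013-04-27 00:00:10.163\n"
    "V_1 100  V_4 c681\n"
    "V_24 -8.061\n"
)

found = extract_records(sample_text)
print(found)
```

Because the boundaries come from the timestamps rather than from blank lines, this variant does not depend on how the readings are wrapped across lines.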

You can start by reading the tokens one at a time from the file:

with open('2013_04_17.txt') as infile:
    for line in infile:
        for token in line.split():
            print(token)

After that you just need to create a state machine to remember which section you're in, and process each section when you find its end:

def process_record(timestamp, values):
    """Print one record as a CSV line."""
    print(','.join([timestamp] + values))

with open('t.txt') as infile:
    timestamp = None
    values = []
    for line in infile:
        line = line.strip()
        if not line:  # blank line is the record separator
            if timestamp is not None:  # ignore leading or repeated blank lines
                process_record(timestamp, values)
            timestamp = None
            values = []
        elif timestamp is None:
            timestamp = line  # first non-blank line of a record
        else:
            values.extend(line.split()[1::2])  # keep values, drop the V_n names
    if timestamp is not None:  # process last record, no separator after it
        process_record(timestamp, values)

That gives you CSV output:

2013-04-27 00:00:05.011,100,26695,33197,c681,29532,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14784,5.085,7.840,-8.061,36.961
2013-04-27 00:00:10.163,100,26695,33199,c681,29872,4600,4606,4f55,5a,8063,4300,4700,4504,4400,4202,255,4300,91,6f,300,14790,5.085,7.840,-8.061,37.961
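The output above has no header row and keeps the date and time in one field, while the question asks for Date, Time, V_1, V_2, ... columns. A minimal sketch of one way to get there with the standard csv module follows. The helper name `write_table` is hypothetical, and it assumes the readings are named V_1..V_n in file order.

```python
import csv
import io


def write_table(rows, out):
    """Write (timestamp, values) pairs as CSV with a Date,Time,V_1.. header."""
    writer = csv.writer(out)
    header_written = False
    for timestamp, values in rows:
        if not header_written:
            # assumption: readings are named V_1..V_n in file order
            names = ['V_%d' % i for i in range(1, len(values) + 1)]
            writer.writerow(['Date', 'Time'] + names)
            header_written = True
        date, time = timestamp.split()  # "2013-04-27 00:00:05.011" -> two fields
        writer.writerow([date, time] + values)


# shortened rows in the shape the loop above collects them
rows = [
    ('2013-04-27 00:00:05.011', ['100', 'c681', '-8.061']),
    ('2013-04-27 00:00:10.163', ['100', 'c681', '-8.061']),
]

buffer = io.StringIO()  # stands in for open('out.csv', 'w', newline='')
write_table(rows, buffer)
print(buffer.getvalue())
```

In the state machine, the records could be collected into a list (or yielded from a generator) instead of printed, then passed to a function like this one.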

There's a much easier way. Assuming each record sits on a single line of the .txt file with its fields at fixed character positions (i.e. the data is in a fixed-width format), you can use the pandas function pandas.read_fwf() and pass in a list of tuples giving the extents of the fixed-width fields of each line.

import pandas

colspecs = [(0,10), (11, 23), (28,31), (37, 42), (48, 54), (59, 63), (70, 75), ...]
data = pandas.read_fwf(TXT_PATH, colspecs = colspecs, header=None)
data.columns = ['date', 'time', 'V_1', 'V_2', 'V_3', 'V_4', 'V_5', ...]
print(data)

         date          time  V_1    V_2    V_3   V_4    V_5
0  2013-04-27  00:00:05.011  100  26695  33197  c681  29532
1  2013-04-27  00:00:10.163  100  26695  33199  c681  29872

From there, you can save the formatted data to a file with the command

data.to_csv('filename.csv', index=False)

In R (and this is very specific to your case), you can try putting all the .txt files into a new folder, for example date_data. Assuming all the files are in the same format, try running this.

library(purrr)
library(tidyverse)

setwd("./date_data")
odd_file_reader <- function(x){
  as.data.frame(matrix(scan(x, what="character", sep=NULL), ncol = 52, byrow = TRUE)[,-seq(3,51,2)])
}

binded_data <- tibble(filenames = list.files()) %>%
  mutate(yearly_sat = map(filenames, odd_file_reader)) %>%
  unnest(cols = yearly_sat)
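For readers staying in Python, a rough equivalent of the folder-based approach might look like the sketch below. It mirrors the R code's assumption that every record has a fixed number of whitespace-separated tokens (52 in the question's data: date, time and 25 name/value pairs). The function names, the output file name and the tiny demo files are invented for illustration.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path, tokens_per_record=52):
    """Yield [date, time, value, value, ...] for each record in one file."""
    tokens = Path(path).read_text().split()
    for start in range(0, len(tokens), tokens_per_record):
        chunk = tokens[start:start + tokens_per_record]
        # chunk[0:2] is date and time; values sit at odd positions from 3 on
        yield chunk[:2] + chunk[3::2]


def combine(folder, out_path, tokens_per_record=52):
    """Write the records of every .txt file in folder into one CSV file."""
    n_values = (tokens_per_record - 2) // 2
    header = ['Date', 'Time'] + ['V_%d' % i for i in range(1, n_values + 1)]
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        for path in sorted(Path(folder).glob('*.txt')):
            writer.writerows(read_rows(path, tokens_per_record))


# demo with two tiny made-up files (2 readings per record, so 6 tokens)
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, '2013_04_17.txt').write_text('2013-04-17 00:00:05.011\nV_1 100  V_2 c681\n')
    Path(tmp, '2013_04_18.txt').write_text('2013-04-18 00:00:05.020\nV_1 101  V_2 c682\n')
    out_file = Path(tmp, 'combined.csv')
    combine(tmp, out_file, tokens_per_record=6)
    combined = out_file.read_text()

print(combined)
```

Sorting the file names keeps the rows in date order as long as the files follow the YYYY_MM_DD.txt naming shown in the question.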

Try my simple code; I used pandas.

import numpy as np
import pandas as pd

with open('2013_04_17.txt', 'r') as f:
    # keep every token except the V_n names: date, time and 25 values per record
    large_list = [word for line in f for word in line.split() if 'V_' not in word]
    print(large_list)
    # one title per column: date, time, V_1 .. V_25 (27 in total)
    col_titles = ['date', 'time'] + ['V_%d' % i for i in range(1, 26)]
    # pd.np is deprecated and not available in pandas 2.x, so use numpy directly
    data = np.array(large_list).reshape((len(large_list) // 27, 27))
    pd.DataFrame(data, columns=col_titles).to_csv("output3.csv", index=False)
