Extracting data from .txt and writing to .txt with Python

I am trying to figure out how to code the following problem using Python. Suppose we have the following data set in a .txt file:

datatype1 designator1 3:45:14AM
datatype1 designator1 3:45:19AM
datatype1 designator1 3:45:26AM
datatype1 designator1 3:45:31AM
datatype1 designator1 4:10:05AM
datatype1 designator1 4:10:21AM
datatype1 designator1 4:10:30AM
datatype1 designator1 4:10:46AM

Note the time break. I need my code to read through the text file and, where there is a break in the time intervals, split the file up and write the following to another text file:

datatype1 designator1 3:45:14AM 3:45:31AM
datatype1 designator1 4:10:05AM 4:10:46AM

In other words, I want to condense the original data to individual "sessions" represented by single lines with start and end times.

Thanks for your help!

Perform the following steps:

  • Parse each line and extract the time.
  • Convert each time to a date/time structure.
  • Compare it with the previous date/time structure (if any).
  • If the difference is bigger than some predefined value, close the current session and start a new one.
  • Write one line per session, with its start and end times.
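
The steps above can be sketched in plain Python, emitting one line per session as the question asks. This is a minimal sketch, not a definitive implementation: the 5-minute threshold (MAX_GAP) and the function name condense_sessions are assumptions chosen for illustration, and the input is assumed to be in chronological order with the time as the last field on each line.

```python
from datetime import datetime, timedelta

# Assumed threshold: a gap longer than this starts a new session.
MAX_GAP = timedelta(minutes=5)


def condense_sessions(lines, max_gap=MAX_GAP):
    """Collapse consecutive lines into 'prefix start end' session lines."""
    sessions = []
    prefix = start = end = prev = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *head, stamp = fields
        # 12-hour clock with AM/PM suffix, e.g. 3:45:14AM
        current = datetime.strptime(stamp, "%I:%M:%S%p")
        # Start a new session on the first row or after a long gap
        if prev is None or current - prev > max_gap:
            if prev is not None:
                sessions.append(" ".join(prefix + [start, end]))
            prefix, start = head, stamp
        end, prev = stamp, current
    # Flush the last open session
    if prev is not None:
        sessions.append(" ".join(prefix + [start, end]))
    return sessions


sample = """\
datatype1 designator1 3:45:14AM
datatype1 designator1 3:45:19AM
datatype1 designator1 3:45:26AM
datatype1 designator1 3:45:31AM
datatype1 designator1 4:10:05AM
datatype1 designator1 4:10:21AM
datatype1 designator1 4:10:30AM
datatype1 designator1 4:10:46AM
"""

for row in condense_sessions(sample.splitlines()):
    print(row)
```

For a real file, pass the open file object instead of sample.splitlines() and write the returned lines to a separate output file. The times carry no date, so a session that runs past midnight would need extra handling.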

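You can use itertools.groupby, keyed on the hour and minute of each timestamp. Note that this key only works when each session stays within a single clock minute, as in the sample data: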

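import itertools
# Split each non-empty line into whitespace-separated fields
with open('filename.txt') as f:
    file_data = [line.split() for line in f if line.strip()]
# Group consecutive rows sharing the same hour:minute prefix, e.g. '3:45'
final_data = [list(b) for _, b in itertools.groupby(file_data, key=lambda x: ':'.join(x[-1].split(':')[:2]))]
# Keep the leading fields plus the first and last timestamp of each group
new_final_data = [' '.join(b[0][:-1] + [b[0][-1], b[-1][-1]]) for b in final_data]
print(new_final_data)
# Write to a separate file ('output.txt' is a placeholder name) instead of appending to the input
with open('output.txt', 'w') as f:
    f.write('\n'.join(new_final_data) + '\n')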

Output:

['datatype1 designator1 3:45:14AM 3:45:31AM', 'datatype1 designator1 4:10:05AM 4:10:46AM']

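With pandas, the task becomes more readable. The version below buckets rows by hour and 15-minute window, which approximates sessions rather than detecting the gaps themselves: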

import pandas as pd
import io

data = '''\
datatype1 designator1 3:30:14AM
datatype1 designator1 3:30:18AM
datatype1 designator1 3:45:14AM
datatype1 designator1 3:45:19AM
datatype1 designator1 3:45:26AM
datatype1 designator1 3:45:31AM
datatype1 designator1 4:10:05AM
datatype1 designator1 4:10:21AM
datatype1 designator1 4:10:30AM
datatype1 designator1 4:10:46AM'''


# Recreate dataset
df = pd.read_csv(io.StringIO(data), sep=r'\s+', header=None)

# Use this instead of above for real file
#df = pd.read_csv('path/to/file', sep=r'\s+', header=None)

# Parse the times (12-hour clock with AM/PM) and sort chronologically
df[2] = pd.to_datetime(df[2], format='%I:%M:%S%p')
df = df.sort_values(2)

# Get first and last row of each (hour, 15-minute window) bucket
hour = df[2].dt.hour.rename('hour')
quarter = (df[2].dt.minute // 15).rename('quarter')
newdf = df.groupby([hour, quarter]).agg(['first', 'last'])

# Rename columns and drop duplicates
newdf.columns = list(range(len(newdf.columns)))
newdf = newdf.drop(columns=[1, 3])

# Format time as e.g. 3:45:14AM (%I is the 12-hour clock; strip its leading zero portably)
newdf[[4, 5]] = newdf[[4, 5]].apply(lambda x: x.dt.strftime('%I:%M:%S%p').str.lstrip('0'))

# Output
newdf.to_csv('output.csv', index=False, header=False, sep=' ')

output.csv:

datatype1 designator1 3:30:14AM 3:30:18AM
datatype1 designator1 3:45:14AM 3:45:31AM
datatype1 designator1 4:10:05AM 4:10:46AM
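
Fixed 15-minute buckets can split a session that straddles a quarter-hour boundary, or merge two sessions that fall in the same window. A gap-based variant is sketched below using Series.diff and cumsum. It is only a sketch under stated assumptions: the column names (datatype, designator, time) and the 5-minute threshold are chosen for illustration, and the rows are assumed to be in chronological order already.

```python
import io

import pandas as pd

data = """\
datatype1 designator1 3:45:14AM
datatype1 designator1 3:45:19AM
datatype1 designator1 3:45:26AM
datatype1 designator1 3:45:31AM
datatype1 designator1 4:10:05AM
datatype1 designator1 4:10:21AM
datatype1 designator1 4:10:30AM
datatype1 designator1 4:10:46AM
"""

# Column names are illustrative; the file itself has no header row
df = pd.read_csv(io.StringIO(data), sep=r"\s+", header=None,
                 names=["datatype", "designator", "time"])

# Parse the 12-hour timestamps without overwriting the original strings
stamps = pd.to_datetime(df["time"], format="%I:%M:%S%p")
max_gap = pd.Timedelta(minutes=5)  # assumed threshold

# True where the gap to the previous row exceeds max_gap;
# the running sum of those flags numbers the sessions 0, 1, 2, ...
session = (stamps.diff() > max_gap).cumsum().rename("session")

# First and last original time string per session
sessions = df.groupby(session).agg(
    datatype=("datatype", "first"),
    designator=("designator", "first"),
    start=("time", "first"),
    end=("time", "last"),
)

lines = [" ".join(row) for row in sessions.itertuples(index=False)]
print("\n".join(lines))
```

Because the start and end columns keep the original strings, no reformatting of the times is needed; sessions.to_csv('output.csv', index=False, header=False, sep=' ') would write the same rows to a file.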
