I'm trying to get my data from a .txt file into a dataframe. The text file looks like this:
29.10.2021 21:25:07 -- TestJob_2-34_AV1__29.10.2021 21:24:22 Start Job: TestJob_2-34_AV1.ldt;
29.10.2021 21:24:22 SUV;
Stoolname: Sky;
Times: 6;
Min/Max: Max;
Stool: Sky_Clean;
Stoolname: T_123;
Times: 6;
Min/Max: Max;
Stool: To150;
Stoolname: T_123-Clean;
Times: 0;
Min/Max: Max;
Stool: C_120um;
Stoolname: T_1234;
Times: 1;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T_1234-Clean;
Times: 1;
Min/Max: Max;
Stool: qi_mikk;
Stoolname: T_1234567;
Times: 1;
Min/Max: Max;
Stool: TCu;
Stoolname: T_1234567-Clean;
Times: 0;
Min/Max: Max;
Stool: ChumCu;
Stoolname: Thte;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T30-Clean;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
29.10.2021 21:24:22 C2;
Stoolname: T_1234;
Times: 9;
Min/Max: Max;
Stool: Testabc;
Number: 7;
Stoolname: T_1234-1;
Times: 4;
Min/Max: Max;
Stool: Testabcd;
Number: 7;
Stoolname: T_123;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;
29.10.2021 21:27:13 -- TestJob2_2-34_AV1__ Z-Value = 2.27;
S01DATE29.10.2021 21:26:13SCALEX 1,00102Y 1,00022;
Z-Value CleanProcess = 4,27;
S01DATE29.10.2021 21:26:51SCALEX 1,00102Y 1,00022;
29.10.2021 21:27:13End Job: e:\ActLaserProgram\TestJob2_2-34_AV1.ldt;
29.10.2021 21:25:07 -- TestJob_2-34_AV1__29.10.2021 21:24:22 Start Job: TestJob_2-34_AV1.ldt;
29.10.2021 21:24:22 SUV;
Stoolname: Sky;
Times: 6;
Min/Max: Max;
Stool: Sky_Clean;
Stoolname: T_123;
Times: 6;
Min/Max: Max;
Stool: To150;
Stoolname: T_123-Clean;
Times: 0;
Min/Max: Max;
Stool: C_120um;
Stoolname: T_1234;
Times: 1;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T_1234-Clean;
Times: 1;
Min/Max: Max;
Stool: qi_mikk;
Stoolname: T_1234567;
Times: 1;
Min/Max: Max;
Stool: TCu;
Stoolname: T_1234567-Clean;
Times: 0;
Min/Max: Max;
Stool: ChumCu;
Stoolname: Thte;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T30-Clean;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
29.10.2021 21:24:22 C2;
Stoolname: T_1234;
Times: 9;
Min/Max: Max;
Stool: Testabc;
Number: 7;
Stoolname: T_1234-1;
Times: 4;
Min/Max: Max;
Stool: Testabcd;
Number: 7;
Stoolname: T_123;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;
Stoolname: T_1234567;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;
29.10.2021 21:27:13 -- TestJob2_2-34_AV1__ Z-Value = 2.27;
S01DATE29.10.2021 21:26:13SCALEX 1,00102Y 1,00022;
Z-Value CleanProcess = 4,27;
S01DATE29.10.2021 21:26:51SCALEX 1,00102Y 1,00022;
29.10.2021 21:27:13End Job: e:\ActLaserProgram\TestJob2_2-34_AV1.ldt;
This is a shortened version of the file. I tried to read the txt line by line, but the file is long and it took too long! I'm stumped! The dataframe I'm after looks like this:
Job | Program | Type | Stoolname | times | Min/Max | Stool | Number |
---|---|---|---|---|---|---|---|
TestJob | TestJob_2-34 | SUV | Sky | 6 | Max | Sky_Clean | NaN |
TestJob | TestJob_2-34 | SUV | T_123 | 6 | Max | To150 | NaN |
........ | ............. | .... | ......... | ..... | ......... | .............. | ...... |
TestJob | TestJob_2-34 | C2 | T_1234 | 9 | Max | Testabc | 7 |
The Job and Program values are hidden in the lines that contain "Start Job".
Thank you!
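To illustrate what I mean, here is a minimal sketch of how I think those two values sit inside that line (assuming the file name always ends in `.ldt`):

```python
# Everything after " Start Job: " is "<Program>.ldt;", and the Job name
# is the part of the Program before the first underscore.
line = '29.10.2021 21:25:07 -- TestJob_2-34_AV1__29.10.2021 21:24:22 Start Job: TestJob_2-34_AV1.ldt;'
program = line.split(' Start Job: ')[1].rstrip(';').removesuffix('.ldt')
job = program.split('_')[0]
print(job, program)  # TestJob TestJob_2-34_AV1
```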
Here you go! It's really fast: for a file with ~11 million lines (made by copying and pasting your sample file over and over again), it took about 22 seconds on my machine and produced a dataframe with 2.2 million rows.
Note: I wasn't sure quite how to handle the Program column, because in your expected dataframe none of the values end with _AV1, but in your text file they do, and I wasn't sure what your rules are regarding that.
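If the rule is simply "drop the trailing _AV<number> suffix", a one-line regex would cover it; that rule is an assumption on my part, so I left it out of the main code below:

```python
import re

# Hypothetical rule: strip a trailing "_AV<digits>" from the program name.
program = re.sub(r'_AV\d+$', '', 'TestJob_2-34_AV1')
print(program)  # TestJob_2-34
```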
import pandas as pd
import json
import re
from numpy import nan
file = 'test.txt'
with open(file) as f:
    lines = f.readlines()
# To store the final data before feeding it to the dataframe
dct = {
    'Job': [],
    'Program': [],
    'Type': [],
    'Stoolname': [],
    'Times': [],
    'Min/Max': [],
    'Stool': [],
    'Number': [],
}
# To keep track of missing values
counts = {}
field_re = re.compile(r'^[a-z/]+:', re.IGNORECASE)
type_change_re = re.compile(r'^[\d\.: -]+\w+$', re.IGNORECASE)
# This will keep a list of names of keys that we've encountered since this item
# started. We need this because there is no delimiter between objects in the text file.
# (Using a dict like a list here, because dicts are much faster to search
# (their keys) than lists)
hit_fields = {}
# Use a dict like a list here (see above)
special_fields = {
    'Job': None,
    'Program': None,
    'Type': None,
}
last_type = ''
last_job = ''
last_program = ''
for line in lines:
    line = line.strip().strip(';')
    if field_re.search(line) is not None:
        # "Key: value" line; split only on the first ": " to be safe
        k, v = line.split(': ', 1)
        if k in hit_fields:
            # We've found a new item. Add all the accumulated fields to dct
            for field in hit_fields:
                dct[field].append(hit_fields[field])
            for field in dct:
                if field not in hit_fields and field not in special_fields:
                    dct[field].append(nan)
            hit_fields = {}
            dct['Job'].append(last_job)
            dct['Program'].append(last_program)
            dct['Type'].append(last_type)
        hit_fields[k] = v
    elif ' Start Job: ' in line:
        # Change Job and Program
        job = line.split(' Start Job: ')[1]
        if job.endswith('.ldt'):
            job = job[:-4]
        last_job = job.split('_')[0]
        last_program = job
    elif type_change_re.match(line) is not None:
        # Change Type
        last_type = line.split(' ')[-1]
# Finish (sorry for the duplicated code here, I couldn't figure out how to optimize it)
for field in hit_fields:
    dct[field].append(hit_fields[field])
for field in dct:
    if field not in hit_fields and field not in special_fields:
        dct[field].append(nan)
dct['Job'].append(last_job)
dct['Program'].append(last_program)
dct['Type'].append(last_type)
######################################
# Save it to a file:
with open('data.json', 'w') as f:
    json.dump(dct, f)
# Or load it into a dataframe
df = pd.DataFrame(dct)
print(df)
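On the duplicated code: the flush logic can be pulled into a small helper that is called both when a repeated key is seen and once more after the loop. A sketch with the same behaviour, just reorganised (`SPECIAL_FIELDS` is my name for the Job/Program/Type set):

```python
from numpy import nan

# Columns filled from the "last seen" state rather than from per-item fields.
SPECIAL_FIELDS = {'Job', 'Program', 'Type'}

def flush(hit_fields, dct, last_job, last_program, last_type):
    """Append one accumulated item to dct, padding absent fields with NaN."""
    for field, value in hit_fields.items():
        dct[field].append(value)
    for field in dct:
        if field not in hit_fields and field not in SPECIAL_FIELDS:
            dct[field].append(nan)
    dct['Job'].append(last_job)
    dct['Program'].append(last_program)
    dct['Type'].append(last_type)
```

With this, the loop body becomes `flush(hit_fields, dct, last_job, last_program, last_type)` followed by `hit_fields = {}`, and the final block after the loop is a single `flush(...)` call.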