
Data from text file to dataframe

I'm trying to parse my data from a .txt file into a dataframe. The text file looks like this:

29.10.2021 21:25:07 -- TestJob_2-34_AV1__29.10.2021 21:24:22  Start Job: TestJob_2-34_AV1.ldt;
29.10.2021 21:24:22  SUV;
Stoolname: Sky;
Times: 6;
Min/Max: Max;
Stool: Sky_Clean;
Stoolname: T_123;
Times: 6;
Min/Max: Max;
Stool: To150;
Stoolname: T_123-Clean;
Times: 0;
Min/Max: Max;
Stool: C_120um;
Stoolname: T_1234;
Times: 1;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T_1234-Clean;
Times: 1;
Min/Max: Max;
Stool: qi_mikk;
Stoolname: T_1234567;
Times: 1;
Min/Max: Max;
Stool: TCu;
Stoolname: T_1234567-Clean;
Times: 0;
Min/Max: Max;
Stool: ChumCu;
Stoolname: Thte;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T30-Clean;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
29.10.2021 21:24:22  C2;
Stoolname: T_1234;
Times: 9;
Min/Max: Max;
Stool: Testabc;
Number: 7;
Stoolname: T_1234-1;
Times: 4;
Min/Max: Max;
Stool: Testabcd;
Number: 7;
Stoolname: T_123;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;


29.10.2021 21:27:13 -- TestJob2_2-34_AV1__ Z-Value = 2.27;
S01DATE29.10.2021 21:26:13SCALEX 1,00102Y 1,00022;
 Z-Value CleanProcess = 4,27;
S01DATE29.10.2021 21:26:51SCALEX 1,00102Y 1,00022;
29.10.2021 21:27:13End Job: e:\ActLaserProgram\TestJob2_2-34_AV1.ldt;

29.10.2021 21:25:07 -- TestJob_2-34_AV1__29.10.2021 21:24:22  Start Job: TestJob_2-34_AV1.ldt;
29.10.2021 21:24:22  SUV;
Stoolname: Sky;
Times: 6;
Min/Max: Max;
MStool: Sky_Clean;
Stoolname: T_123;
Times: 6;
Min/Max: Max;
Stool: To150;
Stoolname: T_123-Clean;
Times: 0;
Min/Max: Max;
Stool: C_120um;
Stoolname: T_1234;
Times: 1;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T_1234-Clean;
Times: 1;
Min/Max: Max;
Stool: qi_mikk;
Stoolname: T_1234567;
Times: 1;
Min/Max: Max;
Stool: TCu;
Stoolname: T_1234567-Clean;
Times: 0;
Min/Max: Max;
Stool: ChumCu;
Stoolname: Thte;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
Stoolname: T30-Clean;
Times: 10;
Min/Max: Max;
Stool: qi_mik;
29.10.2021 21:24:22  C2;
Stoolname: T_1234;
Times: 9;
Min/Max: Max;
Stool: Testabc;
Number: 7;
Stoolname: T_1234-1;
Times: 4;
Min/Max: Max;
Stool: Testabcd;
Number: 7;
Stoolname: T_123;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;
Stoolname: T_1234567;
Times: 3;
Min/Max: Max;
Stool: Testabcde;
Number: 7;


29.10.2021 21:27:13 -- TestJob2_2-34_AV1__ Z-Value = 2.27;
S01DATE29.10.2021 21:26:13SCALEX 1,00102Y 1,00022;
 Z-Value CleanProcess = 4,27;
S01DATE29.10.2021 21:26:51SCALEX 1,00102Y 1,00022;
29.10.2021 21:27:13End Job: e:\ActLaserProgram\TestJob2_2-34_AV1.ldt;

Below is a shortened version of what I'm looking for. I tried to read the txt line by line, but the text file is long and it took too long! I'm stumped.

Job      Program       Type  Stoolname  times  Min/Max  Stool      Number
TestJob  TestJob_2-34  SUV   Sky        6      Max      Sky_Clean  NaN
TestJob  TestJob_2-34  SUV   T_123      6      Max      To150      NaN
...      ...           ...   ...        ...    ...      ...        ...
TestJob  TestJob_2-34  C2    T_1234     9      Max      Testabc    7

The job name and program number are hidden in the line that contains "Start Job".

Thank you!

Here you go! It's really fast.

For a file with ~11 million lines (made by copying and pasting your sample file over and over again), it took about 22 seconds on my machine, and produced a dataframe with 2.2 million rows.

Note: I wasn't sure quite how to handle the Program column, because in your expected dataframe none of the values in it end with _AV1, but the ones in your text file do, and I wasn't sure what your rules are regarding that.
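If you do want to strip that suffix, a possible one-liner (assuming the suffix always has the form `_AV` plus digits, which your sample suggests but doesn't confirm) would be:

```python
import re

# Example value taken from the sample file
program = "TestJob_2-34_AV1"

# Drop a trailing "_AV<digits>" suffix, if present; leaves other values untouched
program = re.sub(r'_AV\d+$', '', program)
print(program)  # TestJob_2-34
```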

import pandas as pd
import json
import re

from numpy import nan

file = 'test.txt'

with open(file) as f:
    lines = f.readlines()

# To store the final data before feeding it to the dataframe
dct = {
    'Job': [],
    'Program': [],
    'Type': [],
    'Stoolname': [],
    'Times': [],
    'Min/Max': [],
    'Stool': [],
    'Number': [],
}

# To keep track of missing values
counts = {}

field_re = re.compile(r'^[a-z/]+:', re.IGNORECASE)
type_change_re = re.compile(r'^[\d\.: -]+\w+$', re.IGNORECASE)

# This will keep a list of names of keys that we've encountered since this item
# started. We need this because there is no delimiter between objects in the text file.
# (Using a dict like a list here, because dicts are much faster to search
# (their keys) than lists)
hit_fields = {}

# Use a dict like a list here (see above)
special_fields = {
    'Job': None,
    'Program': None,
    'Type': None,
}

last_type = ''
last_job = ''
last_program = ''

for line in lines:
    line = line.strip().strip(';')
    if field_re.search(line) is not None:
        k, v = line.split(': ', 1)  # maxsplit=1, in case a value itself contains ': '
        if k in hit_fields:

            # We've found a new item. Add all the accumulated fields to dct
            for field in hit_fields:
                dct[field].append(hit_fields[field])
            for field in dct:
                if field not in hit_fields and field not in special_fields:
                    dct[field].append(nan)
            hit_fields = {}

            dct['Job'].append(last_job)
            dct['Program'].append(last_program)
            dct['Type'].append(last_type)

        hit_fields[k] = v

    elif ' Start Job: ' in line:
        # Change Job and Program
        job = line.split(' Start Job: ')[1]
        if job.endswith('.ldt'):
            job = job[:-4]
        last_job = job.split('_')[0]
        last_program = job

    elif type_change_re.match(line) is not None:
        # Change Type
        last_type = line.split(' ')[-1]

# Flush the last accumulated item (same logic as inside the loop above)
for field in hit_fields:
    dct[field].append(hit_fields[field])
for field in dct:
    if field not in hit_fields and field not in special_fields:
        dct[field].append(nan)
dct['Job'].append(last_job)
dct['Program'].append(last_program)
dct['Type'].append(last_type)

######################################

# Save it to a file:
with open('data.json', 'w') as f:
    json.dump(dct, f)

# Or load it into a dataframe
df = pd.DataFrame(dct)
print(df)
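One note on memory: `f.readlines()` builds a list of every line up front, which for an ~11-million-line file can be a lot of RAM. If memory is tight, you can stream the file instead and keep the loop body unchanged, since iterating a file object yields one line at a time. A minimal sketch (using `io.StringIO` to stand in for the open file):

```python
import io

# Two lines in the same "key: value;" shape as the real file
sample = "Stoolname: Sky;\nTimes: 6;\n"

parsed = []
for line in io.StringIO(sample):   # same as: for line in open(file)
    line = line.strip().strip(';')  # identical per-line cleanup to the loop above
    parsed.append(line)

print(parsed)  # ['Stoolname: Sky', 'Times: 6']
```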
