Reformat text file into dataframe

Question

I'm looking to reformat a text file into a dataframe. The input file would look like this. Each "insert_machine:" value would represent a new record within the dataframe.


/* ----------------- REAL-001  ------------------- */ 

insert_machine: REAL-001
type: a
factor: 1.00  
description: Cloud Added
port: 1234
node_name: REAL-001.some.domain
agent_name: REAL-001
/* key_to_agent: *** masked value ***/
encryption_type: AES
opsys: linux
character_code: ASCII


/* ----------------- REAL-002  ------------------- */ 

insert_machine: REAL-002
type: a
factor: 1.00  
description: Cloud Added
port: 1234
node_name: REAL-002.some.domain 
agent_name: REAL-002
/* key_to_agent: *** masked value ***/
encryption_type: AES
opsys: linux
character_code: ASCII


/* ----------------- VIRTUAL-001 ----------------- */ 

insert_machine: VIRTUAL-001
type: v
machine: REAL-001
factor: ----
machine: REAL-002
factor: ----

My current code is this –

import pandas as pd
jilFile_path = "inputfile.txt"

# Create empty list
jilinArray = []
# Create empty dictionary
oneJob = {}
with open(jilFile_path, "rt") as jil:
    jilLines = jil.readlines()
    for linesInJill in jilLines:
        if "insert_machine:" in linesInJill:
            jilinArray.append(oneJob)
            linesInJill = linesInJill.strip()
            machine = linesInJill.split("insert_machine:")[1]
            oneJob = {}
            oneJob["insert_machine"] = str(machine).strip()

        else:
            if linesInJill != "\n" and "/* ----" not in linesInJill:
                if ": " in linesInJill:
                    spli = linesInJill.split(":", 1)
                    oneJob[str(spli[0]).strip()] = str(spli[1]).strip().replace("\"", "")
    jilinArray.append(oneJob)

df = pd.DataFrame(jilinArray, columns=['insert_machine', 'type', 'description', 'port', 'node_name', 'agent_name',
                                       'encrption_type', 'opsys', 'character_code', 'machine'])

print(df)

Which gives me this output –

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
2       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
3    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002

My issue is with the "insert_machine:" entries that have "type: v". They can have zero to many "machine:" values, and I'm not sure how to get each of those reflected in my dataframe.

I'd like to see something like this -

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
2       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
3    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-001
4    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002

Ultimately I'd like to see this, but if I can at least get all the "machine:" entries within the df I'm hoping I can go from there.

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII  VIRTUAL-001
2       REAL-002    a  Cloud Added  ...  linux          ASCII  VIRTUAL-001

Any thoughts on how I can get each of those "machine:" values reflected in my dataframe?

Answer 1

I'm sure there is a far more elegant way of handling this, but this is what I came up with.

My initial code now looks like this -

# Create empty list
jilinArray = []
# Create empty dictionary
oneJob = {}
# Read our input files
with open(jilFile_path, "rt") as jil:
    jilLines = jil.readlines()
    for linesInJill in jilLines:
        if "insert_machine:" in linesInJill:
            linesInJill = linesInJill.strip()
            ins_mach = linesInJill.split("insert_machine:")[1]
            ins_mach_temp = ins_mach
            oneJob = {}
            oneJob["insert_machine"] = str(ins_mach).strip()
            jilinArray.append(oneJob)
        else:
            if linesInJill != "\n" and "/* ----" not in linesInJill:
                if ": " in linesInJill:
                    spli = linesInJill.split(":", 1)
                    oneJob[str(spli[0]).strip()] = str(spli[1]).strip().replace("\"", "")
                    # To allow for virtual agents that have multiple 'machine:' entries
                    if spli[0] == 'type':
                        type_temp = spli[1]
                    if spli[0] == 'machine':
                        jilinArray.append(oneJob)
                        oneJob = {}
                        oneJob["insert_machine"] = ins_mach_temp.strip()
                        oneJob["type"] = type_temp.strip()

# Load the list into a dataframe
df = pd.DataFrame(jilinArray, columns=['insert_machine', 'type', 'description', 'port', 'node_name', 'agent_name',
                                       'encrption_type', 'opsys', 'character_code', 'machine'])

# Remove all duplicate entries.
df.drop_duplicates(inplace=True)

print(df)

Which gives me this output -

  insert_machine type  description  ...  opsys character_code   machine
0       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
1       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
2    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-001
4    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002
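
A possible alternative to the append-then-drop_duplicates step is to collect every "machine:" value of a block into a list and let DataFrame.explode turn that list into one row per value. The sketch below is only an illustration: the SAMPLE text and the parse_machines helper are made up for this example, and it makes two assumptions that differ from the code above. It skips every line starting with "/*" (including the masked key_to_agent line), and it keeps only the last "factor:" value of a virtual block.

```python
import io
import pandas as pd

# Hypothetical, shortened sample that mirrors the input file in the question.
SAMPLE = """\
/* ----------------- REAL-001  ------------------- */

insert_machine: REAL-001
type: a
description: Cloud Added
opsys: linux

/* ----------------- VIRTUAL-001 ----------------- */

insert_machine: VIRTUAL-001
type: v
machine: REAL-001
factor: ----
machine: REAL-002
factor: ----
"""


def parse_machines(lines):
    """Return one dict per insert_machine block; 'machine' values go into a list."""
    records = []
    current = None
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and /* ... */ comment lines carry no data.
        if not line or line.startswith("/*"):
            continue
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "insert_machine":
            # A new block starts: every record gets its own (possibly empty) list.
            current = {"insert_machine": value, "machine": []}
            records.append(current)
        elif current is None:
            # Ignore key/value lines that appear before the first block.
            continue
        elif key == "machine":
            current["machine"].append(value)
        else:
            current[key] = value
    return records


records = parse_machines(io.StringIO(SAMPLE))

# explode() emits one row per list element; an empty list becomes a single NaN row.
df = pd.DataFrame(records).explode("machine", ignore_index=True)
print(df[["insert_machine", "type", "machine"]])
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by the open file handle. Because each block is appended exactly once, no duplicate rows are created and the drop_duplicates call is not needed in this variant.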

I then added this to merge the entries -

# Copy our dataframe and filter on the 'type' column to only return virtual agents
df2 = df.copy()
df2 = df2[df2['type'].eq('v')]

# Select our desired columns
df2 = df2[['insert_machine', 'machine']]

# Rename some columns
df2.rename(columns={'insert_machine': 'Virtual_Machine'}, inplace=True)

# Merge the original dataframe (df) with the copied dataframe (df2) to combine the real and virtual agent names into
# one record.
df_mg = pd.merge(df, df2,
                 left_on=df["insert_machine"].str.lower(),
                 right_on=df2["machine"].str.lower(),
                 how='left')

# Rename some columns
df_mg.rename(columns={'insert_machine': 'Real_Machine'}, inplace=True)

# Select our desired columns
df_mg = df_mg[['Virtual_Machine', 'Real_Machine', 'type', 'description', 'port', 'node_name', 'agent_name',
               'encrption_type', 'opsys', 'character_code']]

# Filter on the 'type' column
df_mg = df_mg[df_mg['type'].eq('a')]

print(df_mg)

Which gives me this output -

  Virtual_Machine Real_Machine type  ... encrption_type  opsys character_code
0     VIRTUAL-001     REAL-001    a  ...            NaN  linux          ASCII
1     VIRTUAL-001     REAL-002    a  ...            NaN  linux          ASCII

It seems to be working for me.
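
One detail in the output above: the input file spells the key "encryption_type", while the column lists use 'encrption_type', which is why that column shows NaN. The merge step could also be written a little more compactly. The following is a sketch under assumptions, not a drop-in replacement: the small df is hand-built to resemble the de-duplicated dataframe (with only a few columns and the column name spelled as in the input file), and it joins on exact name matches instead of lower-casing both sides.

```python
import pandas as pd

# Hypothetical frame shaped like the de-duplicated df above (subset of columns).
df = pd.DataFrame(
    {
        "insert_machine": ["REAL-001", "REAL-002", "VIRTUAL-001", "VIRTUAL-001"],
        "type": ["a", "a", "v", "v"],
        "encryption_type": ["AES", "AES", None, None],
        "machine": [None, None, "REAL-001", "REAL-002"],
    }
)

# Rows of type 'v' provide the (virtual, real) name pairs.
pairs = df.loc[df["type"].eq("v"), ["insert_machine", "machine"]].rename(
    columns={"insert_machine": "Virtual_Machine", "machine": "Real_Machine"}
)

# Rows of type 'a' carry the attributes of the real machines.
real = (
    df.loc[df["type"].eq("a")]
    .drop(columns="machine")
    .rename(columns={"insert_machine": "Real_Machine"})
)

# A left join keeps real machines that do not belong to any virtual machine.
result = real.merge(pairs, on="Real_Machine", how="left")
result = result[["Virtual_Machine", "Real_Machine", "type", "encryption_type"]]
print(result)
```

If the names can differ in case between the "insert_machine:" and "machine:" lines, a lower-cased helper column on both frames could serve as the join key instead.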
