Reformatting a text file into a dataframe

Question

I want to reformat a text file into a dataframe. The input file looks like this. Each "insert_machine:" value represents a new record in the dataframe.


/* ----------------- REAL-001  ------------------- */ 

insert_machine: REAL-001
type: a
factor: 1.00  
description: Cloud Added
port: 1234
node_name: REAL-001.some.domain
agent_name: REAL-001
/* key_to_agent: *** masked value ***/
encryption_type: AES
opsys: linux
character_code: ASCII


/* ----------------- REAL-002  ------------------- */ 

insert_machine: REAL-002
type: a
factor: 1.00  
description: Cloud Added
port: 1234
node_name: REAL-002.some.domain 
agent_name: REAL-002
/* key_to_agent: *** masked value ***/
encryption_type: AES
opsys: linux
character_code: ASCII


/* ----------------- VIRTUAL-001 ----------------- */ 

insert_machine: VIRTUAL-001
type: v
machine: REAL-001
factor: ----
machine: REAL-002
factor: ----

My current code looks like this:

import pandas as pd
jilFile_path = "inputfile.txt"

# Create empty list
jilinArray = []
# Create empty dictionary
oneJob = {}
with open(jilFile_path, "rt") as jil:
    jilLines = jil.readlines()
    for linesInJill in jilLines:
        if "insert_machine:" in linesInJill:
            jilinArray.append(oneJob)
            linesInJill = linesInJill.strip()
            machine = linesInJill.split("insert_machine:")[1]
            oneJob = {}
            oneJob["insert_machine"] = str(machine).strip()

        else:
            if linesInJill != "\n" and "/* ----" not in linesInJill:
                if ": " in linesInJill:
                    spli = linesInJill.split(":", 1)
                    oneJob[str(spli[0]).strip()] = str(spli[1]).strip().replace("\"", "")
    jilinArray.append(oneJob)

df = pd.DataFrame(jilinArray, columns=['insert_machine', 'type', 'description', 'port', 'node_name', 'agent_name',
                                       'encryption_type', 'opsys', 'character_code', 'machine'])

print(df)

This gives me the following output:

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
2       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
3    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002

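My problem is with the "insert_machine:" entries that have "type: v". They can have zero to many "machine:" values, and I am not sure how to get each of them reflected in my dataframe.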
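The symptom in the output above (only REAL-002 survives for VIRTUAL-001) is consistent with how a dict treats repeated keys: every "machine:" line is assigned to the same key, so a later value replaces an earlier one. A minimal sketch of that behaviour, using hypothetical variable names that are not part of the code above:

```python
# Hypothetical illustration: assigning the same dict key twice keeps only the last value.
lines = ["machine: REAL-001", "machine: REAL-002"]

record = {}
for line in lines:
    key, value = line.split(":", 1)
    record[key.strip()] = value.strip()  # the second assignment replaces the first

# Collecting the values in a list keeps every entry instead.
machines = [line.split(":", 1)[1].strip() for line in lines]

print(record)    # {'machine': 'REAL-002'}
print(machines)  # ['REAL-001', 'REAL-002']
```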

I would like to see something like this:

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
2       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
3    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-001
4    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002

Ultimately I would like to see the following, but if I can at least get all the "machine:" entries into the df, I hope I can go from there.

  insert_machine type  description  ...  opsys character_code   machine
0            NaN  NaN          NaN  ...    NaN            NaN       NaN
1       REAL-001    a  Cloud Added  ...  linux          ASCII  VIRTUAL-001
2       REAL-002    a  Cloud Added  ...  linux          ASCII  VIRTUAL-001

Any ideas on how to get each "machine:" value reflected in my dataframe?

Answer 1

I'm sure there are more elegant ways to handle this, but this is what I came up with.

My initial code now looks like this:

import pandas as pd

# Path to the input file
jilFile_path = "inputfile.txt"

# Create empty list
jilinArray = []
# Create empty dictionary
oneJob = {}
# Read our input file
with open(jilFile_path, "rt") as jil:
    jilLines = jil.readlines()
    for linesInJill in jilLines:
        if "insert_machine:" in linesInJill:
            linesInJill = linesInJill.strip()
            ins_mach = linesInJill.split("insert_machine:")[1]
            ins_mach_temp = ins_mach
            oneJob = {}
            oneJob["insert_machine"] = str(ins_mach).strip()
            jilinArray.append(oneJob)
        else:
            if linesInJill != "\n" and "/* ----" not in linesInJill:
                if ": " in linesInJill:
                    spli = linesInJill.split(":", 1)
                    oneJob[str(spli[0]).strip()] = str(spli[1]).strip().replace("\"", "")
                    # To allow for virtual agents that have multiple 'machine:' entries
                    if spli[0] == 'type':
                        type_temp = spli[1]
                    if spli[0] == 'machine':
                        jilinArray.append(oneJob)
                        oneJob = {}
                        oneJob["insert_machine"] = ins_mach_temp.strip()
                        oneJob["type"] = type_temp.strip()

# Load the list into a dataframe
df = pd.DataFrame(jilinArray, columns=['insert_machine', 'type', 'description', 'port', 'node_name', 'agent_name',
                                       'encryption_type', 'opsys', 'character_code', 'machine'])

# Remove all duplicate entries.
df.drop_duplicates(inplace=True)

print(df)

This gives me the following output:

  insert_machine type  description  ...  opsys character_code   machine
0       REAL-001    a  Cloud Added  ...  linux          ASCII       NaN
1       REAL-002    a  Cloud Added  ...  linux          ASCII       NaN
2    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-001
4    VIRTUAL-001    v          NaN  ...    NaN            NaN  REAL-002

Then I added this to merge the entries:

# Copy our dataframe and filter on the 'type' column to only return virtual agents
df2 = df.copy()
df2 = df2[df2['type'].eq('v')]

# Select our desired columns
df2 = df2[['insert_machine', 'machine']]

# Rename some columns
df2 = df2.rename(columns={'insert_machine': 'Virtual_Machine'})

# Merge the original dataframe (df) with the copied dataframe (df2) to combine the real and virtual agent names
# into one record. The keys are lower-cased so the match is case-insensitive.
df_mg = pd.merge(df, df2,
                 left_on=df["insert_machine"].str.lower(),
                 right_on=df2["machine"].str.lower(),
                 how='left')

# Rename some columns
df_mg.rename(columns={'insert_machine': 'Real_Machine'}, inplace=True)

# Select our desired columns
df_mg = df_mg[['Virtual_Machine', 'Real_Machine', 'type', 'description', 'port', 'node_name', 'agent_name',
               'encryption_type', 'opsys', 'character_code']]

# Filter on the 'type' column to keep only the real agents
df_mg = df_mg[df_mg['type'].eq('a')]

print(df_mg)

This gives me the following output:

  Virtual_Machine Real_Machine type  ... encryption_type  opsys character_code
0     VIRTUAL-001     REAL-001    a  ...             AES  linux          ASCII
1     VIRTUAL-001     REAL-002    a  ...             AES  linux          ASCII

It seems to work for me.
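For comparison, here is a minimal sketch of another route, not the approach used above: keep the "machine:" values of each block in a list, let `DataFrame.explode` produce one row per value, and then join the real and virtual rows on the machine name. It assumes a reasonably recent pandas (`explode` with `ignore_index` needs 1.1 or later). The names `SAMPLE`, `parse_machines`, `long_df`, and `result` are made up for this example, the sample is a trimmed copy of the input from the question, and the case-insensitive matching from the merge above is left out for brevity.

```python
import io
import pandas as pd

# Trimmed sample in the same layout as the question's input file.
SAMPLE = """\
/* ----------------- REAL-001  ------------------- */

insert_machine: REAL-001
type: a
opsys: linux

insert_machine: REAL-002
type: a
opsys: linux

insert_machine: VIRTUAL-001
type: v
machine: REAL-001
factor: ----
machine: REAL-002
factor: ----
"""


def parse_machines(handle):
    """Return one dict per insert_machine block; 'machine' values go into a list."""
    records = []
    current = None
    for raw in handle:
        line = raw.strip()
        # Skip blanks, comment lines (including the masked key), and non key/value lines.
        if not line or line.startswith("/*") or ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "insert_machine":
            current = {"insert_machine": value, "machine": []}
            records.append(current)
        elif current is None:
            continue  # attribute before the first block: ignore it
        elif key == "machine":
            current["machine"].append(value)
        else:
            current[key] = value
    return records


df = pd.DataFrame(parse_machines(io.StringIO(SAMPLE)))

# One row per (virtual machine, member); an empty list becomes a single NaN row.
long_df = df.explode("machine", ignore_index=True)

# Virtual rows supply the mapping from real machine name to virtual machine name.
virtual = (
    long_df.loc[long_df["type"].eq("v"), ["insert_machine", "machine"]]
    .rename(columns={"insert_machine": "Virtual_Machine", "machine": "Real_Machine"})
)
# Real rows keep their attributes; columns that only apply to virtual blocks are dropped.
real = (
    long_df.loc[long_df["type"].eq("a")]
    .drop(columns=["machine", "factor"], errors="ignore")
    .rename(columns={"insert_machine": "Real_Machine"})
)
# how="left" keeps real machines that are not listed by any virtual machine.
result = real.merge(virtual, on="Real_Machine", how="left")
print(result)
```

Because the list is only expanded at the end, no duplicate rows are created along the way, so a `drop_duplicates` step should not be needed in this variant.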

Solution 1
0 2022-08-26 12:59:47
