How to create dataframe from different txt files in python?

Question

I would like to create a dataframe from several txt files that I have in a folder on the Desktop. The folder path is: "C:\Users\luca\Desktop\traceroute".

These are two of the N files that are inside the folder:

Example H11H12Trace.txt:

traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
 1  10.0.11.10  0.034 ms  0.007 ms  0.006 ms  0.005 ms  0.005 ms  0.005 ms 
 2  10.0.1.2  0.017 ms  0.009 ms  0.008 ms  0.008 ms  0.040 ms  0.017 ms   
 3  10.0.12.100  0.026 ms  0.016 ms  0.018 ms  0.014 ms  0.026 ms  0.018 ms

Example H13H34Trace.txt:

traceroute to 10.0.34.100 (10.0.34.100), 30 hops max, 60 byte packets
 1  10.0.13.10  0.036 ms  0.007 ms  0.005 ms  0.006 ms  0.005 ms  0.005 ms
 2  10.0.1.17  0.017 ms  0.008 ms  0.008 ms  0.008 ms  0.008 ms  0.018 ms
 3  10.0.1.14  0.020 ms  0.011 ms  0.011 ms  0.011 ms  0.020 ms  0.016 ms
 4  10.0.6.6  0.031 ms  0.023 ms  0.023 ms  0.014 ms  0.013 ms  0.012 ms
 5  10.0.7.6  0.023 ms  0.016 ms  0.016 ms  0.015 ms  0.015 ms  0.015 ms
 6  10.0.3.5  0.026 ms  0.018 ms  0.022 ms  0.018 ms  0.040 ms  0.019 ms
 7  10.0.3.2  0.028 ms  0.021 ms  0.020 ms  0.020 ms  0.021 ms  0.020 ms
 8  10.0.34.100  0.030 ms  0.021 ms  0.021 ms  0.020 ms  0.022 ms 0.020 ms

This is my code:

from collections import defaultdict
from pathlib import Path
import pandas as pd

my_dir_path = r"C:\Users\luca\Desktop\traceroute"  # raw string so the backslashes are not read as escapes

results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    with open(file, "r") as file_open:
        results["text"].append(file_open.read())
df = pd.DataFrame(results)

The dataframe I would like is similar to this.

   Source     Destination  Path_Router  Individuals  Path_Avg_Delay  Path_Individuals_Delay 
    
  10.0.11.10  10.0.12.100   [10.0.11.10, [(10.0.11.10   [0.0103,      [0.0266,0.0359]
                             10.0.1.2,    ,10.0.1.2),    0.0163,
                             10.0.12.100] (10.0.1.2,     0.0196]
                                          10.0.12.100)]  
  10.0.13.10  10.0.34.100   ...........   ..............  ........    ...................

"Path_Router" is the list of routers along the route.

"Individuals" pairs each router with the next one: the first router with the second, the second with the third, the third with the fourth, and so on.

For "Path_Avg_Delay" I would like to average each router's row. For example, 10.0.11.10 gets the average of its 6 delays: (0.034 + 0.007 + 0.006 + 0.005 + 0.005 + 0.005) / 6 = 0.0103, and so on.

For "Path_Individuals_Delay" I would like to sum the averages of the two routers in each pair, i.e. 0.0103 + 0.0163 = 0.0266, 0.0163 + 0.0196 = 0.0359.

Unfortunately I don't have much experience with Python yet; I hope you can help.

Thank you so much

Answer 1

import io
import pandas as pd

# create the dataframe up front so it can be populated later
columns = ['Source', 'Destination', 'Path_Router', 'Individuals', 'Path_Avg_Delay', 'Path_Individuals_Delay']
df_master = pd.DataFrame(columns=columns)

# depending on how you cleanse your data upfront, you can use the code close to as is or you'll need to adapt to your input

d = """
1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005
2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017
3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018"""

for file in files:  # files: your list of input files, defined elsewhere
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None) #<<<<<<< your cleansed data here

    Path_Router =df1[1].tolist() #<<<<<<< ip column
    Source = Path_Router[0] # first element
    Destination = Path_Router[-1] # last element

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]): # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean()) # columns with delay data

    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]): # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))

    data_list = [Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay]
    df_master.loc[len(df_master)] = data_list # add list to bottom of dataframe
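
The code above expects "cleansed" input with no header line and no 'ms' units. A minimal sketch of that cleansing step could look like the following. The function name clean_traceroute_text is hypothetical, and the sketch assumes every hop line starts with the hop number and contains only numeric delays, as in the sample files (real traceroute output can also contain '*' for timeouts, which this does not handle).

```python
def clean_traceroute_text(raw):
    """Drop the 'traceroute to ...' header and the 'ms' units from raw output.

    Hypothetical helper: assumes hop lines start with the hop number.
    """
    cleaned_lines = []
    for line in raw.splitlines():
        tokens = line.split()
        # Keep only hop lines; the header and blank lines do not start with a digit.
        if not tokens or not tokens[0].isdigit():
            continue
        # Re-join with single spaces so sep=' ' in read_csv sees one column per value.
        cleaned_lines.append(" ".join(t for t in tokens if t != "ms"))
    return "\n".join(cleaned_lines)
```

The returned string should be usable in place of d in pd.read_csv(io.StringIO(d), sep=' ', header=None).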

Adding some guidance based on creating the dataframe upfront as you've done: you'll need to pass the text column to a function (check out apply() or transform(); I'm not sure which would work best). Rework the code as follows and try it. You'll have a lot of tweaking to do, as the data still contains all the 'ms' units; I suggest stripping them out in the first part of the function, which I'll leave to you. This will by no means work out of the gate, but it should get you to where you need to be.

def transform_data(d):
    # d is the row element being passed in by apply()
    # you're getting the data string now and you need to massage into df1
    # if your main df is called df, i think you can write directly to it, if not you can create a separate df and then merge the two at the end

    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None) #<<<<<<< your cleansed data here

    Path_Router =df1[1].tolist() #<<<<<<< ip column
    Source = Path_Router[0] # first element
    Destination = Path_Router[-1] # last element
    df['Path_Router'] = Path_Router
    df['Destination'] = Destination
    df['Source'] = Source

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]): # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))
    df['Individuals'] = Individuals


    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean()) # columns with delay data
    df['Path_Avg_Delay'] = Path_Avg_Delay

    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]): # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))
    df['Path_Individuals_Delay'] = Path_Individuals_Delay

Like I said, you may get some errors when trying to write to the df from inside the apply call, so you may have to work that out.

I believe you can call the function like this. Your text series will be passed in row by row to the function. Updating the df you are reading from is probably bad form, so again, think about writing to df_master and merging it with the original df.

df = df.apply(lambda x: transform_data(x['text']), axis=1)  # axis=1 passes one row at a time

I'm not terribly familiar with df.transform() but this could be investigated as well.
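
As an alternative to writing into df from inside apply(), here is a rough sketch that builds one plain dict per file and creates the dataframe once at the end. The names summarize_trace and build_dataframe are hypothetical, and the sketch assumes each file has at least one hop line in the format shown in the question (hop number, IP, then delays followed by 'ms').

```python
from pathlib import Path


def summarize_trace(text):
    """Turn one file's traceroute output into a dict with the six columns."""
    routers, avg_delays = [], []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # skip the header line and blank lines
        delays = [float(t) for t in tokens[2:] if t != "ms"]
        routers.append(tokens[1])
        avg_delays.append(sum(delays) / len(delays))
    return {
        "Source": routers[0],
        "Destination": routers[-1],
        "Path_Router": routers,
        # Pair each router with the next one along the path.
        "Individuals": list(zip(routers, routers[1:])),
        "Path_Avg_Delay": avg_delays,
        # Sum the average delays of the two routers in each pair.
        "Path_Individuals_Delay": [a + b for a, b in zip(avg_delays, avg_delays[1:])],
    }


def build_dataframe(folder):
    """Build one row per .txt file in `folder` (hypothetical helper)."""
    import pandas as pd  # imported lazily so summarize_trace works without pandas

    records = [summarize_trace(p.read_text()) for p in sorted(Path(folder).glob("*.txt"))]
    return pd.DataFrame(records)
```

With the folder from the question, the call would be something like build_dataframe(r"C:\Users\luca\Desktop\traceroute"). Collecting the records in a list first avoids modifying a dataframe while iterating over it.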

Answer 2

Before using regex, you can read each file with:

with open('filename', 'r') as f:
    file = f.read()

The with statement closes the file automatically once the block finishes. After that you can process the contents by reading the lines and splitting the text, or by using regex, but I don't suggest using regex in this case.
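
To illustrate the read-and-split approach, here is a small sketch that reads every .txt file in a folder and splits each hop line on whitespace. The helper name read_hop_lines is hypothetical, and the sketch assumes the first line of each file is the "traceroute to ..." header, as in the examples from the question.

```python
from pathlib import Path


def read_hop_lines(folder):
    """Map each .txt file name in `folder` to its list of split hop lines."""
    hops_by_file = {}
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, "r") as f:  # the file is closed when this block exits
            lines = f.read().splitlines()
        # Skip the header line, ignore blank lines, split the rest on whitespace.
        hops_by_file[path.name] = [ln.split() for ln in lines[1:] if ln.strip()]
    return hops_by_file
```

Each hop then comes back as a list of tokens such as ['1', '10.0.11.10', '0.034', 'ms', ...], so the IP is at index 1 and the delays are the remaining tokens that are not 'ms'.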

solution1
1 ACCEPTED 2020-12-14 20:05:40

solution2
0 2020-12-14 19:57:08