I would like to create a dataframe from several txt files that I have in a folder on my Desktop. The folder path is: "C:\Users\luca\Desktop\traceroute".
These are two of the N files that are inside the folder:
Example H11H12Trace.txt:
traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms
Example H13H34Trace.txt:
traceroute to 10.0.34.100 (10.0.34.100), 30 hops max, 60 byte packets
1 10.0.13.10 0.036 ms 0.007 ms 0.005 ms 0.006 ms 0.005 ms 0.005 ms
2 10.0.1.17 0.017 ms 0.008 ms 0.008 ms 0.008 ms 0.008 ms 0.018 ms
3 10.0.1.14 0.020 ms 0.011 ms 0.011 ms 0.011 ms 0.020 ms 0.016 ms
4 10.0.6.6 0.031 ms 0.023 ms 0.023 ms 0.014 ms 0.013 ms 0.012 ms
5 10.0.7.6 0.023 ms 0.016 ms 0.016 ms 0.015 ms 0.015 ms 0.015 ms
6 10.0.3.5 0.026 ms 0.018 ms 0.022 ms 0.018 ms 0.040 ms 0.019 ms
7 10.0.3.2 0.028 ms 0.021 ms 0.020 ms 0.020 ms 0.021 ms 0.020 ms
8 10.0.34.100 0.030 ms 0.021 ms 0.021 ms 0.020 ms 0.022 ms 0.020 ms
This is my code:
from collections import defaultdict
from pathlib import Path
import pandas as pd
# raw string, so the backslashes in the Windows path are not treated as escapes
my_dir_path = r"C:\Users\luca\Desktop\traceroute"
results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    with open(file, "r") as file_open:
        results["text"].append(file_open.read())
df = pd.DataFrame(results)
df ends up with a single "text" column that holds the full contents of each file, one row per file.
The dataframe I would like is similar to this.
Source      Destination  Path_Router                          Individuals                                        Path_Avg_Delay            Path_Individuals_Delay
10.0.11.10  10.0.12.100  [10.0.11.10, 10.0.1.2, 10.0.12.100]  [(10.0.11.10, 10.0.1.2), (10.0.1.2, 10.0.12.100)]  [0.0103, 0.0163, 0.0196]  [0.0266, 0.0359]
10.0.13.10  10.0.34.100  ...                                  ...                                                ...                       ...
"Path_Router" is the list of routers along the path, in order.
"Individuals" pairs each router with the next one: the first with the second, the second with the third, the third with the fourth, and so on.
"Path_Avg_Delay" is the average of the delays on each router's row. For example, for 10.0.11.10 it is the average of the 6 delays: (0.034 + 0.007 + 0.006 + 0.005 + 0.005 + 0.005) / 6 ≈ 0.0103, and so on.
"Path_Individuals_Delay" is the sum of the average delays of the two routers in each individual, i.e. 0.0103 + 0.0163 = 0.0266 and 0.0163 + 0.0196 = 0.0359.
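As a quick check of the arithmetic described above, here is a minimal sketch in plain Python that uses the three rows of H11H12Trace.txt. The names `hops`, `path_avg_delay` and `path_individuals_delay` are only illustrative and are not part of any library.

```python
# Delay samples (ms) per hop, copied from H11H12Trace.txt
hops = {
    "10.0.11.10": [0.034, 0.007, 0.006, 0.005, 0.005, 0.005],
    "10.0.1.2": [0.017, 0.009, 0.008, 0.008, 0.040, 0.017],
    "10.0.12.100": [0.026, 0.016, 0.018, 0.014, 0.026, 0.018],
}

# Path_Avg_Delay: mean of the six samples of each hop
path_avg_delay = [sum(samples) / len(samples) for samples in hops.values()]

# Path_Individuals_Delay: sum of the averages of each pair of consecutive hops
path_individuals_delay = [a + b for a, b in zip(path_avg_delay, path_avg_delay[1:])]

print([round(x, 4) for x in path_avg_delay])
print([round(x, 4) for x in path_individuals_delay])
```

Computed this way and rounded to four decimals, the averages come out as roughly 0.0103, 0.0165 and 0.0197, so the hand-computed figures in the table are approximate in the last digits.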
Unfortunately I don't have much experience with Python yet, so I hope you can help.
Thank you so much.
import io
import pandas as pd

# create the dataframe you want, to be populated later
columns = ['Source', 'Destination', 'Path_Router', 'Individuals', 'Path_Avg_Delay', 'Path_Individuals_Delay']
df_master = pd.DataFrame(columns=columns)
# depending on how you cleanse your data upfront, you can use the code more or less as is, or you'll need to adapt it to your input
d = """
1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005
2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017
3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018"""
files = ['H11H12Trace.txt']  # placeholder: replace with the paths of your trace files
for file in files:
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element
    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))
    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean())  # columns with delay data
    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append(Path_Avg_Delay[i] + Path_Avg_Delay[i+1])
    data_list = [Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay]
    df_master.loc[len(df_master)] = data_list  # add list to bottom of dataframe
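The code above expects data that has already been cleansed. One possible way to get from the raw file text to that form is sketched below. The helper name `cleanse` is hypothetical, and the sketch assumes every hop line starts with the hop number and that the only tokens to drop are the "traceroute to ..." header and the "ms" units (no "*" timeouts, as in the sample files).

```python
def cleanse(raw_text):
    """Drop the header line and the 'ms' tokens (hypothetical helper)."""
    rows = []
    for line in raw_text.splitlines():
        tokens = line.split()
        # keep only hop lines, which start with the hop number
        if not tokens or not tokens[0].isdigit():
            continue
        rows.append(" ".join(t for t in tokens if t != "ms"))
    return "\n".join(rows)

# raw contents of H11H12Trace.txt, as posted in the question
raw = """traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms"""

d = cleanse(raw)
print(d)
```

The resulting string has the same space-separated layout as the `d` used above, so it could be passed to `io.StringIO` and `pd.read_csv` in the same way.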
Adding some guidance based on creating the dataframe upfront, as you've done. You'll need to pass the text column to a function (check out apply() or transform(); I'm not sure which would work best). So rework the code as follows and try it. Of course, you'll have lots of tweaking to do, as the data still contains all the 'ms'. I suggest stripping those out in the first part of the function; I'll leave that to you. This by no means will work out of the gate, but it should get you to where you need to be.
def transform_data(d):
    # d is the element of the 'text' column being passed in by apply()
    # you're getting the data string now and you need to massage it into df1
    # writing directly to your main df from in here is error-prone, so return the row and collect the results in a separate df
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element
    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))
    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean())  # columns with delay data
    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append(Path_Avg_Delay[i] + Path_Avg_Delay[i+1])
    # one row of results, labelled with the column names defined earlier
    return pd.Series([Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay], index=columns)
Like I said, you may get errors if you try to write to df from inside the function that apply() is calling, and updating the df you are reading from is probably bad form anyway. That is why the function returns its results instead, so you can collect them in df_master and merge that with the original df afterwards if you need to.
I believe you can call the function like this. Your text series will be passed to the function element by element.
df_master = df['text'].apply(transform_data)
I'm not terribly familiar with df.transform() but this could be investigated as well.
Before reaching for regex, you can read each file with:
with open('filename', 'r') as f:
    file = f.read()
The with statement closes the file automatically once the block finishes. After that you can process the contents by splitting the text into lines and fields or by using regex, but I don't suggest using regex in this case.
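A minimal sketch of the line-splitting approach, without regex, might look like the following. The function name `parse_trace` is hypothetical, and the sketch assumes every hop line starts with the hop number followed by the IP and that no hop timed out (no "*" entries). The sample is written to a temporary folder only so that the example is self-contained; in practice the loop would run over the real traceroute folder.

```python
from pathlib import Path
import tempfile

def parse_trace(text):
    """Build one row of the desired dataframe from the text of one trace file."""
    routers, avg_delays = [], []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # skip the header and blank lines
        routers.append(tokens[1])
        # every token after the IP that is not the 'ms' unit is a delay sample
        samples = [float(t) for t in tokens[2:] if t != "ms"]
        avg_delays.append(sum(samples) / len(samples))
    return {
        "Source": routers[0],
        "Destination": routers[-1],
        "Path_Router": routers,
        "Individuals": list(zip(routers, routers[1:])),
        "Path_Avg_Delay": avg_delays,
        "Path_Individuals_Delay": [a + b for a, b in zip(avg_delays, avg_delays[1:])],
    }

sample = """traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms"""

# stand-in for the real folder: write the sample to a temporary directory
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "H11H12Trace.txt").write_text(sample)
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        with open(path, "r") as f:
            rows.append(parse_trace(f.read()))

print(rows[0]["Source"], rows[0]["Destination"])
print(rows[0]["Individuals"])
```

Each element of `rows` is a plain dict, so passing the list to `pd.DataFrame(rows)` should give one dataframe row per file with the columns asked for in the question.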