How to create dataframe from different txt files in python?
I would like to create a dataframe from several txt files in a folder on my desktop. The folder path is "C:\Users\luca\Desktop\traceroute".
These are two of the N files in the folder:
Example H11H12Trace.txt:
traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms
Example H13H34Trace.txt:
traceroute to 10.0.34.100 (10.0.34.100), 30 hops max, 60 byte packets
1 10.0.13.10 0.036 ms 0.007 ms 0.005 ms 0.006 ms 0.005 ms 0.005 ms
2 10.0.1.17 0.017 ms 0.008 ms 0.008 ms 0.008 ms 0.008 ms 0.018 ms
3 10.0.1.14 0.020 ms 0.011 ms 0.011 ms 0.011 ms 0.020 ms 0.016 ms
4 10.0.6.6 0.031 ms 0.023 ms 0.023 ms 0.014 ms 0.013 ms 0.012 ms
5 10.0.7.6 0.023 ms 0.016 ms 0.016 ms 0.015 ms 0.015 ms 0.015 ms
6 10.0.3.5 0.026 ms 0.018 ms 0.022 ms 0.018 ms 0.040 ms 0.019 ms
7 10.0.3.2 0.028 ms 0.021 ms 0.020 ms 0.020 ms 0.021 ms 0.020 ms
8 10.0.34.100 0.030 ms 0.021 ms 0.021 ms 0.020 ms 0.022 ms 0.020 ms
This is my code:
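from collections import defaultdict
from pathlib import Path
import pandas as pd

# Raw string, so the backslashes are not read as escape sequences
# ("\U..." is a syntax error in a normal string literal).
my_dir_path = r"C:\Users\luca\Desktop\traceroute"

results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    # Store the full text of each trace file in the "text" column
    with open(file, "r") as file_open:
        results["text"].append(file_open.read())
df = pd.DataFrame(results)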
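df then has a single column, text, with one row per file holding that file's raw contents.
The dataframe I want looks something like this: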
Source | Destination | Path_Router | Individuals | Path_Avg_Delay | Path_Individuals_Delay
10.0.11.10 | 10.0.12.100 | [10.0.11.10, 10.0.1.2, 10.0.12.100] | [(10.0.11.10, 10.0.1.2), (10.0.1.2, 10.0.12.100)] | [0.0103, 0.0163, 0.0196] | [0.0266, 0.0359]
10.0.13.10 | 10.0.34.100 | ... | ... | ... | ...
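"Path_Router" is the list of routers along the route, in order.
"Individuals" is the list of pairs of consecutive routers: the first and second router, then the second and third, then the third and fourth, and so on.
"Path_Avg_Delay" is the average of each router's line. For example, 10.0.11.10 gets the mean of its 6 delays: (0.034 + 0.007 + 0.006 + 0.005 + 0.005 + 0.005) / 6 = 0.0103, and so on.
"Path_Individuals_Delay" is the sum of the average delays of the two routers in each pair, i.e. 0.0103 + 0.0163 = 0.0266 and 0.0163 + 0.0196 = 0.0359.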
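The four definitions above can be checked by hand with a short pure-Python sketch. It assumes the hop lines of H11H12Trace.txt have already been split into an IP and a list of delays; the variable names are illustrative only. Note that the exact means for the second and third hop come out at roughly 0.0165 and 0.0197, so the figures quoted above appear to be approximate.

```python
# Hop lines from the H11H12Trace.txt sample, already split into (ip, delays in ms).
hops = [
    ("10.0.11.10", [0.034, 0.007, 0.006, 0.005, 0.005, 0.005]),
    ("10.0.1.2", [0.017, 0.009, 0.008, 0.008, 0.040, 0.017]),
    ("10.0.12.100", [0.026, 0.016, 0.018, 0.014, 0.026, 0.018]),
]

# Path_Router: the routers in path order.
path_router = [ip for ip, _ in hops]

# Individuals: each router paired with the next one on the path.
individuals = list(zip(path_router, path_router[1:]))

# Path_Avg_Delay: mean of the probes recorded for each hop.
path_avg_delay = [sum(delays) / len(delays) for _, delays in hops]

# Path_Individuals_Delay: sum of the two hop averages in each pair.
path_individuals_delay = [
    a + b for a, b in zip(path_avg_delay, path_avg_delay[1:])
]

print(individuals)
print([round(v, 4) for v in path_avg_delay])
print([round(v, 4) for v in path_individuals_delay])
```

Pairing a list with itself shifted by one (`zip(xs, xs[1:])`) gives the consecutive pairs without any index arithmetic.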
Unfortunately I don't have much experience with Python yet, so I hope you can help.
Thank you very much.
import io
import pandas as pd

# create the dataframe you want, to populate later
columns = ['Source', 'Destination', 'Path_Router', 'Individuals', 'Path_Avg_Delay', 'Path_Individuals_Delay']
df_master = pd.DataFrame(columns=columns)

# depending on how you cleanse your data upfront, you can use the code close to as is or you'll need to adapt it to your input
d = """
1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005
2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017
3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018"""
files = [d]  # placeholder: one cleansed text block per trace file

for file in files:
    df1 = pd.read_csv(io.StringIO(file), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    # row-wise mean over the columns with delay data
    Path_Avg_Delay = df1.iloc[:, 2:8].mean(axis=1).tolist()

    Path_Individuals_Delay = []
    for i, avg in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append(Path_Avg_Delay[i] + Path_Avg_Delay[i+1])

    data_list = [Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay]
    df_master.loc[len(df_master)] = data_list  # add list to bottom of dataframe
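The cleansed input that the loop above expects could be produced from the raw file text along these lines. This is only a sketch: `cleanse_trace` is a hypothetical helper, and it assumes that the first line of every file is the "traceroute to ..." header and that every probe returned a delay (no `*` timeouts).

```python
def cleanse_trace(raw):
    """Return only the hop lines of a raw trace, without the 'ms' units."""
    # Everything after the first newline is the hop table (assumed layout).
    _header, _, body = raw.partition("\n")
    # Each delay is followed by " ms"; removing it leaves space-separated numbers.
    return body.replace(" ms", "").strip()


raw = (
    "traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets\n"
    "1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms\n"
    "2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms\n"
    "3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms\n"
)

d = cleanse_trace(raw)
print(d)
```

The returned string has the same shape as the hard-coded `d`, so it can be passed to `io.StringIO` unchanged.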
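Adding some guidance based on creating the dataframe up front, as you did. You need to pass that column to a function (look at apply() or transform(); I'm not sure which works best). So rewrite the code as follows and try it. You will of course have plenty of adjusting to do, because the data still contains all the "ms" units. I suggest stripping them in the first part of the function; I'll leave that to you. This is by no means a finished solution, but it should get you where you need to go.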
def transform_data(d):
    # d is the row element being passed in by apply()
    # you're getting the data string now and you need to massage it into df1
    # the values are returned as a Series so that apply() can assemble the result;
    # assigning to the global df from inside the function fails on mismatched lengths
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    # row-wise mean over the columns with delay data
    Path_Avg_Delay = df1.iloc[:, 2:8].mean(axis=1).tolist()

    Path_Individuals_Delay = []
    for i, avg in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append(Path_Avg_Delay[i] + Path_Avg_Delay[i+1])

    return pd.Series({
        'Source': Source,
        'Destination': Destination,
        'Path_Router': Path_Router,
        'Individuals': Individuals,
        'Path_Avg_Delay': Path_Avg_Delay,
        'Path_Individuals_Delay': Path_Individuals_Delay,
    })
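As I said, writing to df from inside the function passed to apply is likely to raise errors, so you may need to work around that, for example by returning the values from the function.
I believe you can call the function like this. Your text Series will be passed to the function row by row. Updating the df you are reading from is probably bad form, so again consider writing to df_master and merging it back into the original df.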
df = df.apply(lambda x: transform_data(x['text']), axis=1)  # axis=1 passes each row, so x['text'] is that row's file text
I'm not very familiar with df.transform(), but it may be worth investigating too.
Before reaching for regular expressions, you can read the file like this:
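The suggestion of writing to a separate df_master and merging it back could look roughly like the sketch below. To keep it short it uses a hypothetical stand-in function, `endpoints`, which only extracts Source and Destination, and the text values are shortened stand-ins for cleansed file contents rather than the real traces. When the function passed to `Series.apply` returns a Series, pandas assembles the results into a DataFrame with one column per key.

```python
import pandas as pd


def endpoints(d):
    # Hypothetical stand-in for transform_data: the second token on each
    # line is taken to be the hop IP (assumes cleansed, space-separated text).
    ips = [line.split()[1] for line in d.strip().splitlines()]
    return pd.Series({"Source": ips[0], "Destination": ips[-1]})


# Shortened stand-ins for the "text" column built from the trace files.
df = pd.DataFrame({
    "text": [
        "1 10.0.11.10 0.034\n2 10.0.1.2 0.017\n3 10.0.12.100 0.026",
        "1 10.0.13.10 0.036\n2 10.0.1.17 0.017\n3 10.0.34.100 0.030",
    ]
})

# Build the derived columns in a separate frame ...
df_master = df["text"].apply(endpoints)
# ... then attach them to the original rows by index.
combined = df.join(df_master)

print(combined[["Source", "Destination"]])
```

Because df_master keeps the index of df, `join` lines the rows up without needing a key column.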
with open('filename', 'r') as f:
    file = f.read()
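The with block closes the file automatically once it has been read. After that you can process the contents by reading the lines and splitting the text, or with regular expressions, although I would not recommend regex in this case.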
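As a rough illustration of the line-splitting route, the sketch below reads one trace file and turns each hop line into an IP plus a list of float delays, with no regular expressions. `parse_trace_file` is a hypothetical name, and the code assumes the layout of the sample files: one header line, then hop lines of the form "hop ip delay ms delay ms ...". The demo writes the H11H12Trace.txt sample to a temporary folder so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def parse_trace_file(path):
    """Return a list of (ip, [delays in ms]) tuples, one per hop line."""
    hops = []
    with open(path, "r") as f:
        next(f, None)  # skip the "traceroute to ..." header line
        for line in f:
            parts = line.split()
            if not parts:
                continue  # ignore blank lines
            # parts[0] is the hop number and parts[1] the router IP; the
            # remaining tokens alternate between a delay and the "ms" unit.
            delays = [float(p) for p in parts[2:] if p != "ms"]
            hops.append((parts[1], delays))
    return hops


sample = (
    "traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets\n"
    "1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms\n"
    "2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms\n"
    "3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms\n"
)

with tempfile.TemporaryDirectory() as tmp:
    trace_path = Path(tmp) / "H11H12Trace.txt"
    trace_path.write_text(sample)
    hops = parse_trace_file(trace_path)

print(hops[0])
```

A trace that contains timeouts (`*`) would make `float()` fail here, so real data may need an extra check before the conversion.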