How to create a dataframe from different txt files in Python?
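I would like to create a dataframe from several txt files in a folder on my desktop. The folder path is "C:\Users\luca\Desktop\traceroute". This is an image of the files inside the folder:
These are two of the N files inside the folder:
Example H11H12Trace.txt: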
traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms
Example H13H34Trace.txt:
traceroute to 10.0.34.100 (10.0.34.100), 30 hops max, 60 byte packets
1 10.0.13.10 0.036 ms 0.007 ms 0.005 ms 0.006 ms 0.005 ms 0.005 ms
2 10.0.1.17 0.017 ms 0.008 ms 0.008 ms 0.008 ms 0.008 ms 0.018 ms
3 10.0.1.14 0.020 ms 0.011 ms 0.011 ms 0.011 ms 0.020 ms 0.016 ms
4 10.0.6.6 0.031 ms 0.023 ms 0.023 ms 0.014 ms 0.013 ms 0.012 ms
5 10.0.7.6 0.023 ms 0.016 ms 0.016 ms 0.015 ms 0.015 ms 0.015 ms
6 10.0.3.5 0.026 ms 0.018 ms 0.022 ms 0.018 ms 0.040 ms 0.019 ms
7 10.0.3.2 0.028 ms 0.021 ms 0.020 ms 0.020 ms 0.021 ms 0.020 ms
8 10.0.34.100 0.030 ms 0.021 ms 0.021 ms 0.020 ms 0.022 ms 0.020 ms
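This is my code:
from collections import defaultdict
from pathlib import Path
import pandas as pd

# raw string, so the backslashes in the Windows path are not read as escape sequences
my_dir_path = r"C:\Users\luca\Desktop\traceroute"

# read the full text of every file in the folder into one "text" column
results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    with open(file, "r") as file_open:
        results["text"].append(file_open.read())
df = pd.DataFrame(results)
The resulting df is:
The dataframe I want looks something like this.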
Source      Destination  Path_Router    Individuals                Path_Avg_Delay  Path_Individuals_Delay
10.0.11.10  10.0.12.100  [10.0.11.10,   [(10.0.11.10, 10.0.1.2),   [0.0103,        [0.0266, 0.0359]
                          10.0.1.2,      (10.0.1.2, 10.0.12.100)]   0.0163,
                          10.0.12.100]                              0.0196]
10.0.13.10  10.0.34.100  ...            ...                        ...             ...
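"Path_Router" is the list of routers along the route.
"Individuals" is the list of consecutive router pairs: the first router with the second, then the second with the third, then the third with the fourth, and so on.
To create "Path_Avg_Delay", I want to average each router's row. For example, 10.0.11.10 gets the mean of its 6 delays: (0.034 + 0.007 + 0.006 + 0.005 + 0.005 + 0.005) / 6 = 0.0103, and so on.
To create "Path_Individuals_Delay", I want to add the average delays of the two routers in each pair, i.e. 0.0103 + 0.0163 = 0.0266 and 0.0163 + 0.0196 = 0.0359.
Unfortunately, I don't have much experience with Python yet, so I hope you can help.
Thank you very much.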
import io
import pandas as pd

# create the dataframe you want up front, to populate later
columns = ['Source', 'Destination', 'Path_Router', 'Individuals', 'Path_Avg_Delay', 'Path_Individuals_Delay']
df_master = pd.DataFrame(columns=columns)

# depending on how you cleanse your data upfront, you can use the code close to as is, or you'll need to adapt it to your input
d = """
1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005
2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017
3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018"""

# files is assumed to be your list of input files, e.g. list(Path(my_dir_path).iterdir())
for file in files:
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element

    # pair each router with the next one on the path
    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    # mean delay per hop
    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean())  # columns with delay data

    # sum of the mean delays of each consecutive pair of hops
    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))

    data_list = [Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay]
    df_master.loc[len(df_master)] = data_list  # add list to bottom of dataframe
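The code above expects input that has already been cleansed. Here is a minimal sketch of that step, assuming every hop line has the form "hop ip t1 ms t2 ms ..." with no "*" timeout entries. The helper name cleanse and the RAW sample string are only illustrative and are not part of the original answer.

```python
import io
import pandas as pd

# Sample taken from H11H12Trace.txt in the question.
RAW = """traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms
2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms
3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms
"""

def cleanse(raw_text):
    """Hypothetical helper: drop the header line and the 'ms' unit tokens."""
    hop_lines = raw_text.strip().splitlines()[1:]  # skip the "traceroute to ..." header
    cleaned = []
    for line in hop_lines:
        tokens = [tok for tok in line.split() if tok != "ms"]
        cleaned.append(" ".join(tokens))
    return "\n".join(cleaned)

d = cleanse(RAW)
# Columns: 0 = hop number, 1 = ip, 2..7 = the six delays
df1 = pd.read_csv(io.StringIO(d), sep=" ", header=None)
```

The resulting d has the same shape as the hand-written string in the answer, so it can be fed to the loop unchanged.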
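Adding some guidance based on creating the dataframe up front, as you did. You need to pass that column to a function (look at apply() or transform(); I'm not sure which works best). So rewrite the code as follows and try it. Of course you'll have a lot of adjusting to do, because the data still contains all the "ms" units. I'd suggest removing them in the first part of the function. I'll leave that to you. This is by no means a complete solution, but it should get you where you need to go.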
import io
import pandas as pd

def transform_data(d):
    # d is the row element being passed in by apply()
    # you're getting the data string now and you need to massage it into df1
    # if your main df is called df, I think you can write directly to it; if not, you can create a separate df and then merge the two at the end
    # caution: the df[...] assignments below target whole columns of the outer df, not the current row
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None)  # <<<<<<< your cleansed data here
    Path_Router = df1[1].tolist()  # <<<<<<< ip column
    Source = Path_Router[0]  # first element
    Destination = Path_Router[-1]  # last element
    df['Path_Router'] = Path_Router
    df['Destination'] = Destination
    df['Source'] = Source

    # pair each router with the next one on the path
    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]):  # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))
    df['Individuals'] = Individuals

    # mean delay per hop
    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean())  # columns with delay data
    df['Path_Avg_Delay'] = Path_Avg_Delay

    # sum of the mean delays of each consecutive pair of hops
    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]):  # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))
    df['Path_Individuals_Delay'] = Path_Individuals_Delay
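Like I said, you may run into some errors when you try to write to df from inside the apply statement, so you may have to work that out.
I believe you can call the function like this. Your text series will be passed to the function row by row. Updating the df you are reading from is probably bad form, so again, consider writing to df_master and merging it into the original df.
df = df.apply(lambda x: transform_data(x['text']), axis=1)  # axis=1 passes each row, so x['text'] is that row's text
I'm not very familiar with df.transform(), but it is also worth looking into.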
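One way to avoid writing to df from inside apply() is to have the function return its results and collect them in a separate frame, in the spirit of the df_master suggestion. The sketch below assumes the "text" column already holds cleansed strings (no header line, no "ms" tokens). Returning a pd.Series and joining it back with df.join are my choices for illustration, not the original answer's code. Values computed from the sample may differ in the last digit from the hand-rounded figures in the question.

```python
import io
import pandas as pd

def transform_data(d):
    """Return the derived columns for one cleansed traceroute string."""
    df1 = pd.read_csv(io.StringIO(d), sep=" ", header=None)
    path = df1[1].tolist()                           # ip column
    avg = df1.iloc[:, 2:8].mean(axis=1).tolist()     # mean of the six delays per hop
    return pd.Series({
        "Source": path[0],
        "Destination": path[-1],
        "Path_Router": path,
        "Individuals": list(zip(path, path[1:])),    # consecutive router pairs
        "Path_Avg_Delay": avg,
        "Path_Individuals_Delay": [a + b for a, b in zip(avg, avg[1:])],
    })

# Stand-in for the question's df, with one already-cleansed file (assumption).
df = pd.DataFrame({"text": [
    "1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005\n"
    "2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017\n"
    "3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018",
]})

df_master = df["text"].apply(transform_data)  # one row of derived columns per file
result = df.join(df_master)                   # original text plus derived columns
```

Because the function has no side effects, it can be tested on a single string before being applied to the whole column.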
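Before reaching for regular expressions, you can read the file like this:
with open('filename', 'r') as f:
    file = f.read()
This approach closes the file automatically once it has been read. After that you can process the contents by reading the lines and splitting the text, or by using regular expressions, though I would not recommend regular expressions in this case.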
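The line-splitting approach might look like the sketch below, which uses only str.split and no regular expressions. It assumes the first line of each file is the "traceroute to ..." header and that hop lines contain no "*" timeout entries. The function name parse_trace and the temporary demo file are hypothetical.

```python
import tempfile
from pathlib import Path

def parse_trace(path):
    """Parse one traceroute file into (ip, [delays]) tuples using str.split only."""
    with open(path, "r") as f:
        lines = f.read().splitlines()
    hops = []
    for line in lines[1:]:              # lines[0] is the "traceroute to ..." header
        tokens = line.split()
        if len(tokens) < 3:
            continue                    # skip blank or malformed lines
        ip = tokens[1]                  # tokens[0] is the hop number
        delays = [float(tok) for tok in tokens[2:] if tok != "ms"]
        hops.append((ip, delays))
    return hops

# Demo on a temporary copy of the H11H12Trace.txt sample from the question.
sample = (
    "traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets\n"
    "1 10.0.11.10 0.034 ms 0.007 ms 0.006 ms 0.005 ms 0.005 ms 0.005 ms\n"
    "2 10.0.1.2 0.017 ms 0.009 ms 0.008 ms 0.008 ms 0.040 ms 0.017 ms\n"
    "3 10.0.12.100 0.026 ms 0.016 ms 0.018 ms 0.014 ms 0.026 ms 0.018 ms\n"
)
with tempfile.TemporaryDirectory() as tmp:
    trace_file = Path(tmp) / "H11H12Trace.txt"
    trace_file.write_text(sample)
    hops = parse_trace(trace_file)
```

Each entry in hops pairs a router IP with its list of delays as floats, which is enough to compute the per-hop averages and pair sums the question asks for.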