How to create a dataframe from different txt files in Python?

Question

I would like to create a dataframe from several txt files in a folder on my desktop. The folder path is "C:\Users\luca\Desktop\traceroute".

These are two of the N files in the folder:

Example H11H12Trace.txt:

traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
 1  10.0.11.10  0.034 ms  0.007 ms  0.006 ms  0.005 ms  0.005 ms  0.005 ms 
 2  10.0.1.2  0.017 ms  0.009 ms  0.008 ms  0.008 ms  0.040 ms  0.017 ms   
 3  10.0.12.100  0.026 ms  0.016 ms  0.018 ms  0.014 ms  0.026 ms  0.018 ms

Example H13H34Trace.txt:

traceroute to 10.0.34.100 (10.0.34.100), 30 hops max, 60 byte packets
 1  10.0.13.10  0.036 ms  0.007 ms  0.005 ms  0.006 ms  0.005 ms  0.005 ms
 2  10.0.1.17  0.017 ms  0.008 ms  0.008 ms  0.008 ms  0.008 ms  0.018 ms
 3  10.0.1.14  0.020 ms  0.011 ms  0.011 ms  0.011 ms  0.020 ms  0.016 ms
 4  10.0.6.6  0.031 ms  0.023 ms  0.023 ms  0.014 ms  0.013 ms  0.012 ms
 5  10.0.7.6  0.023 ms  0.016 ms  0.016 ms  0.015 ms  0.015 ms  0.015 ms
 6  10.0.3.5  0.026 ms  0.018 ms  0.022 ms  0.018 ms  0.040 ms  0.019 ms
 7  10.0.3.2  0.028 ms  0.021 ms  0.020 ms  0.020 ms  0.021 ms  0.020 ms
 8  10.0.34.100  0.030 ms  0.021 ms  0.021 ms  0.020 ms  0.022 ms 0.020 ms

This is my code:

from collections import defaultdict
from pathlib import Path
import pandas as pd

# raw string, so the backslashes in the Windows path are not read as escape sequences
my_dir_path = r"C:\Users\luca\Desktop\traceroute"

results = defaultdict(list)
for file in Path(my_dir_path).iterdir():
    with open(file, "r") as file_open:
        results["text"].append(file_open.read())
df = pd.DataFrame(results)

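df ends up with a single "text" column holding the raw contents of each file, one row per file.

The dataframe I want looks something like this: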

Source      Destination  Path_Router                          Individuals                                        Path_Avg_Delay            Path_Individuals_Delay
10.0.11.10  10.0.12.100  [10.0.11.10, 10.0.1.2, 10.0.12.100]  [(10.0.11.10, 10.0.1.2), (10.0.1.2, 10.0.12.100)]  [0.0103, 0.0163, 0.0196]  [0.0266, 0.0359]
10.0.13.10  10.0.34.100  ...                                  ...                                                ...                       ...

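"Path_Router" is the list of routers along the route.

"Individuals" is the list of pairs of consecutive routers: the first router with the second, then the second with the third, then the third with the fourth, and so on.

To create "Path_Avg_Delay", I want to average each router's row. For example, 10.0.11.10 gets the mean of its 6 delays: (0.034 + 0.007 + 0.006 + 0.005 + 0.005 + 0.005) / 6 = 0.0103, and so on.

To create "Path_Individuals_Delay", I want to add up the average delays of the two routers in each pair, i.e. 0.0103 + 0.0163 = 0.0266 and 0.0163 + 0.0196 = 0.0359.

Unfortunately I don't have much experience with Python yet, and I hope you can help.

Thank you very much.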

Answer 1

import io
import pandas as pd

# create the dataframe up front so it can be populated later
columns = ['Source', 'Destination', 'Path_Router', 'Individuals', 'Path_Avg_Delay', 'Path_Individuals_Delay']
df_master = pd.DataFrame(columns=columns)

# depending on how you cleanse your data up front, you can use the code close to as is, or you will need to adapt it to your input

d = """
1 10.0.11.10 0.034 0.007 0.006 0.005 0.005 0.005
2 10.0.1.2 0.017 0.009 0.008 0.008 0.040 0.017
3 10.0.12.100 0.026 0.016 0.018 0.014 0.026 0.018"""

files = ["H11H12Trace.txt"] # placeholder: replace with the list of files you want to process

for file in files:
    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None) #<<<<<<< your cleansed data here

    Path_Router = df1[1].tolist() #<<<<<<< ip column
    Source = Path_Router[0] # first element
    Destination = Path_Router[-1] # last element

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]): # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean()) # columns with delay data

    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]): # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))

    data_list = [Source, Destination, Path_Router, Individuals, Path_Avg_Delay, Path_Individuals_Delay]
    df_master.loc[len(df_master)] = data_list # add list to bottom of dataframe
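
For the cleansing step itself, here is a minimal sketch of one way to get from the raw file text to the format of d above. The helper name cleanse is made up for illustration, and it assumes every hop line starts with the hop number and contains no "*" timeout entries, as in the two sample files.

```python
def cleanse(raw_text):
    """Turn raw traceroute output into lines of 'hop ip delay delay ...'."""
    cleaned = []
    for line in raw_text.splitlines():
        tokens = line.split()  # split() also collapses repeated spaces
        if not tokens or not tokens[0].isdigit():
            continue  # skip blank lines and the "traceroute to ..." header
        # keep everything except the "ms" unit markers
        cleaned.append(" ".join(t for t in tokens if t != "ms"))
    return "\n".join(cleaned)


raw = """traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets
 1  10.0.11.10  0.034 ms  0.007 ms  0.006 ms  0.005 ms  0.005 ms  0.005 ms
 2  10.0.1.2  0.017 ms  0.009 ms  0.008 ms  0.008 ms  0.040 ms  0.017 ms
 3  10.0.12.100  0.026 ms  0.016 ms  0.018 ms  0.014 ms  0.026 ms  0.018 ms"""

d = cleanse(raw)
print(d)
```

The resulting string is single-space separated, so it can be passed to pd.read_csv(io.StringIO(d), sep=' ', header=None) as in the loop above.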

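Adding some guidance for the case where the dataframe is created up front, as you did. You need to pass that column to a function (look at apply() or transform(); I'm not sure which is more efficient). So rewrite the code as follows and give it a try. You will of course need plenty of tweaks, because the data still contains all the "ms" units. I suggest stripping them in the first part of the function; I'll leave that to you. This is by no means a finished solution, but it should get you where you need to go.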

def transform_data(d):
    # d is the row element (the text of one file) being passed in by apply()
    # you're getting the data string now and you need to massage it into df1
    # the results are returned as a Series rather than written to df, so that
    # apply(..., axis=1) can assemble them into one row per file

    df1 = pd.read_csv(io.StringIO(d), sep=' ', header=None) #<<<<<<< your cleansed data here

    Path_Router = df1[1].tolist() #<<<<<<< ip column
    Source = Path_Router[0] # first element
    Destination = Path_Router[-1] # last element

    Individuals = []
    for i, ip in enumerate(Path_Router[:-1]): # iterate through list; stop before last element
        Individuals.append((Path_Router[i], Path_Router[i+1]))

    Path_Avg_Delay = []
    for i, row in df1.iterrows():
        Path_Avg_Delay.append(row.iloc[2:8].mean()) # columns with delay data

    Path_Individuals_Delay = []
    for i, ip in enumerate(Path_Avg_Delay[:-1]): # iterate through list; stop before last element
        Path_Individuals_Delay.append((Path_Avg_Delay[i] + Path_Avg_Delay[i+1]))

    return pd.Series({
        'Source': Source,
        'Destination': Destination,
        'Path_Router': Path_Router,
        'Individuals': Individuals,
        'Path_Avg_Delay': Path_Avg_Delay,
        'Path_Individuals_Delay': Path_Individuals_Delay,
    })

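As I said, you may run into errors if you try to write to df from inside the function passed to apply, so you may need to work that out.

I believe you can call the function like this. Your text Series will be passed to the function row by row. Updating the df you are reading from is probably bad form, so again consider writing to df_master and merging it back into the original df.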

df = df.apply(lambda x: transform_data(x['text']), axis=1) # axis=1 passes one row at a time

I'm not very familiar with df.transform(), but that could be worth investigating too.
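
To make the "write to a separate dataframe and merge" idea concrete, here is a rough sketch rather than a definitive implementation. The helper name trace_to_row and the one-row stand-in for df are assumptions for illustration. It also assumes every file has the one-line "traceroute to ..." header followed by hop lines with the same number of delay values, and it lets read_csv split on whitespace after removing the " ms" units.

```python
import io
import pandas as pd


def trace_to_row(text):
    """Hypothetical helper: raw text of one traceroute file -> one result row."""
    # Remove the " ms" units, skip the header line, split on runs of whitespace.
    df1 = pd.read_csv(io.StringIO(text.replace(" ms", "")),
                      sep=r"\s+", header=None, skiprows=1)
    routers = df1[1].tolist()                     # column 1 holds the IPs
    avg = df1.iloc[:, 2:].mean(axis=1).tolist()   # mean delay per hop
    return pd.Series({
        "Source": routers[0],
        "Destination": routers[-1],
        "Path_Router": routers,
        "Individuals": list(zip(routers, routers[1:])),
        "Path_Avg_Delay": avg,
        "Path_Individuals_Delay": [a + b for a, b in zip(avg, avg[1:])],
    })


# Stand-in for the df from the question: one raw file per row in "text".
df = pd.DataFrame({"text": [
    "traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets\n"
    " 1  10.0.11.10  0.034 ms  0.007 ms  0.006 ms  0.005 ms  0.005 ms  0.005 ms\n"
    " 2  10.0.1.2  0.017 ms  0.009 ms  0.008 ms  0.008 ms  0.040 ms  0.017 ms\n"
    " 3  10.0.12.100  0.026 ms  0.016 ms  0.018 ms  0.014 ms  0.026 ms  0.018 ms\n",
]})

# The function returns a Series, so apply(axis=1) expands it into columns.
df_master = df.apply(lambda row: trace_to_row(row["text"]), axis=1)

# Attach the new columns to the original rows by index; df is left untouched.
result = df.join(df_master)
print(result.drop(columns="text"))
```

Because df_master is built separately and joined afterwards, nothing is written to df while apply is iterating over it.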

Answer 2

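Before reaching for regular expressions, you can read the file like this:

with open('filename', 'r') as f:
    file = f.read()

The with statement takes care of closing the file and releasing its resources once it has been read. After that you can process the file by reading it line by line and splitting the text, or by using regular expressions, although I would not recommend regular expressions in this case.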
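
As an illustration of the line-by-line approach, here is a small sketch that uses only the standard library. The function name read_hops and the temporary demo folder are assumptions for the example, and it assumes each file has a single header line followed by hop lines without "*" timeouts.

```python
import tempfile
from pathlib import Path


def read_hops(path):
    """Return [(router_ip, [delays, ...]), ...] for one traceroute file."""
    hops = []
    with open(path, "r") as f:
        next(f, None)  # skip the "traceroute to ..." header line
        for line in f:
            parts = line.split()  # plain whitespace split, no regex needed
            if not parts:
                continue  # ignore blank lines
            delays = [float(p) for p in parts[2:] if p != "ms"]
            hops.append((parts[1], delays))
    return hops


# Demo on a throwaway folder holding one sample file from the question.
with tempfile.TemporaryDirectory() as folder:
    sample = Path(folder) / "H11H12Trace.txt"
    sample.write_text(
        "traceroute to 10.0.12.100 (10.0.12.100), 30 hops max, 60 byte packets\n"
        " 1  10.0.11.10  0.034 ms  0.007 ms  0.006 ms  0.005 ms  0.005 ms  0.005 ms\n"
        " 2  10.0.1.2  0.017 ms  0.009 ms  0.008 ms  0.008 ms  0.040 ms  0.017 ms\n"
        " 3  10.0.12.100  0.026 ms  0.016 ms  0.018 ms  0.014 ms  0.026 ms  0.018 ms\n"
    )
    for path in sorted(Path(folder).glob("*.txt")):
        hops = read_hops(path)
        # mean delay per hop, rounded to 4 decimals
        averages = [round(sum(d) / len(d), 4) for _, d in hops]
        print(path.name, [ip for ip, _ in hops], averages)
```

Pointing the glob at the real traceroute folder instead of the temporary one would process every txt file in it the same way.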

Answer 1: score 1, accepted, 2020-12-14 20:05:40

Answer 2: score 0, 2020-12-14 19:57:08
