How to create a pandas dataframe from a file with no common separator?
I have routing table data that looks like this:
Valid Network        Next-Hop        Path            Protocol
0     1.0.128.0/17   80.249.208.85   3257 38040 9737 i
0                    80.249.209.150  6939 4766 38040 9737 i
1                    80.249.209.37   3491 38040 9737 i
0                    80.249.211.42   6762 38040 9737 i
0                    80.249.208.85   3257 38040 9737 i
1                    80.249.209.37   3491 38040 9737 i
0                    80.249.211.42   6762 38040 9737 i
I want to create a DataFrame with the same column names as in the header, with the prefixes going into the Network column. The problem is that not all rows have a prefix, so I need to fill in the most recent prefix (the last one seen). Here is what I did:
f = open('initial_data')
current_prefix = None
for index, line in enumerate(f):
    if index != 0 and index != 1058274:  # ignoring first and last line
        if line.split(' ')[2].startswith(str(1)) or line.split(' ')[2].startswith(str(2)):
            current_prefix = np.asarray(line.split(' ')[2])  # storing the most recent prefix
            #print(current_prefix)
            df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
            df.append(df2, ignore_index=True)
        else:
            #df['Network'].append(current_prefix)
            df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
            df.append(df2, ignore_index=True)
But the prefix (e.g. 1.0.128.0/17) gets interpreted as having 7 columns, and I get this error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-100-f2ee3d75b5c4> in <module>()
6 current_prefix = np.asarray(line.split(' ')[2])
7 #print(current_prefix)#.shape,type(current_prefix))
----> 8 df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
9 df.append(df2,ignore_index = True)#current_prefix)
AssertionError: 7 columns passed, passed data had 1 columns
So, is there a better/cleaner way to solve this? More precisely, I want the DataFrame to look like this:
Valid | Network | Next-Hop | Path | Protocol
0 | 1.0.128.0/17 | 80.249.208.85 | 3257 38040 9737 | i
0 |NaN/aboveprefix | 80.249.209.150| 6939 4766 38040 9737 | i
And so on. Any clues?
I solved this by first writing out a new file with a \t separator and then using pandas.read_csv('file_name', sep='\t'). The main trick here is counting the '/' and '.' characters in each line:
def classify(strx):
    # call this on each line
    next_hop = 0
    prefix = ""
    path = []
    for i in strx.split(' '):
        slash_counter, dot_counter = i.count('/'), i.count('.')
        if dot_counter == 3 and slash_counter == 1:
            prefix = i          # e.g. 1.0.128.0/17
        elif dot_counter == 3 and slash_counter == 0:
            next_hop = i        # e.g. 80.249.208.85
        elif len(i) > 1 and dot_counter == 0 and slash_counter == 0:
            path.append(i)      # AS numbers in the path
    path = " ".join(str(p) for p in path)
    protocol = strx[-1]
    return (prefix, next_hop, path, protocol)
An example. The original line:
'0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i'
converted with the function above becomes:
1.0.128.0/17 80.249.208.85 3257 38040 9737 i
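Assembling the full DataFrame from classify() is then a matter of forward-filling the empty prefixes. A minimal sketch of that step (the column names and the two-line inline sample are assumptions for illustration; like the original classify, it does not keep the Valid column):

```python
import pandas as pd

def classify(strx):
    # identify each field by counting '.' and '/' in the token
    prefix, next_hop, path = "", "", []
    for tok in strx.split():
        if tok.count('.') == 3 and tok.count('/') == 1:
            prefix = tok                      # network, e.g. 1.0.128.0/17
        elif tok.count('.') == 3:
            next_hop = tok                    # next-hop address
        elif tok.isdigit() and len(tok) > 1:
            path.append(tok)                  # AS numbers in the path
    return prefix, next_hop, " ".join(path), strx.split()[-1]

lines = ['0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i',
         '0 80.249.209.150 6939 4766 38040 9737 i']
df = pd.DataFrame([classify(l) for l in lines],
                  columns=['Network', 'Next-Hop', 'Path', 'Protocol'])
# carry the most recently seen prefix into the rows that lack one
df['Network'] = df['Network'].mask(df['Network'] == '').ffill()
print(df)
```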
The accepted answer is certainly one way to do this, but pandas offers a way to extract the fields and create the DataFrame exactly as you describe, by passing a regex to the str.extract() method of a Series object. The missing values in the Network field are then filled in with .fillna() using method='ffill'. This approach would go something like the following.
import io
import pandas as pd
import re
f = io.StringIO('''\
Valid Network        Next-Hop        Path            Protocol
0     1.0.128.0/17   80.249.208.85   3257 38040 9737 i
0                    80.249.209.150  6939 4766 38040 9737 i
1                    80.249.209.37   3491 38040 9737 i
0                    80.249.211.42   6762 38040 9737 i
0                    80.249.208.85   3257 38040 9737 i
1                    80.249.209.37   3491 38040 9737 i
0                    80.249.211.42   6762 38040 9737 i
''')
pattern = re.compile(r'^(?P<valid>[01])'
r'\s+(?P<network>\d+\.\d+\.\d+\.\d+/\d+)?'
r'\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+)'
r'\s+(?P<path>(?:\d+ )+)'
r'(?P<protocol>[a-z])$')
next(f) #skip the first line
df = (pd.Series(f.readlines())
.str.extract(pattern, expand=False)
.fillna(method='ffill'))
This results in a DataFrame that looks like this:
Out [26]:
valid network next_hop path protocol
0 0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
1 0 1.0.128.0/17 80.249.209.150 6939 4766 38040 9737 i
2 1 1.0.128.0/17 80.249.209.37 3491 38040 9737 i
3 0 1.0.128.0/17 80.249.211.42 6762 38040 9737 i
4 0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
5 1 1.0.128.0/17 80.249.209.37 3491 38040 9737 i
6 0 1.0.128.0/17 80.249.211.42 6762 38040 9737 i
If you don't want the missing Network values filled in, you can simply drop the call to .fillna().
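For instance, with the .fillna() call removed, rows without a prefix keep NaN in the network column. A quick sketch on a two-line sample (the sample lines are assumptions; note the extra spaces where the Network field is empty, which the \s+ in the pattern relies on):

```python
import re
import pandas as pd

pattern = re.compile(r'^(?P<valid>[01])'
                     r'\s+(?P<network>\d+\.\d+\.\d+\.\d+/\d+)?'
                     r'\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+)'
                     r'\s+(?P<path>(?:\d+ )+)'
                     r'(?P<protocol>[a-z])$')

lines = ['0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i',
         '0              80.249.209.150 6939 4766 38040 9737 i']
# no .fillna() here: the second row's network stays NaN
df = pd.Series(lines).str.extract(pattern)
print(df['network'].isna().tolist())
```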