How to create a pandas dataframe from a file with no common separator?
I have routing table data that looks something like this:
Valid Network Next-Hop Path Protocol
0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
0 80.249.209.150 6939 4766 38040 9737 i
1 80.249.209.37 3491 38040 9737 i
0 80.249.211.42 6762 38040 9737 i
0 80.249.208.85 3257 38040 9737 i
1 80.249.209.37 3491 38040 9737 i
0 80.249.211.42 6762 38040 9737 i
I want to create a DataFrame with those same column names as the header, and with the prefix in the Network column. The problem is that not all lines have a prefix, so I need to fill in the latest (most recently seen) prefix. This is what I did:
f = open('initial_data')
current_prefix = None
for index, line in enumerate(f):
    if index != 0 and index != 1058274:  # ignoring first and last line
        if line.split(' ')[2].startswith('1') or line.split(' ')[2].startswith('2'):
            current_prefix = np.asarray(line.split(' ')[2])  # storing the most recent prefix
            df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
            df.append(df2, ignore_index=True)
        else:
            df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
            df.append(df2, ignore_index=True)
But the prefix (eg 1.0.128.0/17) is interpreted as having 7 columns and I get this error:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-100-f2ee3d75b5c4> in <module>()
6 current_prefix = np.asarray(line.split(' ')[2])
7 #print(current_prefix)#.shape,type(current_prefix))
----> 8 df2 = pd.DataFrame([[current_prefix]], columns=list('Network'))
9 df.append(df2,ignore_index = True)#current_prefix)
AssertionError: 7 columns passed, passed data had 1 columns
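For what it's worth, the assertion comes from `columns=list('Network')`: calling `list()` on a string splits it into its individual characters, so pandas is told to expect seven columns named 'N', 'e', 't', and so on, while only one value is passed. A minimal sketch (with an illustrative prefix value, not the original data) of the problem and the one-character fix:

```python
import pandas as pd

# list() on a string explodes it into single characters --
# pandas then expects seven columns, hence the AssertionError above
print(list('Network'))  # ['N', 'e', 't', 'w', 'o', 'r', 'k']

# Wrapping the name in a list literal gives one column named "Network"
df2 = pd.DataFrame([['1.0.128.0/17']], columns=['Network'])
print(df2)
```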
So is there any better/cleaner way to deal with this? To be more precise, I would like the DataFrame to look something like this:
Valid | Network | Next-Hop | Path | Protocol
0 | 1.0.128.0/17 | 80.249.208.85 | 3257 38040 9737 | i
0 |NaN/aboveprefix | 80.249.209.150| 6939 4766 38040 9737 | i
and so on. Any leads?
I solved this by first writing a new file with \t separators and then using pandas.read_csv('file_name', sep='\t'). The main trick was to count the '/' and '.' characters in each token of the line:
def classify(strx):
    # call this on each line
    next_hop = 0
    prefix = ""
    path = []
    for i in strx.split(' '):
        slash_counter, dot_counter = i.count('/'), i.count('.')
        if dot_counter == 3 and slash_counter == 1:
            prefix = i      # e.g. 1.0.128.0/17
        elif dot_counter == 3 and slash_counter == 0:
            next_hop = i    # e.g. 80.249.208.85
        elif len(i) > 1 and dot_counter == 0 and slash_counter == 0:
            path.append(i)  # AS numbers along the path
    path = " ".join(str(i) for i in path)
    protocol = strx[-1]
    return (prefix, next_hop, path, protocol)
An example. Original line:
'0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i'
After converting with the above function:
1.0.128.0/17 80.249.208.85 3257 38040 9737 i
The accepted answer is certainly one way to do this, but pandas provides a way to extract fields and fill in missing values exactly as you've described: pass a regex to str.extract() on a Series object to create a DataFrame, then call .fillna() with method='ffill' to fill in the missing values in the "Network" field. This approach would be accomplished by something like the following.
import io
import pandas as pd
import re
f = io.StringIO('''\
Valid Network Next-Hop Path Protocol
0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
0 80.249.209.150 6939 4766 38040 9737 i
1 80.249.209.37 3491 38040 9737 i
0 80.249.211.42 6762 38040 9737 i
0 80.249.208.85 3257 38040 9737 i
1 80.249.209.37 3491 38040 9737 i
0 80.249.211.42 6762 38040 9737 i
''')
pattern = re.compile(r'^(?P<valid>[01])'
r'\s+(?P<network>\d+\.\d+\.\d+\.\d+/\d+)?'
r'\s+(?P<next_hop>\d+\.\d+\.\d+\.\d+)'
r'\s+(?P<path>(?:\d+ )+)'
r'(?P<protocol>[a-z])$')
next(f) #skip the first line
df = (pd.Series(f.readlines())
.str.extract(pattern, expand=False)
.fillna(method='ffill'))
This results in a DataFrame that looks like:
Out [26]:
valid network next_hop path protocol
0 0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
1 0 1.0.128.0/17 80.249.209.150 6939 4766 38040 9737 i
2 1 1.0.128.0/17 80.249.209.37 3491 38040 9737 i
3 0 1.0.128.0/17 80.249.211.42 6762 38040 9737 i
4 0 1.0.128.0/17 80.249.208.85 3257 38040 9737 i
5 1 1.0.128.0/17 80.249.209.37 3491 38040 9737 i
6 0 1.0.128.0/17 80.249.211.42 6762 38040 9737 i
If you don't want the missing "Network" values filled in, you can remove the call to .fillna().
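One small caveat: in recent pandas versions the `method` argument of `fillna()` is deprecated, so a forward-compatible spelling of the fill step uses `.ffill()` directly. A minimal illustration (with made-up prefix values):

```python
import pandas as pd

s = pd.Series(['1.0.128.0/17', None, None, '2.0.0.0/8', None])

# .ffill() propagates the last seen value forward, equivalent to
# the older fillna(method='ffill') spelling
print(s.ffill().tolist())
# ['1.0.128.0/17', '1.0.128.0/17', '1.0.128.0/17', '2.0.0.0/8', '2.0.0.0/8']
```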