Combine three columns into one in CSV file with python and pandas
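Hi, I'm trying to combine several existing columns into one new column and then delete the three original columns in a CSV file. I've been trying to do this with pandas, but without much luck. I'm new to Python.
My code first merges several CSV files in the same directory and then tries to manipulate the columns. The first merge works, and I get output.csv containing the merged data, but combining the columns does not.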
import glob
import pandas as pd
interesting_files = glob.glob("*.csv")
header_saved = False
with open('output.csv','wb') as fout:
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)
                header_saved = True
            for line in fin:
                fout.write(line)
df = pd.read_csv("output.csv")
df['HostAffected']=df['Host'] + "/" + df['Protocol'] + "/" + df['Port']
df.to_csv("newoutput.csv")
Effectively, I want to turn this:
Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670
into something like this:
HostsAffected
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.10/tcp/445
10.0.0.11/tcp/445
10.0.0.11/tcp/49707
10.0.0.11/tcp/49672
10.0.0.11/tcp/49670
10.0.0.11/tcp/49668
10.0.0.11/tcp/49667
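There are other columns in the CSV as well.
I'm not a coder, I'm just trying to solve a problem, so any help is much appreciated.
As I see it, we have three options: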
%timeit df['Host'] + "/" + df['Protocol'] + "/" + df['Port'].map(str)
%timeit ['/'.join(i) for i in zip(df['Host'],df['Protocol'],df['Port'].map(str))]
%timeit ['/'.join(i) for i in df[['Host','Protocol','Port']].astype(str).values]
Timings:
10 loops, best of 3: 39.7 ms per loop
10 loops, best of 3: 35.9 ms per loop
10 loops, best of 3: 162 ms per loop
Although it is the slowest of the three, I think the last option is the most readable for you:
from io import StringIO
import pandas as pd

data = '''\
ID,Host,Protocol,Port
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,445
1,10.0.0.10,tcp,49707
1,10.0.0.10,tcp,49672
1,10.0.0.10,tcp,49670'''

df = pd.read_csv(StringIO(data))  # Recreates a sample dataframe
cols = ['Host', 'Protocol', 'Port']
# Convert every value to str, then join each row with '/'
newcol = ['/'.join(i) for i in df[cols].astype(str).values]
# Add the combined column and drop the three original columns
df = df.assign(HostAffected=newcol).drop(columns=cols)
print(df)
Output:
ID HostAffected
0 1 10.0.0.10/tcp/445
1 1 10.0.0.10/tcp/445
2 1 10.0.0.10/tcp/445
3 1 10.0.0.10/tcp/445
4 1 10.0.0.10/tcp/445
5 1 10.0.0.10/tcp/445
6 1 10.0.0.10/tcp/445
7 1 10.0.0.10/tcp/49707
8 1 10.0.0.10/tcp/49672
9 1 10.0.0.10/tcp/49670
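The three options timed above should give the same strings, so the choice is about speed and readability only. A quick check, as a minimal sketch on a small made-up frame (the rows and the names opt1, opt2 and opt3 are illustrative, not from the question):

```python
import pandas as pd

# Made-up sample; Port is an integer column, as read_csv would infer it.
df = pd.DataFrame({
    "Host": ["10.0.0.10", "10.0.0.11"],
    "Protocol": ["tcp", "udp"],
    "Port": [445, 53],
})

# Option 1: vectorised string concatenation.
opt1 = (df["Host"] + "/" + df["Protocol"] + "/" + df["Port"].map(str)).tolist()
# Option 2: join over the zipped columns.
opt2 = ["/".join(i) for i in zip(df["Host"], df["Protocol"], df["Port"].map(str))]
# Option 3: join over the rows of a string-converted sub-frame.
opt3 = ["/".join(i) for i in df[["Host", "Protocol", "Port"]].astype(str).values]

print(opt1 == opt2 == opt3)
```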
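There are two ways to do this: use vectorised functions to combine the series, or use a lambda function with pd.DataFrame.apply.
Vectorised solution
Don't forget to convert non-string types (here the integer Port column) to str.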
df['HostAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)
Performance note: Converting a series of ints to strings - Why is apply much faster than astype?
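To make the conversion step concrete, here is a minimal sketch, assuming the file is read with read_csv so that Port is inferred as an integer column. Adding that integer column directly to the string columns is what fails in the question's code. The two-row sample is made up, and the sketch only checks that map(str) and astype(str) produce the same strings; it says nothing about their relative speed.

```python
from io import StringIO

import pandas as pd

# Two made-up rows in the same layout as the question's data.
csv_text = "Host,Protocol,Port\n10.0.0.10,tcp,445\n10.0.0.10,tcp,49707\n"
df = pd.read_csv(StringIO(csv_text))

# read_csv infers Port as an integer column, not as strings.
port_is_int = pd.api.types.is_integer_dtype(df["Port"])

# Either conversion yields the same strings.
via_map = df["Port"].map(str)
via_astype = df["Port"].astype(str)

# With Port converted to str, the concatenation works.
combined = df["Host"] + "/" + df["Protocol"] + "/" + via_map
print(port_is_int)
print(combined.tolist())
```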
Applying a lambda function
df['HostsAffected'] = df[['Host', 'Protocol', 'Port']].apply(lambda x: '/'.join(map(str, x)), axis=1)
With either solution, you can then select just this column to discard all the others:
df = df[['HostsAffected']]
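The question mentions that the CSV has other columns, and selecting only the new column would discard those too. A minimal sketch of the alternative, dropping just the three source columns; the ID column here is a made-up stand-in for the other columns:

```python
import pandas as pd

# Made-up frame; ID stands in for the CSV's other columns.
df = pd.DataFrame({
    "ID": [1, 2],
    "Host": ["10.0.0.10", "10.0.0.11"],
    "Protocol": ["tcp", "tcp"],
    "Port": [445, 49707],
})

cols = ["Host", "Protocol", "Port"]
df["HostsAffected"] = df["Host"] + "/" + df["Protocol"] + "/" + df["Port"].map(str)

# Drop only the three source columns; every other column is kept.
df = df.drop(columns=cols)
print(df)
```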
Complete example
from io import StringIO
import pandas as pd
mystr = StringIO("""Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670""")
# replace mystr with 'file.csv'
df = pd.read_csv(mystr)
# combine columns
df['HostsAffected'] = df['Host'] + '/' + df['Protocol'] + '/' + df['Port'].map(str)
# include only new columns
df = df[['HostsAffected']]
Result:
print(df)
HostsAffected
0 10.0.0.10/tcp/445
1 10.0.0.10/tcp/445
2 10.0.0.10/tcp/445
3 10.0.0.10/tcp/445
4 10.0.0.10/tcp/445
5 10.0.0.10/tcp/445
6 10.0.0.10/tcp/445
7 10.0.0.10/tcp/49707
8 10.0.0.10/tcp/49672
9 10.0.0.10/tcp/49670
Here is how you can do it:
from io import StringIO
import pandas as pd

dt = """Host,Protocol,Port
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,445
10.0.0.10,tcp,49707
10.0.0.10,tcp,49672
10.0.0.10,tcp,49670"""
tdf = pd.read_csv(StringIO(dt))  # read the sample data from a string
tdf['HostsAffected'] = tdf.apply(lambda x: '{}/{}/{}'.format(x['Host'] , x['Protocol'] , x['Port']), axis=1)
tdf = tdf[['HostsAffected']]
tdf.to_csv(<path-to-save-csv-file>)
This will be the output:
HostsAffected
0 10.0.0.10/tcp/445
1 10.0.0.10/tcp/445
2 10.0.0.10/tcp/445
3 10.0.0.10/tcp/445
4 10.0.0.10/tcp/445
5 10.0.0.10/tcp/445
6 10.0.0.10/tcp/445
7 10.0.0.10/tcp/49707
8 10.0.0.10/tcp/49672
9 10.0.0.10/tcp/49670
If you want to read the CSV from a file, edit the read_csv line as follows:
tdf = pd.read_csv(<path-to-the-file>)
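The whole task from the question (merge several CSV files, combine the columns, drop the originals, save) could be sketched end to end as below. This is a sketch under assumptions, not the asker's method: it uses pd.concat in place of the manual line-by-line merge, and the file names, the temporary directories and the ID column are all invented for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files in a temporary directory (made-up sample data).
workdir = tempfile.mkdtemp()
sample = "ID,Host,Protocol,Port\n1,10.0.0.10,tcp,445\n2,10.0.0.10,tcp,49707\n"
for name in ("a.csv", "b.csv"):
    with open(os.path.join(workdir, name), "w") as f:
        f.write(sample)

# Read every CSV in the directory and stack them into one DataFrame.
files = sorted(glob.glob(os.path.join(workdir, "*.csv")))
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Combine the three columns, then drop the originals.
cols = ["Host", "Protocol", "Port"]
df["HostsAffected"] = df["Host"] + "/" + df["Protocol"] + "/" + df["Port"].map(str)
df = df.drop(columns=cols)

# Write to a separate directory so a re-run of the glob does not pick up
# the output file; index=False keeps the row index out of the CSV.
out_path = os.path.join(tempfile.mkdtemp(), "newoutput.csv")
df.to_csv(out_path, index=False)

result = pd.read_csv(out_path)
print(result)
```

Writing the output outside the directory being globbed avoids the merged file being read back in as an input on the next run.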