How to write cleaner, more performant code with Pandas when reading a CSV

Question

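I am working with a CSV data sheet and want to parse and filter the data. While working on the code, I found a similar question in an SO post. Its author has almost the same HPE hardware data as I do, although some of my data and columns differ.

Sample data: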

Status  Server  Server Name Bay #   Model   Processor   Proc. Count Memory  Serial Number   State   Power State iLO FW  Firmware    Appliance Name
Critical    enc2010, bay 1  tdm2066.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   262144  2M272101N9  Unmanaged   On  2.53 May 03 2017    I36 v2.40 (02/17/2017)  OV C7000 enclosures 1
OK  enc1011, bay 1  tdm1068.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   262144  2M272101P6  Monitored   On  2.55 Aug 16 2017    I36 v2.74 (07/21/2019)  OV C7000 enclosures 1
OK  enc1012, bay 1  tdm1083.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   262144  2M272101NX  Monitored   On  2.61 Jul 27 2018    I36 v2.60 (05/21/2018)  OV C7000 enclosures 1
OK  ENC2004, bay 1  tdm2033.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz   2   524288  2M262602L2  Monitored   On  2.55 Aug 16 2017    I36 v2.52 (10/25/2017)  OV C7000 enclosures 1
OK  ENC2006, bay 1  vds2009 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   524288  2M263604ZZ  Monitored   On  2.40 Dec 02 2015    I36 v2.20 (05/05/2016)  OV C7000 enclosures 1
OK  ENC2011, bay 1  tdm2081.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   524288  2M2708027Z  Monitored   On  2.55 Aug 16 2017    I36 v2.52 (10/25/2017)  OV C7000 enclosures 1
OK  ENC1003, bay 1  tdm1024.example.com 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz   2   524288  2M262602KW  Monitored   On  2.73 Feb 11 2020    I36 v2.52 (10/25/2017)  OV C7000 enclosures 1
OK  ENC1006, bay 1  vds1009 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz   2   524288  2M262505V5  Monitored   On  2.40 Dec 02 2015    I36 v2.00 (12/28/2015)  OV C7000 enclosures 1
OK  ENC1007, bay 1  vds1023 1   ProLiant BL460c Gen9    Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz   2   524288  2M264800TR  Monitored   On  2.50 Sep 23 2016    I36 v2.30 (09/12/2016)  OV C7000 enclosures 1

DataFrame code:

import pandas as pd

df = pd.read_csv("testcreate.csv", sep="\t")
df = df[['Server', 'Server Name', 'Bay #', 'Appliance Name']]
# "enc2010, bay 1" -> Bay is the part after the comma, Enclosure the part before it
df['Bay'] = df['Server'].str.split(',').str[1].str.lower()
df['Enclosure'] = df['Server'].str.split(',').str[0].str.upper()
# Keep only the short host name, e.g. "tdm2066.example.com" -> "tdm2066"
df['Server Name'] = df['Server Name'].str.split('.').str[0]
df = df.drop(['Server', 'Bay #'], axis=1)
df = df[df['Appliance Name'].str.contains('C7000')]

DataFrame output:

   Server Name         Appliance Name     Bay Enclosure
0      tdm2066  OV C7000 enclosures 1   bay 1   ENC2010
1      tdm1068  OV C7000 enclosures 1   bay 1   ENC1011
2      tdm1083  OV C7000 enclosures 1   bay 1   ENC1012
3      tdm2033  OV C7000 enclosures 1   bay 1   ENC2004
4      vds2009  OV C7000 enclosures 1   bay 1   ENC2006
5      tdm2081  OV C7000 enclosures 1   bay 1   ENC2011
6      tdm1024  OV C7000 enclosures 1   bay 1   ENC1003
7      vds1009  OV C7000 enclosures 1   bay 1   ENC1006
8      vds1023  OV C7000 enclosures 1   bay 1   ENC1007
9      vds0003  OV C7000 enclosures 1   bay 1   ENT0003
10     tdm7123  OV C7000 enclosures 1   bay 1   ENC7003
11     tdm2231  OV C7000 enclosures 1   bay 1   ENC2022
12     tdm2186  OV C7000 enclosures 1   bay 1   ENC2018
13     tdm1098  OV C7000 enclosures 1   bay 1   ENC1013
14     tdm1158  OV C7000 enclosures 1   bay 1   ENC1017
15     tdm2096  OV C7000 enclosures 1   bay 1   ENC2012
16     tdm1012  OV C7000 enclosures 1   bay 1   ENC1002
17     tdm1062  OV C7000 enclosures 1   bay 1   ENC1009
18     vds1041  OV C7000 enclosures 1   bay 1   ENC1010
19     vds1001  OV C7000 enclosures 1   bay 1   ENC1005
20     vds7025  OV C7000 enclosures 1   bay 1   ENC7009
21     vds2023  OV C7000 enclosures 1   bay 1   ENC2007
22     tdm7068  OV C7000 enclosures 1   bay 1   ENC7005
23     vds7006  OV C7000 enclosures 1   bay 1   ENC7006
24     tdm2126  OV C7000 enclosures 1   bay 1   ENC2014
25     vds2001  OV C7000 enclosures 1   bay 1   ENC2005
26     tdm1173  OV C7000 enclosures 1   bay 1   ENC1018
27     tdm1250  OV C7000 enclosures 1   bay 1   ENC1025

What I tried:

I borrowed the line below from the SO post mentioned above, but I do not fully understand it. I also do not want the Server Name part added to the column names, so each column should simply be named after its enclosure, like ENC1002 and so on.

df1 = pd.concat( [g.set_index('Bay').add_suffix(f'_{n}') for n, g in df.groupby('Enclosure')], axis=1, sort=False).filter( like='Server Name').dropna(how='all', axis=1)

print(df1)

Result:

Expected:

        ENC1002 ENC1003 ENC1005
bay 1   tdm1012 tdm1024 vds1001

Edit:

I got the desired solution from @Scott:

df = pd.concat([g.set_index('Bay')['Server Name'].rename(f'{n}') for n, g in df.groupby('Enclosure')],  axis=1, sort=False)
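
Since the one-liner packs several steps together, here is a minimal sketch that unpacks it into a plain loop. It assumes a tiny stand-in frame built from three rows of the sample output rather than the real CSV, and the names columns and wide are illustrative only.

```python
import pandas as pd

# Tiny stand-in for the filtered frame (values copied from the sample output).
df = pd.DataFrame({
    "Server Name": ["tdm1012", "tdm1024", "vds1001"],
    "Bay": ["bay 1", "bay 1", "bay 1"],
    "Enclosure": ["ENC1002", "ENC1003", "ENC1005"],
})

columns = []
for name, group in df.groupby("Enclosure"):
    # One Series per enclosure: indexed by bay, named after the enclosure.
    series = group.set_index("Bay")["Server Name"].rename(name)
    columns.append(series)

# axis=1 places the Series side by side, aligning rows on the shared Bay index.
wide = pd.concat(columns, axis=1, sort=False)
print(wide)
```

Each group contributes one column, so no suffix is needed: the enclosure name itself becomes the column label.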

My code, shown under "DataFrame code" above, may be a bit messy. Is there a better way to write it? I am asking here to get suggestions for cleaner code.

Answer 1

I hope I understood the question correctly.

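import pandas as pd

# Read the whole CSV into `text`
with open('testcreate.csv', 'r') as fh:
    text = fh.read()

enc = dict()

# Skip the header row, then split each line on whitespace
for line in text.splitlines()[1:]:
    # e.g. "OK", "enc1011,", "bay", "1", "tdm1068.example.com", ...
    status, enclosure, bay, bay_no, vds, *na = line.split()

    enclosure = enclosure.replace(',', '').upper()
    vds = vds.lower().split('.')[0]

    # Keep the first server seen for each (enclosure, bay number) pair
    if enclosure not in enc:
        enc[enclosure] = dict()
    if bay_no not in enc[enclosure]:
        enc[enclosure][bay_no] = vds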
    
>>> df = pd.DataFrame.from_dict(enc)
>>> df 

    ENC2010 ENC1011 ENC1012 ENC2004 ENC2006 ENC2011 ENC1003 ENC1006 ENC1007 ENT0003 ... ENC1010 ENC1005 ENC7009 ENC2007 ENC7005 ENC7006 ENC2014 ENC2005 ENC1018 ENC1025
1   tdm2066 tdm1068 tdm1083 tdm2033 vds2009 tdm2081 tdm1024 vds1009 vds1023 vds0003 ... vds1041 vds1001 vds7025 vds2023 tdm7068 vds7006 tdm2126 vds2001 tdm1173 tdm1250

>>> df.T
    1
ENC2010 tdm2066
ENC1011 tdm1068
ENC1012 tdm1083
ENC2004 tdm2033
ENC2006 vds2009
ENC2011 tdm2081
...

Answer 2

If I understand correctly, you want to pivot your table:

df1 = pd.pivot(df, values='Server Name', index='Bay', columns='Enclosure')


Enclosure  ENC1003  ENC1006  ENC1007  ENC1011  ENC1012  ENC2004  ENC2006  ENC2010  ENC2011  ENT0003
Bay                                                                                                
bay 1      tdm1024  vds1009  vds1023  tdm1068  tdm1083  tdm2033  vds2009  tdm2066  tdm2081  vds0003
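
As a self-contained illustration of the pivot, the sketch below uses a small hand-built frame instead of the CSV. The "bay 2" row and the server name tdm9999 are made up for the example, to show that missing (Bay, Enclosure) combinations come out as NaN. It also shows that pivot raises a ValueError when the same pair appears twice, and that pivot_table with aggfunc="first" is one possible fallback in that case.

```python
import pandas as pd

# Hand-built stand-in data; "tdm9999" in "bay 2" is a hypothetical row.
df = pd.DataFrame({
    "Server Name": ["tdm1024", "vds1009", "tdm9999"],
    "Bay": ["bay 1", "bay 1", "bay 2"],
    "Enclosure": ["ENC1003", "ENC1006", "ENC1003"],
})

# Rows become bays, columns become enclosures, cells hold the server name.
wide = df.pivot(index="Bay", columns="Enclosure", values="Server Name")
print(wide)

# pivot refuses duplicate (Bay, Enclosure) pairs.
dupes = pd.concat([df, df.iloc[[0]]], ignore_index=True)
raised = False
try:
    dupes.pivot(index="Bay", columns="Enclosure", values="Server Name")
except ValueError:
    raised = True

# pivot_table aggregates duplicates instead; "first" keeps the first value seen.
first = dupes.pivot_table(index="Bay", columns="Enclosure",
                          values="Server Name", aggfunc="first")
```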

Answer 3

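After reading your post and code blocks, I do not see much scope for improvement in the operations themselves. I would change the code as shown below, which gives the desired result. I have also seen similar posts on Code Review.

1- Select only the columns you need when reading. This reduces the processing load and gives you more flexibility. You can use usecols for this.

2- You can use df.assign with a dict built by zipping keys and values from two collections. Splitting Server on the comma yields two separate columns, so you can do the split, rename and drop operations in one go.

It would look like the following, which should work.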

import pandas as pd

##### Optional display settings, in case you want to view wide frames on screen ####
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.expand_frame_repr', True)
##################### END OF THE Display Settings ###################

# usecols filters the columns while the file is being parsed, so there is
# no need to delete columns after reading.
df = pd.read_csv("testcreate.csv",
                 sep="\t",  # the file in the question is tab-separated
                 usecols=['Server',
                          'Server Name',
                          'Appliance Name'
                          ]
                 )

# Split "ENC1003, bay 1" on the comma. expand=True returns a DataFrame whose
# columns 0 and 1 hold the two parts (iterating over the .str accessor
# directly does not work on recent pandas versions).
parts = df['Server'].str.split(',', expand=True)
df1 = df.assign(**dict(zip('xy', (parts[0], parts[1].str.strip())))
                ).rename(columns={'x': 'Enclosure',
                                  'y': 'Bay'
                                  }
                         ).drop(['Server'], axis=1)

# Keep only the C7000 rows; .copy() avoids SettingWithCopyWarning below.
df1 = df1[df1['Appliance Name'].str.contains('C7000')].copy()
df1['Server Name'] = df1['Server Name'].str.split('.').str[0].str.lower()

df1['Enclosure'] = df1['Enclosure'].str.upper()

df1 = pd.pivot(df1,
               values='Server Name',
               index='Bay',
               columns='Enclosure'
               ).rename_axis(None)

df1.to_csv("YourCsvFileName.csv")
# print(df1)
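
For comparison, the same read, split, filter and pivot steps can be sketched as a single method chain that runs without any file on disk. This is only an illustration under assumptions: the inline raw text stands in for testcreate.csv, its last row (ENC9999, abc0001, "Other appliance") is made up to exercise the filter, and the names raw, cols, parts and wide are not from the original post.

```python
import io

import pandas as pd

# Inline tab-separated stand-in for the CSV; the last row is hypothetical.
raw = (
    "Status\tServer\tServer Name\tAppliance Name\n"
    "OK\tENC1003, bay 1\ttdm1024.example.com\tOV C7000 enclosures 1\n"
    "OK\tenc1011, bay 1\ttdm1068.example.com\tOV C7000 enclosures 1\n"
    "OK\tENC9999, bay 1\tabc0001.example.com\tOther appliance\n"
)

# Only the listed columns are kept while parsing.
cols = ["Server", "Server Name", "Appliance Name"]
df = pd.read_csv(io.StringIO(raw), sep="\t", usecols=cols)

# "ENC1003, bay 1" -> column 0 is the enclosure, column 1 the bay.
parts = df["Server"].str.split(",", expand=True)

wide = (
    df.assign(
        Enclosure=parts[0].str.strip().str.upper(),
        Bay=parts[1].str.strip().str.lower(),
        # Column names with spaces have to be passed through a dict.
        **{"Server Name": df["Server Name"].str.split(".").str[0].str.lower()},
    )
    # Plain substring match, so regex=False.
    .loc[lambda d: d["Appliance Name"].str.contains("C7000", regex=False)]
    .pivot(index="Bay", columns="Enclosure", values="Server Name")
)
print(wide)
```

Because every step returns a new frame, the chain never assigns into a filtered slice, which sidesteps the SettingWithCopyWarning that step-by-step column assignment can trigger.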

3 answers

Answer 1: 1, 2021-06-07 08:35:21

Answer 2: 1, 2021-06-10 08:37:42

Answer 3: 1, accepted, 2021-06-10 15:51:32
