
Iterating through rows in a specific column to add rows to a new column in dataframe

I have a function, glue_table_cols(), that returns {table name: list of dicts of table column: column datatype} for conversion into a dataframe, e.g. 'dev_public_cataloguing_crawler_tbl_version_size_listings': [{'Name': 'item_name', 'Type': 'string'}]

I want to add a new 'Database' column by iterating through the 'Table' column, checking each row against the contents of another dict of {Database: list of tables} returned by get_tables(), and filling in the database that each table belongs to:

def table_pandaframe(gdict):
    columns = []
    tbls = get_tables()
    for tbl, col in gdict.items():
        df = pd.DataFrame(col)
        df['Table'] = tbl
        columns.append(df)
    gdf = pd.concat(columns)
    for i, row in gdf.iterrows():
        for d, t in tbls.items():
            if gdf[i, 'Table'] in tbls[d]:
                gdf.at[i, 'Database'] = d
    gdf = gdf.reset_index(drop=True)
    gdf = gdf[["Database", "Table", "Name", "Type"]]
    gdf.rename(columns={'Name':'Column','Type':'DataType'},inplace=True)

    return gdf

cat = glue_table_cols()
print(table_pandaframe(cat))

When I run it I get this error:

KeyError: (0, 'Table')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-62-ba874d9958ef> in <module>
     19 
     20 cat = glue_table_csv()
---> 21 print(table_pandaframe(cat))

<ipython-input-62-ba874d9958ef> in table_pandaframe(gdict)
      9     for i, row in gdf.iterrows():
     10         for d, t in tbls.items():
---> 11             if gdf[i, 'Table'] in tbls[d]:
     12                 gdf.at[i, 'Database'] = d

C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083 
   3084         if tolerance is not None:

KeyError: (0, 'Table')

So I tried replacing the if statement on line 11 with

if row['Table'] in tbls[d]:

but got this output:

                 Database                                   Table  \
0     workspace_optimiser               aims_db_migrationshistory   
1     workspace_optimiser               aims_db_migrationshistory   
2     workspace_optimiser                      aims_db_activities   
3     workspace_optimiser                      aims_db_activities   
4     workspace_optimiser                      aims_db_activities 

                 Column   DataType  
0            migrationid     string  
1            prodversion     string  
2            enddatetime  timestamp  
3             activityid        int  
4             costcodeid        int  

I end up with every row in the Database column filled with the last database in the get_tables dict. How do I make sure each row gets its specific database?

EDIT: MRE example

get_tables() = {'sandbox_redshift': ['dev_public_cataloguing_crawler_tbl_listings',   'dev_public_cataloguing_crawler_tbl_version_tagging_listings',  'dev_public_cataloguing_tbl_version_size_listing', 'dev_public_cataloguing_tbl_version_tagging_listing'], 'timesheetportal': [], 'w00-develop-processed-database': ['processed'], 'workspace_optimiser': ['bb_datalake_workspace_optimiser_output']}

glue_table_cols() = {'dev_public_cataloguing_tbl_version_tagging_listing': [{'Name': 'tag_value', 'Type': 'string'}, {'Name': 'tag_name', 'Type': 'string'}, {'Name': 'tag_type', 'Type': 'string'}, {'Name': 'name', 'Type': 'string'}, {'Name': 'version_id', 'Type': 'string'}],
 'processed': [{'Name': 'role_id', 'Type': 'bigint'}, {'Name': 'application_id', 'Type': 'bigint'}, {'Name': 'role_name', 'Type': 'string'}, {'Name': 'max_per_organisation', 'Type': 'double'}, {'Name': 'is_hidden_in_organisation', 'Type': 'boolean'}, {'Name': 'extracted_utc', 'Type': 'timestamp'}],
 'bb_datalake_workspace_optimiser_output': [{'Name': 'workspaceid', 'Type': 'string'}, {'Name': 'billable hours', 'Type': 'bigint'}, {'Name': 'usage threshold', 'Type': 'bigint'}, {'Name': 'change reported', 'Type': 'string'}, {'Name': 'bundle type', 'Type': 'string'}, {'Name': 'initial mode', 'Type': 'string'}, {'Name': 'new mode', 'Type': 'string'}, {'Name': 'username', 'Type': 'string'}, {'Name': 'connectedtime', 'Type': 'bigint'}]}

When using pd.concat you must pass ignore_index=True, otherwise the result has duplicate indexes, and use .loc for the lookup:

def table_pandaframe(gdict):
    columns = []
    tbls = get_tables()
    for tbl, col in gdict.items():
        df = pd.DataFrame(col)
        df['Table'] = tbl
        columns.append(df)
    gdf = pd.concat(columns, ignore_index=True)  # <- ignore_index
    for i, row in gdf.iterrows():
        for d, t in tbls.items():
            if gdf.loc[i, 'Table'] in tbls[d]:  # <- .loc
                gdf.at[i, 'Database'] = d
    gdf = gdf.reset_index(drop=True)
    gdf = gdf[["Database", "Table", "Name", "Type"]]
    gdf.rename(columns={'Name':'Column','Type':'DataType'},inplace=True)

    return gdf

cat = glue_table_cols()
print(table_pandaframe(cat))

Output:

                          Database                                              Table                     Column   DataType
0                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                  tag_value     string
1                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                   tag_name     string
2                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                   tag_type     string
3                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                       name     string
4                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                 version_id     string
5   w00-develop-processed-database                                          processed                    role_id     bigint
6   w00-develop-processed-database                                          processed             application_id     bigint
7   w00-develop-processed-database                                          processed                  role_name     string
8   w00-develop-processed-database                                          processed       max_per_organisation     double
9   w00-develop-processed-database                                          processed  is_hidden_in_organisation    boolean
10  w00-develop-processed-database                                          processed              extracted_utc  timestamp
11             workspace_optimiser             bb_datalake_workspace_optimiser_output                workspaceid     string
12             workspace_optimiser             bb_datalake_workspace_optimiser_output             billable hours     bigint
13             workspace_optimiser             bb_datalake_workspace_optimiser_output            usage threshold     bigint
14             workspace_optimiser             bb_datalake_workspace_optimiser_output            change reported     string
15             workspace_optimiser             bb_datalake_workspace_optimiser_output                bundle type     string
16             workspace_optimiser             bb_datalake_workspace_optimiser_output               initial mode     string
17             workspace_optimiser             bb_datalake_workspace_optimiser_output                   new mode     string
18             workspace_optimiser             bb_datalake_workspace_optimiser_output                   username     string
19             workspace_optimiser             bb_datalake_workspace_optimiser_output              connectedtime     bigint
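As an aside, once the index is unique you can skip the row loop entirely: invert get_tables() into a flat {table: database} lookup and map it onto the 'Table' column. A minimal sketch of that alternative, assuming the same gdf as in the function above:

# Invert {database: [tables]} into a flat {table: database} lookup,
# then fill the 'Database' column in one vectorized step.
table_to_db = {t: d for d, tables in get_tables().items() for t in tables}
gdf['Database'] = gdf['Table'].map(table_to_db)  # NaN where no database matches

Series.map leaves NaN for tables that belong to no database, which also makes missing mappings easy to spot.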

EDIT: without ignore_index=True

>>> gdf.loc[:, 'Table']
0    dev_public_cataloguing_tbl...  # <- dup 0
1    dev_public_cataloguing_tbl...
2    dev_public_cataloguing_tbl...
3    dev_public_cataloguing_tbl...
4    dev_public_cataloguing_tbl...
0                        processed  # <- dup 0
1                        processed
2                        processed
3                        processed
4                        processed
5                        processed
0    bb_datalake_workspace_opti...  # <- dup 0
1    bb_datalake_workspace_opti...
2    bb_datalake_workspace_opti...
3    bb_datalake_workspace_opti...
4    bb_datalake_workspace_opti...
5    bb_datalake_workspace_opti...
6    bb_datalake_workspace_opti...
7    bb_datalake_workspace_opti...
8    bb_datalake_workspace_opti...
Name: Table, dtype: object

For i = 0 and d = 'sandbox_redshift':

>>> gdf.loc[i, 'Table'] in tbls[d]
ValueError: The truth value of a Series is ambiguous...

>>> gdf.loc[i, 'Table']
0    dev_public_cataloguing_tbl...  # <- dup 0
0                        processed  # <- dup 0
0    bb_datalake_workspace_opti...  # <- dup 0
Name: Table, dtype: object

>>> tbls[d]
['dev_public_cataloguing_crawler_tbl_listings',
 'dev_public_cataloguing_crawler_tbl_version_tagging_listings',
 'dev_public_cataloguing_tbl_version_size_listing',
 'dev_public_cataloguing_tbl_version_tagging_listing']

So what you are trying to do is [] in []. The in statement cannot do that, but it can do scalar in []. That is why your index should be unique, so that gdf.loc[i, 'Table'] is a scalar.
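To see that in isolation, here is a minimal reproduction with a toy frame (the data is made up for illustration, not taken from the question):

import pandas as pd

# Two single-row frames both carry the default index 0, so the
# concatenated frame has a duplicate index.
df = pd.concat([pd.DataFrame({'Table': ['a']}), pd.DataFrame({'Table': ['b']})])
df.loc[0, 'Table']               # a Series: both rows share the label 0
# df.loc[0, 'Table'] in ['a']    # raises ValueError: truth value of a Series is ambiguous

df = df.reset_index(drop=True)   # or pd.concat(..., ignore_index=True)
df.loc[0, 'Table'] in ['a']      # True: a unique index gives back a scalar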
