Iterating through rows in a specific column to add rows to a new column in dataframe
I have a function that converts `{table name: list of dicts of {'Name': column, 'Type': datatype}}` (the return value of `glue_table_cols()`) into a dataframe, e.g. `'dev_public_cataloguing_crawler_tbl_version_size_listings': [{'Name': 'item_name', 'Type': 'string'}]`. I want to add a new column `'Database'` by iterating through the `'Table'` column, checking each row against the contents of another dict of `{Database: list of tables}` returned by `get_tables()`, and filling the new column with the database each table belongs to:
```python
def table_pandaframe(gdict):
    columns = []
    tbls = get_tables()
    for tbl, col in gdict.items():
        df = pd.DataFrame(col)
        df['Table'] = tbl
        columns.append(df)
    gdf = pd.concat(columns)
    for i, row in gdf.iterrows():
        for d, t in tbls.items():
            if gdf[i, 'Table'] in tbls[d]:
                gdf.at[i, 'Database'] = d
    gdf = gdf.reset_index(drop=True)
    gdf = gdf[["Database", "Table", "Name", "Type"]]
    gdf.rename(columns={'Name': 'Column', 'Type': 'DataType'}, inplace=True)
    return gdf

cat = glue_table_cols()
print(table_pandaframe(cat))
```
When I run it I get this error:

```
KeyError: (0, 'Table')

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-62-ba874d9958ef> in <module>
     19
     20 cat = glue_table_csv()
---> 21 print(table_pandaframe(cat))

<ipython-input-62-ba874d9958ef> in table_pandaframe(gdict)
      9     for i, row in gdf.iterrows():
     10         for d, t in tbls.items():
---> 11             if gdf[i, 'Table'] in tbls[d]:
     12                 gdf.at[i, 'Database'] = d

C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   3022         if self.columns.nlevels > 1:
   3023             return self._getitem_multilevel(key)
-> 3024         indexer = self.columns.get_loc(key)
   3025         if is_integer(indexer):
   3026             indexer = [indexer]

C:\ProgramData\Miniconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   3080             return self._engine.get_loc(casted_key)
   3081         except KeyError as err:
-> 3082             raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: (0, 'Table')
```
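To illustrate what `KeyError: (0, 'Table')` means here: plain `df[...]` indexes *columns*, so the tuple `(i, 'Table')` is treated as a single column label that doesn't exist. A minimal sketch with toy data:

```python
import pandas as pd

df = pd.DataFrame({'Table': ['t1', 't2']})

# Plain indexing looks up column labels, so a (row, column) tuple
# is interpreted as one column name and fails:
try:
    df[0, 'Table']
except KeyError as err:
    print('KeyError:', err)   # KeyError: (0, 'Table')

# .at / .loc accept (row label, column label) pairs:
print(df.at[0, 'Table'])      # t1
```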
So I tried replacing the `if` statement on line 11 with:

```python
if row['Table'] in tbls[d]:
```

but the output is this:
```
              Database                      Table  \
0  workspace_optimiser  aims_db_migrationshistory
1  workspace_optimiser  aims_db_migrationshistory
2  workspace_optimiser         aims_db_activities
3  workspace_optimiser         aims_db_activities
4  workspace_optimiser         aims_db_activities

        Column   DataType
0  migrationid     string
1  prodversion     string
2  enddatetime  timestamp
3   activityid        int
4   costcodeid        int
```
I end up with every row of the Database column filled with the last database in the `get_tables` dict. How can I make sure each row gets its own specific database?
Edit: MRE example

```python
get_tables() = {'sandbox_redshift': ['dev_public_cataloguing_crawler_tbl_listings', 'dev_public_cataloguing_crawler_tbl_version_tagging_listings', 'dev_public_cataloguing_tbl_version_size_listing', 'dev_public_cataloguing_tbl_version_tagging_listing'], 'timesheetportal': [], 'w00-develop-processed-database': ['processed'], 'workspace_optimiser': ['bb_datalake_workspace_optimiser_output']}

glue_table_cols() = {'dev_public_cataloguing_tbl_version_tagging_listing': [{'Name': 'tag_value', 'Type': 'string'}, {'Name': 'tag_name', 'Type': 'string'}, {'Name': 'tag_type', 'Type': 'string'}, {'Name': 'name', 'Type': 'string'}, {'Name': 'version_id', 'Type': 'string'}],
'processed': [{'Name': 'role_id', 'Type': 'bigint'}, {'Name': 'application_id', 'Type': 'bigint'}, {'Name': 'role_name', 'Type': 'string'}, {'Name': 'max_per_organisation', 'Type': 'double'}, {'Name': 'is_hidden_in_organisation', 'Type': 'boolean'}, {'Name': 'extracted_utc', 'Type': 'timestamp'}],
'bb_datalake_workspace_optimiser_output': [{'Name': 'workspaceid', 'Type': 'string'}, {'Name': 'billable hours', 'Type': 'bigint'}, {'Name': 'usage threshold', 'Type': 'bigint'}, {'Name': 'change reported', 'Type': 'string'}, {'Name': 'bundle type', 'Type': 'string'}, {'Name': 'initial mode', 'Type': 'string'}, {'Name': 'new mode', 'Type': 'string'}, {'Name': 'username', 'Type': 'string'}, {'Name': 'connectedtime', 'Type': 'bigint'}]}
```
You have to ignore the index when using `pd.concat`, otherwise you end up with duplicate indexes, and you should use `.loc`:
```python
def table_pandaframe(gdict):
    columns = []
    tbls = get_tables()
    for tbl, col in gdict.items():
        df = pd.DataFrame(col)
        df['Table'] = tbl
        columns.append(df)
    gdf = pd.concat(columns, ignore_index=True)  # <- ignore_index
    for i, row in gdf.iterrows():
        for d, t in tbls.items():
            if gdf.loc[i, 'Table'] in tbls[d]:   # <- .loc
                gdf.at[i, 'Database'] = d
    gdf = gdf.reset_index(drop=True)
    gdf = gdf[["Database", "Table", "Name", "Type"]]
    gdf.rename(columns={'Name': 'Column', 'Type': 'DataType'}, inplace=True)
    return gdf

cat = glue_table_cols()
print(table_pandaframe(cat))
```
Output:

```
                          Database                                               Table                     Column   DataType
0                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                  tag_value     string
1                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                   tag_name     string
2                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                   tag_type     string
3                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                       name     string
4                 sandbox_redshift  dev_public_cataloguing_tbl_version_tagging_lis...                 version_id     string
5   w00-develop-processed-database                                           processed                    role_id     bigint
6   w00-develop-processed-database                                           processed             application_id     bigint
7   w00-develop-processed-database                                           processed                  role_name     string
8   w00-develop-processed-database                                           processed       max_per_organisation     double
9   w00-develop-processed-database                                           processed  is_hidden_in_organisation    boolean
10  w00-develop-processed-database                                           processed              extracted_utc  timestamp
11             workspace_optimiser              bb_datalake_workspace_optimiser_output                workspaceid     string
12             workspace_optimiser              bb_datalake_workspace_optimiser_output             billable hours     bigint
13             workspace_optimiser              bb_datalake_workspace_optimiser_output            usage threshold     bigint
14             workspace_optimiser              bb_datalake_workspace_optimiser_output            change reported     string
15             workspace_optimiser              bb_datalake_workspace_optimiser_output                bundle type     string
16             workspace_optimiser              bb_datalake_workspace_optimiser_output               initial mode     string
17             workspace_optimiser              bb_datalake_workspace_optimiser_output                   new mode     string
18             workspace_optimiser              bb_datalake_workspace_optimiser_output                   username     string
19             workspace_optimiser              bb_datalake_workspace_optimiser_output              connectedtime     bigint
```
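As an aside, the row-by-row scan can be avoided entirely: inverting the `{database: [tables]}` dict into a `{table: database}` lookup lets `Series.map` fill the column in one vectorized step. A sketch using trimmed stand-ins for the `get_tables()` / `glue_table_cols()` data from the MRE:

```python
import pandas as pd

# Trimmed stand-ins for get_tables() / glue_table_cols() from the MRE.
tbls = {
    'w00-develop-processed-database': ['processed'],
    'workspace_optimiser': ['bb_datalake_workspace_optimiser_output'],
}
gdict = {
    'processed': [{'Name': 'role_id', 'Type': 'bigint'}],
    'bb_datalake_workspace_optimiser_output': [{'Name': 'workspaceid', 'Type': 'string'}],
}

gdf = pd.concat(
    [pd.DataFrame(cols).assign(Table=tbl) for tbl, cols in gdict.items()],
    ignore_index=True,
)

# Invert {database: [tables]} into {table: database}, then map it
# onto the Table column in one pass.
table_to_db = {t: d for d, tables in tbls.items() for t in tables}
gdf['Database'] = gdf['Table'].map(table_to_db)
print(gdf[['Database', 'Table', 'Name', 'Type']])
```

Tables missing from the lookup come out as `NaN` instead of silently keeping a stale value, which also makes unmatched rows easy to spot.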
Edit: without `ignore_index=True`:

```python
>>> gdf['Table']
0    dev_public_cataloguing_tbl...    # <- dup 0
1    dev_public_cataloguing_tbl...
2    dev_public_cataloguing_tbl...
3    dev_public_cataloguing_tbl...
4    dev_public_cataloguing_tbl...
0    processed                        # <- dup 0
1    processed
2    processed
3    processed
4    processed
5    processed
0    bb_datalake_workspace_opti...    # <- dup 0
1    bb_datalake_workspace_opti...
2    bb_datalake_workspace_opti...
3    bb_datalake_workspace_opti...
4    bb_datalake_workspace_opti...
5    bb_datalake_workspace_opti...
6    bb_datalake_workspace_opti...
7    bb_datalake_workspace_opti...
8    bb_datalake_workspace_opti...
Name: Table, dtype: object
```
For `i = 0` and `d = 'sandbox_redshift'`:

```python
>>> gdf.loc[i, 'Table'] in tbls[d]
ValueError: The truth value of a Series is ambiguous...

>>> gdf.loc[i, 'Table']
0    dev_public_cataloguing_tbl...    # <- dup 0
0    processed                        # <- dup 0
0    bb_datalake_workspace_opti...    # <- dup 0
Name: Table, dtype: object

>>> tbls[d]
['dev_public_cataloguing_crawler_tbl_listings',
 'dev_public_cataloguing_crawler_tbl_version_tagging_listings',
 'dev_public_cataloguing_tbl_version_size_listing',
 'dev_public_cataloguing_tbl_version_tagging_listing']
```
So what you were effectively trying to do was `[] in []`. The `in` statement can't do that, but it can do `scalar in []`. That's why your index needs to be unique, so that `gdf.loc[i, 'Table']` returns a scalar.
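The difference can be reproduced with a two-row Series: with a duplicated index, `.loc[0]` returns a Series, and Python's `in` then needs a single truth value it can't produce; with a unique index the lookup is a scalar. A minimal sketch:

```python
import pandas as pd

# Duplicate index, as produced by pd.concat without ignore_index=True.
s = pd.Series(['t1', 't2'], index=[0, 0], name='Table')
row = s.loc[0]                    # a 2-element Series, not a scalar
try:
    row in ['t1', 't3']           # list membership compares element-wise
except ValueError as err:
    print(err)                    # The truth value of a Series is ambiguous...

# With a unique index the same lookup yields a scalar:
s2 = s.reset_index(drop=True)
print(s2.loc[0] in ['t1', 't3'])  # True
```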