How to merge or join a stacked dataframe in pandas?

Question

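I couldn't find an answer to this question elsewhere. I want to do a SQL-like join in pandas, with the slight twist that one dataframe is stacked. I created a dataframe A with a stacked column index from a csv file in pandas, as follows: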

|           |      | 2013-01-04 | 2013-01-07 |
|----------:|-----:|-----------:|-----------:|
| Adj Close |  OWW | NaN        | NaN        |
|   Close   | OXLC | 4.155157   | 4.147217   |
|           |  OXM | 40.318089  | 42.988800  |
|           |  OXY | 50.416079  | 62.934800  |

The original csv repeated the first-column entry for every row, as follows:

|           |      | 2013-01-04 | 2013-01-07 |
|----------:|-----:|-----------:|-----------:|
| Adj Close |  OWW | NaN        | NaN        |
|   Close   | OXLC | 4.155157   | 4.147217   |
|   Close   |  OXM | 40.318089  | 42.988800  |
|   Close   |  OXY | 50.416079  | 62.934800  |

The original csv was the transposed version of this. Pandas chose to stack it when converting it to a dataframe. (I used this code: pd.read_csv(file, header = [0,1], index_col=0).T)
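To make the setup reproducible, here is a minimal sketch of how such a stacked index can arise. The csv content is a hypothetical reconstruction of the untransposed file (two header rows, one row per date), built from the values shown in the tables above; the real file layout is an assumption.

```python
import io

import pandas as pd

# Hypothetical csv in the untransposed layout: the first header row holds the
# data type, the second the ticker, and each following row is one date.
raw = io.StringIO(
    ",Adj Close,Close,Close,Close\n"
    ",OWW,OXLC,OXM,OXY\n"
    "2013-01-04,,4.155157,40.318089,50.416079\n"
    "2013-01-07,,4.147217,42.988800,62.934800\n"
)

# Two header rows become a two-level column MultiIndex; transposing turns it
# into a two-level row index, which pandas displays "stacked".
A = pd.read_csv(raw, header=[0, 1], index_col=0).T

print(A)
print(A.index.nlevels)
```

The repeated "Close" label is still present in every row of the index; pandas only hides the repeats when printing.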

In another csv/dataframe, B, I have another ID for all of those so-called ticker symbols, which I would rather use: the CIK.

| CIK     | Ticker | Name                                           |
|---------|--------|------------------------------------------------|
| 1090872 | A      | Agilent Technologies Inc                       |
| 4281    | AA     | Alcoa Inc                                      |
| 1332552 | AAACU  | Asia Automotive Acquisition Corp               |
| 1287145 | AABB   | Asia Broadband Inc                             |
| 1024015 | AABC   | Access Anytime Bancorp Inc                     |
| 1099290 | AAC    | Sinocoking Coal & Coke Chemical Industries Inc |
| 1264707 | AACC   | Asset Acceptance Capital Corp                  |
| 849116  | AACE   | Ace Cash Express Inc                           |
| 1409430 | AAGC   | All American Gold Corp                         |
| 948846  | AAI    | Airtran Holdings Inc                           |

Desired output: a new dataframe that uses the CIK instead of the ticker but is otherwise identical to A.

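In SQL I could easily join on A.name_of_2nd_column = b.Ticker, because the table would simply repeat the first-column entry in every row (as in the original csv) and the column would have a name, but in pandas I can't. I tried this code: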

result = pd.merge(data, tix, how='left', left_on=[1], right_on=['Ticker'])

How do I tell pandas to use the second column as the key and/or to interpret the first column as repeated values?

Answer 1

What you want is to transcode one set of identifiers (the tickers) into another (the CIK, which I believe is the one used in the SEC Edgar database).

I would:

1. Convert the index levels into regular columns, especially since this is a MultiIndex, possibly after naming the index levels

A.index.names=('Data','Ticker')
A = A.reset_index()

2. Use the map method to transcode the tickers into CIKs

transco = B.set_index('Ticker').CIK
A['CIK'] = A.Ticker.map(transco)

3. Finally, re-index on the columns you want and drop the one that is no longer needed

A = A.drop('Ticker', axis=1).set_index(['Data','CIK'])

As a step 2.5, you may want to drop the entries that have no CIK, for example by doing:

A = A[A.CIK.notnull()]
A.CIK = A.CIK.astype(int)
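Put together, the steps above can be sketched end to end as follows. The frames are small stand-ins for A and B, and the CIK values (1001, 1002, 1003) and names are made-up placeholders, not real SEC identifiers; OWW is deliberately left out of B so that step 2.5 has something to drop.

```python
import numpy as np
import pandas as pd

# Stand-in for A: two index levels (data type, ticker), one column per date.
A = pd.DataFrame(
    {
        "2013-01-04": [np.nan, 4.155157, 40.318089, 50.416079],
        "2013-01-07": [np.nan, 4.147217, 42.988800, 62.934800],
    },
    index=pd.MultiIndex.from_tuples(
        [("Adj Close", "OWW"), ("Close", "OXLC"), ("Close", "OXM"), ("Close", "OXY")]
    ),
)

# Stand-in for B. The CIK values and names are placeholders for illustration.
B = pd.DataFrame(
    {
        "CIK": [1001, 1002, 1003],
        "Ticker": ["OXLC", "OXM", "OXY"],
        "Name": ["Placeholder One", "Placeholder Two", "Placeholder Three"],
    }
)

# Step 1: name the index levels and turn them into regular columns.
A.index.names = ["Data", "Ticker"]
flat = A.reset_index()

# Step 2: look up each ticker in a Ticker -> CIK Series.
transco = B.set_index("Ticker")["CIK"]
flat["CIK"] = flat["Ticker"].map(transco)

# Step 2.5: tickers missing from B map to NaN; drop them, then restore ints.
flat = flat[flat["CIK"].notnull()].copy()
flat["CIK"] = flat["CIK"].astype(int)

# Step 3: rebuild the two-level index with CIK in place of Ticker.
result = flat.drop("Ticker", axis=1).set_index(["Data", "CIK"])
print(result)
```

The `.copy()` after filtering is only there to avoid a chained-assignment warning when the CIK column is converted; it does not change the result.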

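You could also merge after doing reset_index(), but I would avoid that: you may end up with a needlessly large dataframe, because the merge result will also carry the Name column. That overhead grows if you have many different types of data (Adj Close, Close, etc.).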
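If a merge is still preferred, one way to keep the Name column out of the result is to merge against only the Ticker and CIK columns of B. This is a sketch under the same assumptions as before (index levels named Data and Ticker, placeholder CIK values), not something taken from the answer itself.

```python
import pandas as pd

# Small stand-in for A with named index levels.
A = pd.DataFrame(
    {"2013-01-04": [4.155157, 40.318089]},
    index=pd.MultiIndex.from_tuples(
        [("Close", "OXLC"), ("Close", "OXM")], names=["Data", "Ticker"]
    ),
)

# Stand-in for B; CIK values and names are placeholders for illustration.
B = pd.DataFrame(
    {
        "CIK": [1001, 1002],
        "Ticker": ["OXLC", "OXM"],
        "Name": ["Placeholder One", "Placeholder Two"],
    }
)

# Selecting only the key and the wanted column keeps Name out of the merge.
merged = A.reset_index().merge(B[["Ticker", "CIK"]], how="left", on="Ticker")

# Swap Ticker for CIK in the index, as in step 3.
result = merged.drop(columns="Ticker").set_index(["Data", "CIK"])
print(result)
```

With `how="left"`, tickers that are missing from B would still appear, with NaN in the CIK column, so the same cleanup as in step 2.5 would apply.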

Answer 2

I was eventually able to do it like this:

df = A
tix = B

ticker_2_CIK = dict(zip(tix.Ticker,tix.CIK))  # create a dict

tmp = df.reset_index().assign(CIK=lambda x: x['ticker'].map(ticker_2_CIK))  # use the dict to look up the CIK for each ticker (requires an index level named 'ticker')

# data was unclean, some ticker symbols were created after the period my data is from 
# and data was incomplete with some tickers missing
solution = tmp.dropna(subset=['CIK']).astype({'CIK':int})
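The snippet above relies on the index level that holds the symbols being named 'ticker', which the question does not show. A self-contained sketch with that assumption made explicit follows; the level names and the CIK value 1001 are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for A with the index levels named explicitly; 'ticker' is the
# name the lookup below expects.
df = pd.DataFrame(
    {"2013-01-04": [np.nan, 4.155157]},
    index=pd.MultiIndex.from_tuples(
        [("Adj Close", "OWW"), ("Close", "OXLC")], names=["data", "ticker"]
    ),
)

# Stand-in for B; the CIK value is a placeholder, and OWW is missing on purpose.
tix = pd.DataFrame({"CIK": [1001], "Ticker": ["OXLC"]})

# Plain dict from ticker to CIK.
ticker_2_CIK = dict(zip(tix.Ticker, tix.CIK))

# Flatten the index, then add a CIK column by looking up each ticker.
tmp = df.reset_index().assign(CIK=lambda x: x["ticker"].map(ticker_2_CIK))

# Tickers without a CIK map to NaN; drop those rows and restore integer CIKs.
solution = tmp.dropna(subset=["CIK"]).astype({"CIK": int})
print(solution)
```

Unlike Answer 1, this leaves the result with a flat integer index; calling set_index on the data-type and CIK columns afterwards would restore the two-level layout.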

Answer 1
1 2020-05-25 22:43:15

Answer 2
0 accepted 2020-06-03 22:54:44
