I cannot find this question answered elsewhere; I would like to do a SQL-like join in pandas but with the slight twist that one dataframe is stacked. I have created a dataframe A with a stacked column index from a csv file in pandas that looks as follows:
| | | 2013-01-04 | 2013-01-07 |
|----------:|-----:|-----------:|-----------:|
| Adj Close | OWW | NaN | NaN |
| Close | OXLC | 4.155157 | 4.147217 |
| | OXM | 40.318089 | 42.988800 |
| | OXY | 50.416079 | 62.934800 |
The original csv had repeated what is in the 1st column for every entry like so:
| | | 2013-01-04 | 2013-01-07 |
|----------:|-----:|-----------:|-----------:|
| Adj Close | OWW | NaN | NaN |
| Close | OXLC | 4.155157 | 4.147217 |
| Close | OXM | 40.318089 | 42.988800 |
| Close | OXY | 50.416079 | 62.934800 |
The original csv was the transposed version of this. Pandas chose to stack that when converting to dataframe. (I used this code: pd.read_csv(file, header = [0,1], index_col=0).T)
In another csv/dataframe B, I have, for each of those so-called ticker symbols, another ID that I would rather use: the CIK.
| CIK | Ticker | Name |
|---------|--------|------------------------------------------------|
| 1090872 | A | Agilent Technologies Inc |
| 4281 | AA | Alcoa Inc |
| 1332552 | AAACU | Asia Automotive Acquisition Corp |
| 1287145 | AABB | Asia Broadband Inc |
| 1024015 | AABC | Access Anytime Bancorp Inc |
| 1099290 | AAC | Sinocoking Coal & Coke Chemical Industries Inc |
| 1264707 | AACC | Asset Acceptance Capital Corp |
| 849116 | AACE | Ace Cash Express Inc |
| 1409430 | AAGC | All American Gold Corp |
| 948846 | AAI | Airtran Holdings Inc |
Desired output: I would like to have the CIK instead of the ticker in a new dataframe otherwise identical to A.
In SQL I could easily join on A.name_of_2nd_column = B.Ticker, since the table would just have the entry in the 1st column repeated on every line (like the original csv) and the column would have a name. In pandas I cannot. I tried this code:
result = pd.merge(data, tix, how='left', left_on=[1], right_on=['Ticker'])
How do I tell pandas to use the 2nd column as the key and/or interpret the first column just as repeated values?
What you want is to transcode from one set of identifiers (tickers) to another (CIKs used in the SEC Edgar database I presume).
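I would do the following:
# name the two index levels so they become named columns after reset_index()
A.index.names=('Data','Ticker')
A = A.reset_index()
# Series indexed by ticker, holding the matching CIK
transco = B.set_index('Ticker').CIK
# look up the CIK for every ticker; unknown tickers become NaN
A['CIK'] = A.Ticker.map(transco)
# swap the ticker for the CIK in the index
A = A.drop('Ticker', axis=1).set_index(['Data','CIK'])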
As an extra step before the final line (while CIK is still an ordinary column), you might want to remove the entries for which you don't have any CIK, e.g. by doing:
A = A[A.CIK.notnull()]
A.CIK = A.CIK.astype(int)
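Put together, the steps above can be sketched end to end on toy data. This is only an illustration under assumptions: the tickers and CIKs are three rows taken from the question's table B, while the prices and the ticker "ZZZZ" are made up to show what happens to a ticker with no CIK.

```python
import pandas as pd

# Toy stand-in for A: rows are (data type, ticker), columns are dates.
# All prices and the ticker "ZZZZ" are invented for illustration.
idx = pd.MultiIndex.from_tuples(
    [("Close", "A"), ("Close", "AA"), ("Close", "ZZZZ"), ("Adj Close", "A")]
)
A = pd.DataFrame(
    [[41.0, 41.5], [9.1, 9.0], [1.0, 1.1], [40.2, 40.7]],
    index=idx,
    columns=["2013-01-04", "2013-01-07"],
)

# Toy stand-in for B, using three rows from the question's table.
B = pd.DataFrame(
    {
        "CIK": [1090872, 4281, 1287145],
        "Ticker": ["A", "AA", "AABB"],
        "Name": ["Agilent Technologies Inc", "Alcoa Inc", "Asia Broadband Inc"],
    }
)

# Name the index levels, then turn them into ordinary columns.
A.index.names = ["Data", "Ticker"]
flat = A.reset_index()

# Ticker -> CIK lookup; tickers missing from B map to NaN.
transco = B.set_index("Ticker")["CIK"]
flat["CIK"] = flat["Ticker"].map(transco)

# Drop rows without a CIK, then restore the integer dtype
# (the NaN values forced the column to float).
flat = flat[flat["CIK"].notnull()].copy()
flat["CIK"] = flat["CIK"].astype(int)

# Rebuild the two-level index with CIK in place of Ticker.
result = flat.drop("Ticker", axis=1).set_index(["Data", "CIK"])
print(result)
```

The `.copy()` after filtering is there so that the later assignment to `flat["CIK"]` works on an independent frame rather than on a slice of the original.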
You could also merge after doing a reset_index(), but I would avoid that, as you might end up with uselessly large dataframes: the result of the merge will also carry the Name column, repeated on every row. This can grow if you have many different types of data (Adj Close, Close, etc.).
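If you do prefer the merge route, one possible way to limit the size concern is to pass only the Ticker and CIK columns of B into the merge, so Name never enters the result. The sketch below is an assumption-laden illustration, not a recommendation over the mapping approach: the prices and the ticker "ZZZZ" are invented, and the B rows are three entries from the question's table.

```python
import pandas as pd

# Toy stand-in for A (invented prices; "ZZZZ" is a made-up ticker with no CIK).
idx = pd.MultiIndex.from_tuples(
    [("Close", "A"), ("Close", "AA"), ("Close", "ZZZZ"), ("Adj Close", "A")]
)
A = pd.DataFrame(
    [[41.0, 41.5], [9.1, 9.0], [1.0, 1.1], [40.2, 40.7]],
    index=idx,
    columns=["2013-01-04", "2013-01-07"],
)

# Toy stand-in for B, using three rows from the question's table.
B = pd.DataFrame(
    {
        "CIK": [1090872, 4281, 1287145],
        "Ticker": ["A", "AA", "AABB"],
        "Name": ["Agilent Technologies Inc", "Alcoa Inc", "Asia Broadband Inc"],
    }
)

# Name the index levels and flatten them into columns.
flat = A.rename_axis(index=["Data", "Ticker"]).reset_index()

# Merge against only the two columns that are needed; an inner join
# also drops tickers that have no CIK, so CIK stays an integer column.
merged = flat.merge(B[["Ticker", "CIK"]], on="Ticker", how="inner")

# Replace Ticker with CIK in the index.
result = merged.drop(columns="Ticker").set_index(["Data", "CIK"])
print(result)
```

Compared with `map`, the inner join handles the "no CIK" rows implicitly, at the cost of building an intermediate merged frame.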
I was eventually able to do it the following way:
df = A
tix = B
ticker_2_CIK = dict(zip(tix.Ticker,tix.CIK)) # create a dict mapping ticker -> CIK
# use the dict to look up the CIK for each row; this assumes the ticker index level is named 'ticker'
tmp = df.reset_index().assign(CIK=lambda x: x['ticker'].map(ticker_2_CIK))
# data was unclean, some ticker symbols were created after the period my data is from
# and data was incomplete with some tickers missing
solution = tmp.dropna(subset=['CIK']).astype({'CIK':int})