
How do I join six lists of tuples into a pandas DataFrame using the first value of each tuple as the key?

I'm testing out a service whose API returns parsed 10-K corporate data. For each metric pulled (EBIT, cash, totalassets, etc.), I store the quarterly date and the metric in a tuple, and each tuple in a list. The result is six lists of 43 to 80 tuples each. I would like a DataFrame with columns for the corporate ticker, the date, and the metrics. How do I turn what I have (lists of tuples) into that?

The code below pulls the data (this is an example ticker, so there is no charge):

import numpy as np
import json
import pandas as pd

content = requests.get(r'https://eodhistoricaldata.com/api/fundamentals/AAPL.US?api_token=OeAFFmMliFG5orCUuwAKQ8l4WWFQ67YX')

ebit_list = []
date_list = []
totalassets_list = []
cash_list = []
totalCurrentAssets_list = []
totalCurrentLiabilities_list = []


for i in content.json()['Financials']['Income_Statement']['quarterly']:

    try:
        ebit_list.append((i, float(content.json()['Financials']['Income_Statement']['quarterly'][i]['ebit'])))
    except:
        pass

    try:
        date_list.append(i)
    except:
        pass

    try:
        totalassets_list.append((i, float(content.json()['Financials']['Balance_Sheet']['quarterly'][i]['totalAssets'])))
    except:
        pass



for i in content.json()['Financials']['Balance_Sheet']['quarterly']:
    #print(i, float(content.json()['Financials']['Balance_Sheet']['quarterly']['2019-12-28']['totalCurrentLiabilities']))
    try:
        cash_list.append((i, float(content.json()['Financials']['Balance_Sheet']['quarterly'][i]['cash'])))
    except:
        pass

    try:
        totalCurrentAssets_list.append((i, float(content.json()['Financials']['Balance_Sheet']['quarterly'][i]['totalCurrentAssets'])))
    except:
        pass

    try:
        totalCurrentLiabilities_list.append((i, float(content.json()['Financials']['Balance_Sheet']['quarterly'][i]['totalCurrentLiabilities'])))
    except:
        pass

I would like a DataFrame with all dates (meaning that if a metric is missing, a zero is filled in) and the following columns:

date , ebit , totalassets , cash , totalCurrentAssets , totalCurrentLiabilities

I'm not sure how to extract the tuples and the values inside each tuple, though.

You can use the map method of pandas.Series to match the dates with the data you need. It inserts NaN into cells that have no matching value, which makes missing data easier to deal with later. If you still want zeros, you can use fillna.

# Create a dataframe using date
df = pd.DataFrame({'date': date_list})

# To avoid the code getting messy in the next steps
stuff = {'ebit': ebit_list, 'totalassets': totalassets_list, 'cash': cash_list, 'totalCurrentAssets': totalCurrentAssets_list, 'totalCurrentLiabilities': totalCurrentLiabilities_list}

for name, values in stuff.items():
    value_dict = {t[0]: t[1] for t in values}   # t is each tuple in the list
    df[name] = df['date'].map(value_dict)       # map will match the correct date to the value 

# assuming you need the dataframe to be sorted by date
df['date'] = pd.to_datetime(df['date'])         # convert the date strings to real datetimes before sorting
df.sort_values('date', inplace=True, ignore_index=True)

# if you want to fill 0s to missing values
# df.fillna(0, inplace=True)

The ignore_index argument of sort_values resets the index to 0, 1, 2, ... after sorting, so the row labels are not jumbled. It was added in pandas 1.0; on older versions the call raises TypeError: sort_values() got an unexpected keyword argument 'ignore_index'. In that case, reset the index explicitly instead:

df.sort_values('date', inplace=True)
df.reset_index(drop=True, inplace=True)   # drop=True discards the old index instead of adding it as a column

In the end, this is the resulting df:

         date          ebit   totalassets          cash  totalCurrentAssets  totalCurrentLiabilities
0  2000-03-31           NaN  7.007000e+09           NaN                 NaN             1.853000e+09
1  2000-06-30           NaN  6.932000e+09           NaN                 NaN             1.873000e+09
2  2000-09-30           NaN  6.803000e+09           NaN                 NaN             1.933000e+09
3  2000-12-31  0.000000e+00  5.986000e+09           NaN                 NaN             1.637000e+09
4  2001-03-31  0.000000e+00  6.130000e+09           NaN                 NaN             1.795000e+09
..        ...           ...           ...           ...                 ...                      ...
75 2018-12-29  2.334600e+10  3.737190e+11  4.477100e+10        1.408280e+11             1.082830e+11
76 2019-03-30  1.341500e+10  3.419980e+11  3.798800e+10        1.233460e+11             9.377200e+10
77 2019-06-29  1.154400e+10  3.222390e+11  5.053000e+10        1.349730e+11             8.970400e+10
78 2019-09-28  1.562500e+10  3.385160e+11  4.884400e+10        1.628190e+11             1.057180e+11
79 2019-12-28  2.556900e+10  3.406180e+11  3.977100e+10        1.632310e+11             1.021610e+11
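
For readers who want to try the mapping approach without calling the API, here is a minimal sketch on made-up numbers. The dates and values are invented for illustration; only the list names mirror the question.

```python
import pandas as pd

# Made-up sample data in the same shape as the question's lists
date_list = ['2019-06-29', '2019-03-30', '2019-09-28']
ebit_list = [('2019-03-30', 13.4), ('2019-06-29', 11.5)]   # no ebit for 2019-09-28
cash_list = [('2019-09-28', 48.8), ('2019-03-30', 38.0), ('2019-06-29', 50.5)]

df = pd.DataFrame({'date': date_list})
for name, pairs in {'ebit': ebit_list, 'cash': cash_list}.items():
    # dict(pairs) turns [(date, value), ...] into {date: value, ...}
    df[name] = df['date'].map(dict(pairs))

df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date').reset_index(drop=True)

filled = df.fillna(0)   # optional: replace the missing ebit with 0
print(filled)
```

Keeping the NaN version (`df`) around alongside the zero-filled one (`filled`) makes it possible to tell a genuinely missing value from a reported zero later on.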

I can't get your example to work; requests is undefined (the import requests line is missing).

But here is some code that may do what you want:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd


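def create_df(list_of_lists):
    # The first element of each inner list becomes the column name,
    # the remaining elements become that column's values.
    return pd.DataFrame({x[0]: pd.Series(x[1:]) for x in list_of_lists})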
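
The function above expects each inner list to start with its column name. For the (date, value) tuples in the question, a sketch in the same spirit is to turn each list into a Series keyed by date and let pandas align them. The helper name `tuples_to_df` and the sample values are hypothetical, not part of the original answer.

```python
import pandas as pd

def tuples_to_df(metrics):
    """Build one DataFrame from {column_name: [(date, value), ...]}."""
    # Each Series is indexed by date; the DataFrame constructor aligns them
    # on the union of all dates and leaves NaN where a metric is missing.
    frame = pd.DataFrame({name: pd.Series(dict(pairs), dtype=float)
                          for name, pairs in metrics.items()})
    frame.index.name = 'date'
    return frame.sort_index()

# Made-up sample: ebit is missing for the first date
sample = {
    'ebit': [('2019-06-29', 11.5)],
    'cash': [('2019-03-30', 38.0), ('2019-06-29', 50.5)],
}
result = tuples_to_df(sample).fillna(0).reset_index()
print(result)
```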

We can actually simplify this code quite a bit to get you the result you want (and make it easier to adjust in the future!)

The finished code is here, with more detailed explanations below:

import numpy as np
import json
import pandas as pd
import requests

content = requests.get(r'https://eodhistoricaldata.com/api/fundamentals/AAPL.US?api_token=OeAFFmMliFG5orCUuwAKQ8l4WWFQ67YX')

income_data = content.json()['Financials']['Income_Statement']['quarterly']
income = pd.DataFrame.from_dict(income_data).transpose().set_index("date")
income = income[['ebit']]

balance_data = content.json()['Financials']['Balance_Sheet']['quarterly']
balance = pd.DataFrame.from_dict(balance_data).transpose().set_index("date")
balance = balance[['totalAssets', 'cash', 'totalCurrentAssets', 'totalCurrentLiabilities']]

financials = income.merge(balance, left_index = True, right_index = True).fillna(0)

The financials DataFrame will look like this (only showing data from 2005-2009):

| date       |      ebit |   totalAssets |       cash |   totalCurrentAssets |   totalCurrentLiabilities |
|:-----------|----------:|--------------:|-----------:|---------------------:|--------------------------:|
| 2009-12-26 | 4.758e+09 |    5.3926e+10 | 7.609e+09  |           3.3332e+10 |                1.3097e+10 |
| 2009-09-26 | 0         |    4.7501e+10 | 5.263e+09  |           3.1555e+10 |                1.1506e+10 |
| 2009-06-27 | 1.732e+09 |    4.814e+10  | 5.605e+09  |           3.517e+10  |                1.6661e+10 |
| 2009-03-31 | 0         |    4.3237e+10 | 4.466e+09  |           0          |                1.3751e+10 |
| 2008-12-31 | 0         |    4.2787e+10 | 7.236e+09  |           0          |                1.4757e+10 |
| 2008-09-30 | 0         |    3.9572e+10 | 1.1875e+10 |           0          |                1.4092e+10 |
| 2008-06-30 | 0         |    3.1709e+10 | 9.373e+09  |           0          |                9.218e+09  |
| 2008-03-31 | 0         |    3.0471e+10 | 9.07e+09   |           0          |                9.634e+09  |
| 2007-12-31 | 0         |    3.0039e+10 | 9.162e+09  |           0          |                1.0535e+10 |
| 2007-09-30 | 0         |    2.5347e+10 | 9.352e+09  |           0          |                9.299e+09  |
| 2007-06-30 | 0         |    2.1647e+10 | 7.118e+09  |           0          |                6.992e+09  |
| 2007-03-31 | 0         |    1.8711e+10 | 7.095e+09  |           0          |                5.485e+09  |
| 2006-12-31 | 0         |    1.9461e+10 | 7.159e+09  |           0          |                7.337e+09  |
| 2006-09-30 | 0         |    1.7205e+10 | 6.392e+09  |           0          |                6.471e+09  |
| 2006-06-30 | 0         |    1.5114e+10 | 0          |           0          |                5.023e+09  |
| 2006-03-31 | 0         |    1.3911e+10 | 0          |           0          |                4.456e+09  |
| 2005-12-31 | 0         |    1.4181e+10 | 0          |           0          |                5.06e+09   |
| 2005-09-30 | 0         |    1.1551e+10 | 3.491e+09  |           0          |                3.484e+09  |
| 2005-06-30 | 0         |    1.0488e+10 | 0          |           0          |                3.123e+09  |
| 2005-03-31 | 0         |    1.0111e+10 | 0          |           0          |                3.352e+09  |

The result of content.json()['Financials']['Income_Statement']['quarterly'] is a dictionary with each key being the date and each value being a second dictionary with the column data.

{'2005-03-31': {'date': '2005-03-31',
                'filing_date': None,
                'currency_symbol': 'USD',
                'researchDevelopment': '120000000.00',
                ...},
'2005-06-30': {...},
...}

Since this is the case, you can actually load that dictionary directly into a pandas dataframe by using

pd.DataFrame.from_dict(income_data).transpose().set_index("date")

The transpose is necessary because of the structure of the JSON. Pandas expects a dictionary formatted like {'column name': data}. Since the keys are dates, you will initially get a DataFrame where the rows are labeled "totalAssets", "cash", etc. and the columns are dates. The transpose() call flips the rows and columns so the data is in the format you need. The final .set_index("date") call uses the "date" field instead of the initial key date, for consistency and to name the index. It is completely optional.
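
To see the effect of the transpose on a small scale, here is a sketch using a made-up dictionary with the same nested shape as the API response. It also shows `orient='index'`, an option of `from_dict` that treats the outer keys as rows and so gives the same layout without a transpose.

```python
import pandas as pd

# Made-up data in the same nested shape as the API response
income_data = {
    '2005-03-31': {'date': '2005-03-31', 'ebit': '100.00'},
    '2005-06-30': {'date': '2005-06-30', 'ebit': '120.00'},
}

wide = pd.DataFrame.from_dict(income_data)      # columns are dates, rows are fields
tall = wide.transpose().set_index('date')       # rows are dates, columns are fields

# orient='index' treats the outer keys as rows, so no transpose is needed
same = pd.DataFrame.from_dict(income_data, orient='index').set_index('date')

print(list(wide.columns))
print(list(tall.columns))
```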

Now, this DataFrame will have every column from the JSON file, but you are only interested in a few. The code

income = income[['ebit']]

selects only the relevant columns from the data.
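
One thing to watch: in the excerpt above the values arrive as strings (for example '120000000.00') or None, so the selected columns are not numeric yet. A minimal sketch of converting them with `pd.to_numeric`, on made-up rows:

```python
import pandas as pd

# Made-up rows; values are strings or None, as in the excerpt above
income = pd.DataFrame(
    {'ebit': ['100.00', None, '120.50'], 'currency_symbol': ['USD', 'USD', 'USD']},
    index=pd.Index(['2005-03-31', '2005-06-30', '2005-09-30'], name='date'),
)

income = income[['ebit']]                       # keep only the columns of interest
# Convert every remaining column to numbers; unparseable entries become NaN
income = income.apply(pd.to_numeric, errors='coerce')
print(income.dtypes)
```

Doing the conversion before any arithmetic avoids surprises such as string concatenation when two columns are added together.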

Since you are pulling data from two different sources, you do need to create two different tables. This has an additional benefit that you can more clearly see which columns are being pulled in from the 'Income Statement' and which are from the 'Balance Sheet'.

The final line

financials = income.merge(balance, left_index = True, right_index = True).fillna(0)

merges the two tables together using their indexes (in this case, the "date" column). fillna(0) ensures that any missing data is replaced by a zero value, as you requested.
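
Note that merge defaults to how='inner', which keeps only the dates present in both tables. Since the question asks for all dates, passing how='outer' may be closer to what is wanted. A sketch on made-up data showing the difference:

```python
import pandas as pd

# Made-up tables that share only one date
income = pd.DataFrame({'ebit': [1.0, 2.0]},
                      index=pd.Index(['2019-06-29', '2019-09-28'], name='date'))
balance = pd.DataFrame({'cash': [5.0, 6.0]},
                       index=pd.Index(['2019-03-30', '2019-06-29'], name='date'))

# Default how='inner': only dates present in both tables survive
inner = income.merge(balance, left_index=True, right_index=True)

# how='outer': every date from either table is kept; gaps become NaN, then 0
outer = income.merge(balance, left_index=True, right_index=True, how='outer').fillna(0)
print(outer)
```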

If you end up needing to add another table, such as 'Cash_Flow', you would use the same lines of code to create the table and select the relevant columns, and add a second merge line:

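# Note: as written, this still reads 'Balance_Sheet' and selects balance-sheet columns;
# for real cash-flow data, use the 'Cash_Flow' key and column names from that table.
cashflow_data = content.json()['Financials']['Balance_Sheet']['quarterly']
cashflow = pd.DataFrame.from_dict(cashflow_data).transpose().set_index("date")
cashflow = cashflow[['accountsPayable', 'liabilitiesAndStockholdersEquity']]
...
# merge returns a new DataFrame, so assign the result back
financials = financials.merge(cashflow, left_index = True, right_index = True).fillna(0)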
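
If more tables get added, the repeated lines could be wrapped in a small helper. The sketch below is one possible arrangement, not part of the original answer: the function name `statement_frame` is hypothetical, and a made-up nested dictionary stands in for `content.json()['Financials']`. It uses `pd.concat(..., axis=1)`, which aligns the tables on the union of their dates.

```python
import pandas as pd

def statement_frame(financials, statement, columns):
    """Return the chosen columns of one quarterly statement, indexed by date."""
    data = financials[statement]['quarterly']
    frame = pd.DataFrame.from_dict(data, orient='index').set_index('date')
    # The API delivers strings, so convert the selected columns to numbers
    return frame[columns].apply(pd.to_numeric, errors='coerce')

# Made-up stand-in for content.json()['Financials']
financials_json = {
    'Income_Statement': {'quarterly': {
        '2019-06-29': {'date': '2019-06-29', 'ebit': '11.5'},
    }},
    'Balance_Sheet': {'quarterly': {
        '2019-03-30': {'date': '2019-03-30', 'cash': '38.0'},
        '2019-06-29': {'date': '2019-06-29', 'cash': '50.5'},
    }},
}

wanted = {'Income_Statement': ['ebit'], 'Balance_Sheet': ['cash']}
frames = [statement_frame(financials_json, name, cols) for name, cols in wanted.items()]
# concat along columns aligns on the union of dates (an outer join)
financials = pd.concat(frames, axis=1).fillna(0).sort_index()
print(financials)
```

Adding another statement then only means adding one entry to `wanted`.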

As a bonus tip, there is quite a lot of data in your source JSON! To see what columns are available to you in any given table, use the following:

cashflow.columns.sort_values()

to get an alphabetized list of the columns you can use:

      ['accountsPayable', 'accumulatedAmortization', 'accumulatedDepreciation',
       'accumulatedOtherComprehensiveIncome', 'additionalPaidInCapital',
       'capitalLeaseObligations', 'capitalSurpluse', 'cash',
       'cashAndShortTermInvestments', 'commonStock',
       'commonStockSharesOutstanding', 'commonStockTotalEquity',
       'currency_symbol', 'deferredLongTermAssetCharges',
       'deferredLongTermLiab', 'filing_date', 'goodWill', 'intangibleAssets',
       'inventory', 'liabilitiesAndStockholdersEquity', 'longTermDebt',
       'longTermDebtTotal', 'longTermInvestments', 'negativeGoodwill',
       'netReceivables', 'netTangibleAssets', 'nonCurrentAssetsTotal',
       'nonCurrentLiabilitiesOther', 'nonCurrentLiabilitiesTotal',
       'nonCurrrentAssetsOther', 'noncontrollingInterestInConsolidatedEntity',
       'otherAssets', 'otherCurrentAssets', 'otherCurrentLiab', 'otherLiab',
       'otherStockholderEquity', 'preferredStockRedeemable',
       'preferredStockTotalEquity', 'propertyPlantAndEquipmentGross',
       'propertyPlantEquipment', 'retainedEarnings',
       'retainedEarningsTotalEquity', 'shortLongTermDebt', 'shortTermDebt',
       'shortTermInvestments',
       'temporaryEquityRedeemableNoncontrollingInterests', 'totalAssets',
       'totalCurrentAssets', 'totalCurrentLiabilities', 'totalLiab',
       'totalPermanentEquity', 'totalStockholderEquity', 'treasuryStock',
       'warrants']

This is also extremely helpful when there is a misspelling in the data, such as in "capitalSurpluse" above.
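
When a column you expect seems to be missing, a fuzzy match against the available names can surface the actual spelling. This is a small sketch using the standard library's difflib; the helper name `find_column` and the cutoff value are assumptions chosen for illustration.

```python
import difflib

# A few column names copied from the list above
columns = ['accountsPayable', 'capitalLeaseObligations', 'capitalSurpluse',
           'cash', 'commonStock', 'totalAssets']

def find_column(expected, available, cutoff=0.8):
    """Return the available names most similar to the spelling you expected."""
    return difflib.get_close_matches(expected, available, n=3, cutoff=cutoff)

matches = find_column('capitalSurplus', columns)
print(matches)
```

With a real table, `list(cashflow.columns)` could be passed as the `available` argument.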

