I have been provided a very large dictionary with the following format that I am unsure how to convert to a dataframe that I can use to perform basic functions on.
{
'hash': {
'ids': [List of Unique IDs of records this hash has been seen in],
'weights': [List of weights],
'values': [List of values],
'measure_dates': [List of dates]
}
}
The number of items in ids
, weights
, values
and measure_dates
is the same within a hash
. Different hash
es can have a different number of items though. It depends on how often a measurement is taken.
Real(ish) data for an example of three records:
{
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
The value in an index corresponds to the same measurement being taken. For example in IRR-881-CKZ
, there are 3 measurements.
24661
, weight 0.05
, and value 3.254
24662
, weight 0.07
and value 4.500001
29431
, weight 0.105
and value 7.3221
No other combination of indexes is valid for this hash.
Information that I'm going to be attempting to get data on:
ids
list. In this example, the first and third record have three items so would be the top results. I'd love to be able to use something like nlargest()
or sort_values().head()
to get this, instead of parsing each record and counting the number of items.df['average'] = df[['value1', 'value2']].mean(axis=1)
, but with a variable number of values I'm not sure how to do this.How can I convert this dictionary of dictionaries of lists to a usable dataframe?
You can use .from_dict()
in pandas to convert it to a dataframe.
import pandas as pd
# dictionary of dictionaries with list values
data = {
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
# convert to data frame
df = pd.DataFrame.from_dict(data, orient='index')
You'll need to convert each entry of this dictionary into its own DataFrame and concatenate those to effectively work with this data:
import pandas as pd
data = {
'IRR-99876-UTY': {
'ids': [9912234, 9912237, 45555889],
'weights': [0.09, 0.09, 0.113],
'values': [2.31220, 2.31219, 2.73944],
'measure_dates': ['2021-10-14', '2021-10-15', '2022-12-17']
},
'IRR-10881-CKZ': {
'ids': [45557231],
'weights': [0.31],
'values': [5.221001],
'measure_dates': ['2022-12-31']
},
'IRR-881-CKZ': {
'ids': [24661, 24662, 29431],
'weights': [0.05, 0.07, 0.105],
'values': [3.254, 4.500001, 7.3221],
'measure_dates': ['2018-05-05', '2018-05-06', '2018-07-01']
}
}
df = pd.concat(
{k: pd.DataFrame(v) for k, v in data.items()},
names=['hash', 'obs']
)
print(df)
ids weights values measure_dates
hash obs
IRR-99876-UTY 0 9912234 0.090 2.312200 2021-10-14
1 9912237 0.090 2.312190 2021-10-15
2 45555889 0.113 2.739440 2022-12-17
IRR-10881-CKZ 0 45557231 0.310 5.221001 2022-12-31
IRR-881-CKZ 0 24661 0.050 3.254000 2018-05-05
1 24662 0.070 4.500001 2018-05-06
2 29431 0.105 7.322100 2018-07-01
Now that our data is cleaned up we can solve your questions.
Series.value_counts
operation. However since the data we're interested in is currently in the index we'll need to grab it out using Index.get_level_values
first.Series.between
method to check whether a those averages exist between two arbitrary values.# Which hash(es) are measured the most often.
df.index.get_level_values('hash').value_counts()
# IRR-99876-UTY 3
# IRR-881-CKZ 3
# IRR-10881-CKZ 1
# Name: hash, dtype: int64
# ---
# Which hashes have an average value between two values.
## Here you can see that I'm testing whether the average is between 0 and 4
print(df.groupby('hash')['values'].mean().between(0, 4))
# IRR-10881-CKZ False
# IRR-881-CKZ False
# IRR-99876-UTY True
# Name: values, dtype: bool
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.