I want to retrieve the number of records and columns in a dataset using the Foundry APIs. One API I found that seems to return the number of records is ".../monocle/api/table/stats", but I don't see how to pass it the RID of a dataset.
Ultimately, I'm trying to get the total columns, records and size for all the datasets I manage, so that I can build a dashboard in Quiver or Slate showing the amount of data we manage within the Foundry platform.
You could use the following sample code to calculate statistics for a dataset:
import json
import time
from urllib.parse import quote_plus

import requests


def calculate_dataset_stats(token: str,
                            dataset_rid: str,
                            branch='master',
                            api_base='https://foundry-stack.com'
                            ) -> dict:
    """
    Calculates statistics for last transaction of a dataset in a branch
    Args:
        token: bearer token used for authorization
        dataset_rid: the dataset rid
        branch: branch of the dataset
        api_base: base URL of the Foundry stack
    Returns: a dictionary with statistics
    """
    stats_url = (f"{api_base}/foundry-stats/api/stats/datasets/"
                 f"{dataset_rid}/branches/{quote_plus(branch)}")
    headers = {
        'content-type': "application/json",
        'authorization': f"Bearer {token}",
    }

    # Trigger the stats calculation for the latest view of the branch
    start_stats_calculation = requests.post(stats_url, headers=headers)
    start_stats_calculation.raise_for_status()
    metadata = start_stats_calculation.json()
    transaction_rid = metadata['view']['endTransactionRid']
    schema_id = metadata['view']['schemaId']

    calculated_finished = False
    maybe_stats = {
        'status': 'FAILED'
    }
    # Poll until the calculation reports a terminal status
    while not calculated_finished:
        response = requests.get(stats_url,
                                headers=headers,
                                params={
                                    'endTransactionRid': transaction_rid,
                                    'schemaId': schema_id
                                })
        response.raise_for_status()
        maybe_stats = response.json()
        if maybe_stats['status'] in ('SUCCEEDED', 'FAILED'):
            calculated_finished = True
        else:
            # Not finished yet: wait briefly before polling again
            time.sleep(0.5)

    if maybe_stats['status'] != 'SUCCEEDED':
        raise ValueError(f'Stats Calculation failed for dataset {dataset_rid}. '
                         f'Failure handling not implemented.')
    return maybe_stats['result']['succeededDatasetResult']['stats']


token = "eyJwb..."
dataset_rid = "ri.foundry.main.dataset.14703427-09ab-4c9c-b036-1234b34d150b"
stats = calculate_dataset_stats(token, dataset_rid)
print(json.dumps(stats, indent=4))
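The polling loop in the code above runs until the status is SUCCEEDED or FAILED, so it could wait indefinitely if the calculation never reaches one of those states. Below is a minimal sketch of a polling helper with a timeout. The name poll_until_done and the parameters fetch_status, interval_s, timeout_s and sleep are hypothetical and not part of any Foundry API; the only assumption taken from the code above is that the response carries a 'status' field with the terminal values SUCCEEDED and FAILED.

```python
import time
from typing import Callable

# Terminal status values, as used by the polling loop in the answer above.
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED"}


def poll_until_done(fetch_status: Callable[[], dict],
                    interval_s: float = 0.5,
                    timeout_s: float = 300.0,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch_status() until its 'status' is terminal or the timeout expires.

    fetch_status is a hypothetical callable that performs one GET request
    and returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        payload = fetch_status()
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No terminal status after {timeout_s} seconds")
        sleep(interval_s)
```

In calculate_dataset_stats, the GET request could be wrapped in a small function and passed in as fetch_status; the caller still has to check whether the returned status is SUCCEEDED.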
The other answer uses the computeStats and getDatasetStats Foundry APIs. There is another API, getComputedDatasetStats, which returns the stats you need and may even perform better.
According to my tests, getComputedDatasetStats returns sizeInBytes, but only if computeStats has not been run. When I called the computeStats API and the job finished, sizeInBytes became null, and getDatasetStats showed null too.
To get the row count, column count and dataset size, you may try something similar to this:
import requests
import json


def getComputedDatasetStats(token, dataset_rid, api_base='https://.....'):
    response = requests.post(
        url=f'{api_base}/foundry-stats/api/computed-stats-v2/get',
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": dataset_rid,
            "branch": "master"
        })
    )
    return response.json()


token = 'eyJwb.....'
dataset_rid = 'ri.foundry.main.dataset.1d9ef04e-7ec6-456e-8326-1c64b1105431'
result = getComputedDatasetStats(token, dataset_rid)

# full resulting json:
# print(json.dumps(result, indent=4))

# required statistics:
print('size:', result['computedDatasetStats']['sizeInBytes'])
print('rows:', result['computedDatasetStats']['rowCount'])
print('cols:', len(result['computedDatasetStats']['columnStats']))
Example output:
size: 24
rows: 2
cols: 2
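Since the goal is a dashboard over many datasets, and sizeInBytes can come back as null, the response may be easier to handle once it is flattened into one row per dataset. The following is only a sketch: summarize_computed_stats and collect_stats are hypothetical helper names, fetch stands for any callable that returns the JSON of getComputedDatasetStats for a RID, and the only keys assumed are the ones printed above (computedDatasetStats, sizeInBytes, rowCount, columnStats).

```python
from typing import Callable, Iterable, List


def summarize_computed_stats(result: dict) -> dict:
    """Reduce a getComputedDatasetStats response to size, rows and column count.

    Missing or null fields are returned as None instead of raising.
    """
    stats = result.get("computedDatasetStats") or {}
    column_stats = stats.get("columnStats")
    return {
        "size_bytes": stats.get("sizeInBytes"),
        "rows": stats.get("rowCount"),
        # len() works whether columnStats is a list or a mapping
        "cols": len(column_stats) if column_stats is not None else None,
    }


def collect_stats(dataset_rids: Iterable[str],
                  fetch: Callable[[str], dict]) -> List[dict]:
    """Build one summary row per dataset RID using the supplied fetch callable."""
    rows = []
    for rid in dataset_rids:
        summary = summarize_computed_stats(fetch(rid))
        rows.append({"dataset_rid": rid, **summary})
    return rows
```

With the function above, fetch could be something like lambda rid: getComputedDatasetStats(token, rid), and the resulting rows could be written to a dataset that Quiver or Slate reads.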