
Utilizing Foundry APIs, how do you get the number of rows and columns for a dataset?

I'm looking to retrieve the number of records and columns in a dataset using the Foundry APIs. One API I found that seems to return the number of records is ".../monocle/api/table/stats", but I don't see how to pass it the rid of a dataset.

I'm ultimately trying to get the total columns, records and size for all the datasets I manage, so that I can build a dashboard in Quiver or Slate showing how much data we manage within the Foundry platform.

You could use the following sample code to calculate statistics for a dataset:

import time
import requests
from urllib.parse import quote_plus
import json

def calculate_dataset_stats(token: str,
                            dataset_rid: str,
                            branch='master',
                            api_base='https://foundry-stack.com'
                            ) -> dict:
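    """
    Calculates statistics for the last transaction of a dataset on a branch.

    Args:
        token: bearer token used to authenticate against Foundry
        dataset_rid: the dataset rid
        branch: branch of the dataset
        api_base: base URL of the Foundry stack

    Returns: a dictionary with statistics

    Raises:
        ValueError: if the stats calculation finishes with status FAILED
    """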
    start_stats_calculation = requests.post(f"{api_base}/foundry-stats/api/stats/datasets/"
                                            f"{dataset_rid}/branches/{quote_plus(branch)}",
                                            headers={
                                                'content-type': "application/json",
                                                'authorization': f"Bearer {token}",
                                            })
    start_stats_calculation.raise_for_status()
    metadata = start_stats_calculation.json()
    transaction_rid = metadata['view']['endTransactionRid']
    schema_id = metadata['view']['schemaId']

    # Poll until the calculation reaches a terminal state.
    while True:
        response = requests.get(f"{api_base}/foundry-stats/api/stats/datasets/"
                                f"{dataset_rid}/branches/{quote_plus(branch)}",
                                headers={
                                    'content-type': "application/json",
                                    'authorization': f"Bearer {token}",
                                },
                                params={
                                    'endTransactionRid': transaction_rid,
                                    'schemaId': schema_id
                                })
        response.raise_for_status()
        maybe_stats = response.json()
        if maybe_stats['status'] in ('SUCCEEDED', 'FAILED'):
            break
        # Not finished yet: wait briefly before asking again.
        time.sleep(0.5)

    if maybe_stats['status'] != 'SUCCEEDED':
        raise ValueError(f'Stats Calculation failed for dataset {dataset_rid}. '
                         f'Failure handling not implemented.')

    return maybe_stats['result']['succeededDatasetResult']['stats']


token = "eyJwb..."
dataset_rid = "ri.foundry.main.dataset.14703427-09ab-4c9c-b036-1234b34d150b"
stats = calculate_dataset_stats(token, dataset_rid)

print(json.dumps(stats, indent=4))
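Since the question asks about all the datasets you manage, the function above could be called once per dataset. The following is only a minimal sketch: collect_stats is a hypothetical helper name, and it takes the fetching function as a parameter so that it works with any stats call and keeps going when a single dataset fails.

```python
from typing import Callable, Dict, Iterable, Tuple


def collect_stats(
    dataset_rids: Iterable[str],
    fetch: Callable[[str], dict],
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Call fetch(rid) for every rid and return (results, errors).

    collect_stats is a hypothetical helper, not part of any Foundry API.
    """
    results: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for rid in dataset_rids:
        try:
            results[rid] = fetch(rid)
        except Exception as exc:
            # Record the failure and continue with the remaining datasets.
            errors[rid] = str(exc)
    return results, errors
```

With the function from this answer, a call could look like collect_stats(my_rids, lambda rid: calculate_dataset_stats(token, rid)), where my_rids is your own list of dataset rids.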

The other answer uses the computeStats and getDatasetStats Foundry APIs. There is another API, getComputedDatasetStats, which returns the statistics you need and may even perform better.

According to my tests:

  • getDatasetStats is not available until computeStats has been run, and computeStats takes time. getComputedDatasetStats, on the other hand, is available right away.
  • getComputedDatasetStats returns sizeInBytes, but only if computeStats has not been run. After I called the computeStats API and the job finished, sizeInBytes became null. getDatasetStats showed null too.

To get the row count, column count and dataset size, you could try something similar to this:

import requests
import json

def getComputedDatasetStats(token, dataset_rid, api_base='https://.....'):
    response = requests.post(
        url=f'{api_base}/foundry-stats/api/computed-stats-v2/get',
        headers={
            'content-type': 'application/json',
            'Authorization': 'Bearer ' + token
        },
        data=json.dumps({
            "datasetRid": dataset_rid,
            "branch": "master"
        })
    )
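    # Fail loudly on HTTP errors instead of parsing an error body as stats.
    response.raise_for_status()
    return response.json()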

token = 'eyJwb.....'
dataset_rid = 'ri.foundry.main.dataset.1d9ef04e-7ec6-456e-8326-1c64b1105431'

result = getComputedDatasetStats(token, dataset_rid)

# full resulting json:
# print(json.dumps(result, indent=4))

# required statistics:
print('size:', result['computedDatasetStats']['sizeInBytes'])
print('rows:', result['computedDatasetStats']['rowCount'])
print('cols:', len(result['computedDatasetStats']['columnStats']))

Example output:

size: 24
rows: 2
cols: 2
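Because sizeInBytes can come back as null, as noted in the test observations above, a dashboard feed may want to read these fields defensively. This is a minimal sketch that assumes the response shape used in the snippet above (computedDatasetStats containing sizeInBytes, rowCount and columnStats). summarize_stats and the output key names are hypothetical, chosen only for illustration.

```python
from typing import Any, Dict


def summarize_stats(dataset_rid: str, result: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one stats response into a row; missing values become None.

    summarize_stats is a hypothetical helper; the response keys are assumed
    from the snippet in this answer, not taken from API documentation.
    """
    stats = result.get('computedDatasetStats') or {}
    column_stats = stats.get('columnStats')
    return {
        'dataset_rid': dataset_rid,
        # May be None, for example after computeStats has been run.
        'size_in_bytes': stats.get('sizeInBytes'),
        'row_count': stats.get('rowCount'),
        # len() works whether columnStats is a list or a mapping.
        'column_count': len(column_stats) if column_stats is not None else None,
    }
```

Calling summarize_stats(dataset_rid, result) for each dataset gives a list of flat rows that can be written to a dataset and charted in Quiver or Slate.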

