
How to know the total number of rows in our SQL query result if it is greater than 1 million, in an Azure Databricks notebook?

I am trying to query a huge data set and I am able to download only 1 million rows' worth of data. I want to know how much data there is in total, as part of the query result.

Assuming your DataFrame / query result is stored in the df variable:

df.count()

will give you the number of rows and

len(df.columns)

will give you the number of columns. You can further explore the schema of the data with df.printSchema()

1 million rows is generally not that many, and while a job has to run to count all of them, it shouldn't be a problem.
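
For illustration, a minimal sketch putting these together (assuming df already holds your query result; the variable name is just the placeholder used above):

# df.count() triggers a Spark job that scans all partitions;
# df.columns is metadata only, so no job runs for it.
print(df.count(), len(df.columns))

# Print the column names and types as a tree.
df.printSchema()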

There are three solutions to meet your needs; my sample code for each is below.

  1. To mount your Azure Blob Storage container to the Databricks File System (DBFS), please follow the section Mount Azure Blob Storage containers to DBFS of the official document Data Sources > Azure Blob Storage. Here is my sample code:

     storage_account_name = '<your storage account name>'
     storage_account_access_key = '<your storage account key>'
     container_name = '<your container name>'

     # Mount the container under /mnt using the storage account key.
     dbutils.fs.mount(
         source = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net",
         mount_point = "/mnt/<a mount directory name under /mnt, such as `test`>",
         extra_configs = {"fs.azure.account.key." + storage_account_name + ".blob.core.windows.net": storage_account_access_key})

     # Read the CSV from the mount point and count rows and columns.
     df = spark.read.csv('/mnt/<your mount directory name under /mnt, such as `test`>/df.csv')
     lines_num = df.count()
     columns_num = len(df.columns)
     print(lines_num, columns_num)
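
     As a side note (not in the original answer): the mount persists until you remove it, and dbutils.fs.unmount can detach it when it is no longer needed, e.g.

     # Detach the mount point created above (the name is a placeholder).
     dbutils.fs.unmount("/mnt/test")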
  2. To access the csv file directly from Azure Blob Storage, please follow the other section Access Azure Blob Storage directly of the official document Data Sources > Azure Blob Storage. Here is my sample code:

     storage_account_name = '<your storage account name>'
     storage_account_access_key = '<your storage account key>'
     container_name = '<your container name>'

     # Register the storage account key with the Spark configuration.
     spark.conf.set(
         "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net",
         storage_account_access_key)

     # Read the CSV directly from its wasbs:// URL and count rows and columns.
     blob_name = '<your csv blob name>'
     url = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net/" + blob_name
     df = spark.read.csv(url)
     lines_num = df.count()
     columns_num = len(df.columns)
     print(lines_num, columns_num)
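
     Note that spark.read.csv treats the first line as data by default, so a header line is counted as a row. If your CSV has a header, passing header=True excludes it from the count and uses it for the column names:

     # header=True: the first line becomes df.columns instead of a data row.
     df = spark.read.csv(url, header=True)
     print(df.count(), len(df.columns))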

The two solutions above use the Spark DataFrame functions, as @Daniel said. For a huge data set, this may consume a lot of memory on the cluster nodes of Azure Databricks, so you can consider my next solution.

  3. To count the lines of the csv blob by streaming the HTTP response body of its URL with a SAS token, using the Azure Storage SDK for Python. First, you need to install the azure-storage package on your Azure Databricks cluster.

     (Screenshots: installing the azure-storage library on the Databricks cluster.)

     My sample code is below.

     from azure.storage.blob.baseblobservice import BaseBlobService
     from azure.storage.blob import BlobPermissions
     from datetime import datetime, timedelta
     import urllib.request

     account_name = '<your account name>'
     account_key = '<your account key>'
     container_name = '<your container name>'
     blob_name = '<your blob name>'

     blob_service = BaseBlobService(
         account_name=account_name,
         account_key=account_key
     )

     # Generate a read-only SAS token valid for one hour and build the blob URL.
     sas_token = blob_service.generate_blob_shared_access_signature(
         container_name,
         blob_name,
         permission=BlobPermissions.READ,
         expiry=datetime.utcnow() + timedelta(hours=1))
     blob_url_with_sas = blob_service.make_blob_url(container_name, blob_name, sas_token=sas_token)

     # Stream the HTTP response line by line; the column count is taken from
     # the first line (note: a header line, if present, is included in lines_num).
     resp = urllib.request.urlopen(blob_url_with_sas)
     is_first = True
     lines_num, columns_num = 0, 0
     for line in resp:
         if is_first:
             columns_num = len(line.split(b','))
             is_first = False
         lines_num += 1
     print(lines_num, columns_num)
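
     One caveat: the loop above counts raw lines and splits the first line on commas, so a header line is included in lines_num and a quoted field containing commas would inflate columns_num. As a hedged refinement (not part of the original answer, and assuming UTF-8 encoding), Python's csv module can handle both:

     import csv
     import io
     import urllib.request

     # Re-open the SAS URL and parse the stream as CSV so quoted fields
     # are handled correctly; the first record is assumed to be the header.
     resp = urllib.request.urlopen(blob_url_with_sas)
     reader = csv.reader(io.TextIOWrapper(resp, encoding='utf-8'))
     header = next(reader)                # column names
     columns_num = len(header)
     lines_num = sum(1 for _ in reader)   # remaining records are data rows
     print(lines_num, columns_num)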
