
Create Tabular Dataset in Azure using Python SDK

So I'm just starting with Azure and I have this problem:

Here is my code:

from azureml.core import Workspace, Datastore
from azureml.data.datapath import DataPath
from azureml.data.dataset_factory import TabularDatasetFactory

def getWorkspace(name):
    ws = Workspace.get(
            name=name,
            subscription_id=sid,
            resource_group='my_ressource',
            location='my_location')
    return ws

def uploadDataset(ws, file, separator=','):
    datastore = Datastore.get_default(ws)
    path = DataPath(datastore=datastore, path_on_datastore=file)
    dataset = TabularDatasetFactory.from_delimited_files(path=path, separator=separator)
    #dataset = Dataset.Tabular.from_delimited_files(path=path, separator=separator)
    print(dataset.to_pandas_dataframe().head())
    print(type(dataset))

ws = getWorkspace(workspace_name)
uploadDataset(ws, my_csv,";")

#result :
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  ...  density    pH  sulphates  alcohol  quality
0            7.5              0.33         0.32            11.1      0.036  ...  0.99620  3.15       0.34     10.5        6
1            6.3              0.27         0.29            12.2      0.044  ...  0.99782  3.14       0.40      8.8        6
2            7.0              0.30         0.51            13.6      0.050  ...  0.99760  3.07       0.52      9.6        7
3            7.4              0.38         0.27             7.5      0.041  ...  0.99535  3.17       0.43     10.0        5
4            8.1              0.12         0.38             0.9      0.034  ...  0.99026  2.80       0.55     12.0        6
[5 rows x 12 columns]
<class 'azureml.data.tabular_dataset.TabularDataset'>

But when I go to Microsoft Azure Machine Learning Studio, this dataset doesn't appear under Datasets. What am I doing wrong?

First, check the format of the file: if it is .csv or .tsv, use the from_delimited_files() method of the TabularDatasetFactory class to read it. For .parquet files there is the from_parquet_files() method. There is also a register_pandas_dataframe() method, which registers the TabularDataset to the workspace and uploads the data to your underlying storage in one step.
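A dataset only shows up under Datasets in ML Studio once it is registered; creating it with from_delimited_files() alone is not enough. A minimal sketch of register_pandas_dataframe(), which does the upload and the registration together (the dataframe contents, the target folder 'my_folder', and the dataset name 'my_dataset' are assumptions for illustration):

```python
import pandas as pd
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get_default(ws)

# Hypothetical sample data.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Uploads the dataframe to the datastore and registers the resulting
# TabularDataset in the workspace, so it appears in the Studio UI.
dataset = Dataset.Tabular.register_pandas_dataframe(
    df, target=(datastore, 'my_folder'), name='my_dataset')
```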

Also, if the storage sits behind a virtual network or firewall, make sure to pass the parameter validate=False to from_delimited_files(), as this skips the validation/verification step.
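A minimal sketch of that flag (the path 'secure/data.csv' is an assumption):

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get_default(ws)

# validate=False skips the check that the files can actually be reached,
# which would otherwise fail when the datastore is behind a VNet/firewall.
dataset = Dataset.Tabular.from_delimited_files(
    path=(datastore, 'secure/data.csv'),
    separator=',',
    validate=False)
```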

Specify the datastore name along with the workspace as below:

datastore_name = 'your datastore name'

workspace = Workspace.from_config()  # if we have an existing workspace

datastore = Datastore.get(workspace, datastore_name)

Below is the way to create a TabularDataset from three file paths:

datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths)

If we want to specify the separator, we can do it as below:

Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths, separator=',')
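To make the dataset from the snippet above visible under Datasets in ML Studio, it still needs to be registered with register(); a minimal sketch (the dataset name 'weather_ds' is an assumption):

```python
from azureml.core import Workspace, Datastore, Dataset

workspace = Workspace.from_config()
datastore = Datastore.get(workspace, 'your datastore name')

Create_TBDS = Dataset.Tabular.from_delimited_files(
    path=[(datastore, 'weather/2018/11.csv')], separator=',')

# register() is what makes the dataset appear in the Studio UI;
# create_new_version=True adds a new version if the name already exists.
Create_TBDS = Create_TBDS.register(workspace=workspace,
                                   name='weather_ds',
                                   create_new_version=True)
```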
