Creating a tabular dataset in Azure using the Python SDK

Question

I have just started using Azure and ran into the following problem.

Here is my code:

from azureml.core import Dataset, Datastore, Workspace
from azureml.data.datapath import DataPath
from azureml.data.dataset_factory import TabularDatasetFactory

# sid, workspace_name and my_csv are assumed to be defined earlier.

def getWorkspace(name):
    ws = Workspace.get(
            name=name,
            subscription_id=sid,
            resource_group='my_ressource',
            location='my_location')
    return ws

def uploadDataset(ws, file, separator=','):
    datastore = Datastore.get_default(ws)
    path = DataPath(datastore=datastore, path_on_datastore=file)
    dataset = TabularDatasetFactory.from_delimited_files(path=path, separator=separator)
    #dataset = Dataset.Tabular.from_delimited_files(path=path, separator=separator)
    print(dataset.to_pandas_dataframe().head())
    print(type(dataset))

ws = getWorkspace(workspace_name)
uploadDataset(ws, my_csv, ";")

# Result:
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  ...  density    pH  sulphates  alcohol  quality
0            7.5              0.33         0.32            11.1      0.036  ...  0.99620  3.15       0.34     10.5        6
1            6.3              0.27         0.29            12.2      0.044  ...  0.99782  3.14       0.40      8.8        6
2            7.0              0.30         0.51            13.6      0.050  ...  0.99760  3.07       0.52      9.6        7
3            7.4              0.38         0.27             7.5      0.041  ...  0.99535  3.17       0.43     10.0        5
4            8.1              0.12         0.38             0.9      0.034  ...  0.99026  2.80       0.55     12.0        6
[5 rows x 12 columns]
<class 'azureml.data.tabular_dataset.TabularDataset'>

But when I open Datasets in Microsoft Azure Machine Learning studio, the dataset has not been created. What am I doing wrong?

Answer 1

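First, check the format of the file. If it is .csv or .tsv, read it with the from_delimited_files() method of the TabularDatasetFactory class. If it is a .parquet file, use the method named from_parquet_files() instead. Besides these, there is also the register_pandas_dataframe() method, which registers a TabularDataset in the workspace and uploads the data to your underlying storage.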
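The choice of reader described above can be sketched as a small lookup. This is only an illustrative helper and not part of the SDK: `reader_for` and the `READERS` table are hypothetical names, and only the two reader methods mentioned in this answer are mapped.

```python
from pathlib import PurePosixPath

# Hypothetical lookup table: file extension -> name of the
# TabularDatasetFactory method mentioned in the answer.
READERS = {
    ".csv": "from_delimited_files",
    ".tsv": "from_delimited_files",
    ".parquet": "from_parquet_files",
}


def reader_for(path_on_datastore):
    """Return the factory method name suited to the file's extension."""
    suffix = PurePosixPath(path_on_datastore).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(f"no tabular reader known for {suffix!r} files")
```

With the SDK installed, `getattr(Dataset.Tabular, reader_for(path))` would then give the method to call for a given path.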

Also check the storage: if a virtual network or firewall is enabled on it, make sure to pass validate=False to from_delimited_files(), because this skips the validation step.
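As a minimal sketch of that advice, the keyword arguments can be built in one place so that validation is switched off only when the storage is network-restricted. `delimited_kwargs` and its `storage_is_restricted` flag are hypothetical names introduced here; `separator` and `validate` are the parameters named in this thread.

```python
def delimited_kwargs(storage_is_restricted, separator=","):
    """Build keyword arguments for from_delimited_files().

    validate=False skips the check that the data can be loaded, which is
    what the answer recommends when the storage sits behind a virtual
    network or firewall.
    """
    return {"separator": separator, "validate": not storage_is_restricted}


# Intended use (needs the azureml-core package and a real datastore;
# "my_file.csv" is a placeholder path):
#   dataset = Dataset.Tabular.from_delimited_files(
#       path=(datastore, "my_file.csv"),
#       **delimited_kwargs(storage_is_restricted=True, separator=";"))
```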

Specify the datastore name and the workspace as follows:

from azureml.core import Dataset, Datastore, Workspace

datastore_name = 'your datastore name'

workspace = Workspace.from_config()  # load an existing workspace from a saved config file

datastore = Datastore.get(workspace, datastore_name)

Here is how to create a TabularDataset from three file paths.

datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]

Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths)

If we want to specify the separator, we can do it like this:

Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths, separator=',')
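Note that from_delimited_files() on its own only builds a dataset object in the current session, which would explain why nothing appears under Datasets in the studio. A minimal sketch of the missing step, assuming the dataset should be registered under a name of your choosing, is to call `register()` on the returned TabularDataset. The helper name `create_and_register` is hypothetical, and the factory is passed in as an argument (in practice `Dataset.Tabular`) so the sketch is not tied to a live workspace.

```python
def create_and_register(workspace, factory, paths, name, separator=","):
    """Create a tabular dataset from delimited files and register it.

    `factory` is expected to be azureml.core.Dataset.Tabular; it is a
    parameter here only to keep the sketch independent of a workspace.
    """
    dataset = factory.from_delimited_files(path=paths, separator=separator)
    # register() is the call that makes the dataset visible in the
    # workspace; create_new_version=True registers a new version if the
    # name is already taken.
    return dataset.register(workspace=workspace, name=name,
                            create_new_version=True)


# Intended use ("weather" is a placeholder dataset name):
#   registered = create_and_register(
#       workspace, Dataset.Tabular, datastore_paths, "weather")
```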

Solution 1
0 2021-10-18 07:10:23