
Load data from Azure Data Lake to Jupyter notebook on DSVM

I am trying to load data from Azure Data Lake into a Jupyter notebook on my Data Science VM. Note that I am the owner of the Data Lake store and have read, write, and execute permissions. The Data Science VM running Jupyter is under the same subscription and the same resource group. I have tried the following two approaches, both based on this blog post, and both run into an issue.


  • PySpark

The following is the code I use to load the data with PySpark:

hvacText = sc.textFile("adl://name.azuredatalakestore.net/file_to_read.csv")
hvacText.count()

The following exception is thrown:

Py4JJavaError: An error occurred while calling o52.text.
: java.io.IOException: No FileSystem for scheme: adl
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:616)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:350)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:349)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:623)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)

  • Python SDK:

The following is the code I use to access the data with the SDK:

from azure.datalake.store import core, lib, multithread    
token = lib.auth()
# output: To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code XXXX to authenticate.
# here I open the link and enter the code successfully
adl = core.AzureDLFileSystem(token, store_name='store_name')
adl.ls('/')

The following exception happens:

DatalakeRESTException: Data-lake REST exception: LISTSTATUS, .

I am more interested in fixing the Spark solution, but any help would be much appreciated.

You need to do two things to use the ADLS connector on the DSVM:

  1. Add two jars to Spark's classpath: edit /dsvm/tools/spark/current/conf/spark-defaults.conf and list both hadoop-azure-datalake-3.0.0-alpha3.jar and azure-data-lake-store-sdk-2.1.5.jar under spark.jars. We don't load them by default so users get a faster startup time.
  2. Create core-site.xml: in the same conf directory, copy core-site.xml.template to core-site.xml, keep only the ADLS section, and fill in your values.

You also need to fix broken symlinks in the current image: in /dsvm/tools/spark/current/jars, there are symlinks for azure-data-lake-store-sdk-2.0.11.jar and hadoop-azure-datalake-3.0.0-alpha2.jar. You should remove these and add symlinks to /opt/adls-jars/hadoop-azure-datalake-3.0.0-alpha3.jar and /opt/adls-jars/azure-data-lake-store-sdk-2.1.5.jar. This is a bug on our part.
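Taken together, the steps above can be sketched as a small script. This is only a sketch under the paths named in this answer; the `configure_adls` helper is illustrative, not something shipped on the image.

```python
import os
import shutil
from pathlib import Path

# Jars the answer says Spark needs, and the stale symlinks it says to remove.
ADLS_JAR_NAMES = [
    "hadoop-azure-datalake-3.0.0-alpha3.jar",
    "azure-data-lake-store-sdk-2.1.5.jar",
]
STALE_JAR_NAMES = [
    "hadoop-azure-datalake-3.0.0-alpha2.jar",
    "azure-data-lake-store-sdk-2.0.11.jar",
]

def configure_adls(spark_home: str, adls_jars: str) -> None:
    """Apply the steps from the answer under the given Spark home."""
    conf = Path(spark_home) / "conf"
    jars = Path(spark_home) / "jars"
    jar_paths = [str(Path(adls_jars) / name) for name in ADLS_JAR_NAMES]

    # 1. Register both jars under spark.jars in spark-defaults.conf.
    with open(conf / "spark-defaults.conf", "a") as f:
        f.write("spark.jars %s\n" % ",".join(jar_paths))

    # 2. Create core-site.xml from the shipped template; you still need to
    #    fill in your ADLS values by hand afterwards.
    template = conf / "core-site.xml.template"
    if template.exists():
        shutil.copy(template, conf / "core-site.xml")

    # 3. Drop the broken symlinks and link the newer jars into Spark's jar dir.
    for name in STALE_JAR_NAMES:
        (jars / name).unlink(missing_ok=True)
    for path in jar_paths:
        link = jars / Path(path).name
        link.unlink(missing_ok=True)  # idempotent on re-runs
        link.symlink_to(path)

# Only run against the real DSVM layout if it is actually present.
if __name__ == "__main__" and os.path.isdir("/dsvm/tools/spark/current"):
    configure_adls("/dsvm/tools/spark/current", "/opt/adls-jars")
```

Restart the Jupyter kernel (or the notebook server) afterwards so Spark picks up the new configuration.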

Did you create or edit core-site.xml in $SPARK_HOME/conf (which must be /dsvm/tools/spark/current/conf) and add the configuration properties from the reference article you linked, with your ADLS access tokens and the adl scheme details? (Pasted here for convenience.)

<configuration>
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>YOUR TOKEN ENDPOINT</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>YOUR CLIENT ID</value>
  </property>
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>YOUR CLIENT SECRET</value>
  </property>
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
</configuration>
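As a quick sanity check before restarting Jupyter, you can verify that every property above actually made it into core-site.xml. This is a stdlib-only sketch; the path is the DSVM default mentioned above, and `missing_adls_properties` is an illustrative helper, not part of any SDK.

```python
import os
import xml.etree.ElementTree as ET

# The six property names from the core-site.xml snippet above.
REQUIRED = {
    "dfs.adls.oauth2.access.token.provider.type",
    "dfs.adls.oauth2.refresh.url",
    "dfs.adls.oauth2.client.id",
    "dfs.adls.oauth2.credential",
    "fs.adl.impl",
    "fs.AbstractFileSystem.adl.impl",
}

def missing_adls_properties(core_site_path: str) -> set:
    """Return the required ADLS property names absent from core-site.xml."""
    root = ET.parse(core_site_path).getroot()
    present = {prop.findtext("name") for prop in root.iter("property")}
    return REQUIRED - present

# Only check the real file if it exists (DSVM default location).
if __name__ == "__main__" and os.path.exists("/dsvm/tools/spark/current/conf/core-site.xml"):
    print(missing_adls_properties("/dsvm/tools/spark/current/conf/core-site.xml"))
```

An empty set means all six properties are present; anything else names what still needs to be added.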

The ADLS connectivity JAR files are already prebuilt into the DSVM.
