
Spark SQL in Google Colab fails on large data

I am using Google Colab to teach Spark (including Spark SQL) to my students, and I use the following set of commands to install and configure Spark:

!pip install -q pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
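
Since the fix described at the end of this post turned out to be version-related, it is worth confirming which Spark version pip pulled in (a quick check, assuming the SparkSession and SparkContext are named spark and sc as above):

print(spark.version)   # pip currently installs Spark 3.2.1
print(sc.version)      # same version, reported via the SparkContext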

Load Data

!wget 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
!mv kddcup.data_10_percent.gz kdd10.gz

data_file = "./kdd10.gz"
raw_data = sc.textFile(data_file)                    # approx 490K records
raw_data_sample = raw_data.sample(False, 0.1, 1234)  # 10% sample, 49K records

Prepare the Data

from pyspark.sql import Row
csv_data = raw_data.map(lambda l: l.split(","))              # using full data
#csv_data = raw_data_sample.map(lambda l: l.split(","))      # using 10% sample data
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]), 
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
    )
)

Create Tables

#interactions_df = sqlContext.createDataFrame(row_data) -- deprecated
interactions_df = spark.createDataFrame(row_data)
#interactions_df.registerTempTable("interactions") -- deprecated
interactions_df.createOrReplaceTempView("interactions")

Run SQL Query

#tcp_interactions = sqlContext.sql("""               --- deprecated
tcp_interactions = spark.sql("""
    SELECT duration, protocol_type, dst_bytes FROM interactions WHERE protocol_type = 'udp' 
""")
tcp_interactions.show()

My problem is as follows.

With the 10% sample data, the query runs perfectly and returns the result, but with the full 490K-record data file, the query hangs indefinitely. There is no error as such, except the following when I abort the command:

/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    492 
    493         if isinstance(truncate, bool) and truncate:
--> 494             print(self._jdf.showString(n, 20, vertical))
    495         else:
    496             try:

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
   1318             proto.END_COMMAND_PART
   1319 
-> 1320         answer = self.gateway_client.send_command(command)
   1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)

/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in send_command(self, command, retry, binary)
   1036         connection = self._get_connection()
   1037         try:
-> 1038             response = connection.send_command(command)
   1039             if binary:
   1040                 return response, self._create_connection_guard(connection)

/usr/local/lib/python3.7/dist-packages/py4j/clientserver.py in send_command(self, command)
    473         try:
    474             while True:
--> 475                 answer = smart_decode(self.stream.readline()[:-1])
    476                 logger.debug("Answer received: {0}".format(answer))
    477                 # Happens when a the other end is dead. There might be an empty

/usr/lib/python3.7/socket.py in readinto(self, b)
    587         while True:
    588             try:
--> 589                 return self._sock.recv_into(b)
    590             except timeout:
    591                 self._timeout_occurred = True

KeyboardInterrupt: 

I have observed the same problem with another dataset of similar size. What is interesting is that six months ago, while teaching the previous batch of students, this code worked perfectly with the same 490K records. So what is going wrong, and how can I fix it? Grateful for any help in this regard.

What is puzzling is that the following query works for both the full data and the sample data, but nothing else works!

#tcp_interactions = sqlContext.sql("""  --- deprecated
tcp_interactions = spark.sql("""
    SELECT distinct(protocol_type) FROM interactions 
""")
tcp_interactions.show()
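
A minimal way to check whether the hang comes from parsing the full file or from the show() call itself is to force an action on the underlying RDD and on an equivalent DataFrame filter separately (a debugging sketch only, assuming the variables defined above):

row_data.count()                                        # forces a full parse of the 490K lines; a hang here points upstream of Spark SQL
interactions_df.where("protocol_type = 'udp'").count()  # same filter as the query, without show()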

I fixed the problem by falling back to an earlier version of Spark. Manually downloading and installing Spark 3.0.3 instead of using pip (which installs Spark 3.2.1) solves the problem. To see the solution working, please visit https://github.com/Praxis-QR/BDSN/blob/main/SQL_Spark_with_OLD_version_JoseDianes_Intro.ipynb
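
For reference, a typical way to do the manual install in a Colab cell looks roughly like the sketch below (the archive URL, Hadoop build and Java package shown here are the usual choices, not necessarily exactly what the linked notebook uses):

!apt-get install -y -qq openjdk-8-jdk-headless
!wget -q https://archive.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar xf spark-3.0.3-bin-hadoop2.7.tgz

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"    # default JDK 8 path on Colab's Ubuntu image (assumed)
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop2.7"  # where the tarball was unpacked

!pip install -q findspark
import findspark
findspark.init()          # points Python at the unpacked Spark via SPARK_HOME

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext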
