I am using Google Colab to teach Spark (including Spark SQL) to my students, and I use the following set of commands to install and configure Spark:
!pip install -q pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
Load Data
!wget 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
!mv kddcup.data_10_percent.gz kdd10.gz
data_file = "./kdd10.gz"
raw_data = sc.textFile(data_file) # approx 490K records
raw_data_sample = raw_data.sample(False, 0.1, 1234) # 10% sample, 49K records
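As a side note, `sample(False, 0.1, 1234)` performs Bernoulli sampling without replacement: each record is kept independently with probability 0.1, so the sample size is only approximately 10%. A plain-Python sketch of the same idea, using made-up stand-in records rather than the KDD data:

```python
import random

# Sketch of RDD.sample(withReplacement=False, fraction=0.1, seed=1234):
# each record is kept independently with probability 0.1 (Bernoulli sampling),
# so the resulting sample is only *approximately* 10% of the data.
random.seed(1234)
records = ["record-%d" % i for i in range(1000)]   # illustrative records only
sample = [r for r in records if random.random() < 0.1]
print(len(sample))  # close to 100, but rarely exactly 100
```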
Prepare the Data
from pyspark.sql import Row
csv_data = raw_data.map(lambda l: l.split(",")) # using full data
#csv_data = raw_data_sample.map(lambda l: l.split(",")) # using 10% sample data
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5])
))
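For reference, each real KDD line contains many more comma-separated fields than the six used above. The parsing the lambda performs can be sketched in plain Python on a single illustrative line (the field values below are made up in the KDD format, not taken from the dataset):

```python
# A made-up line in the KDD-99 comma-separated format (illustrative only):
line = "0,tcp,http,SF,181,5450"
p = line.split(",")

# Mirror the Row(...) construction above with a plain dict:
record = {
    "duration": int(p[0]),       # numeric fields are cast to int
    "protocol_type": p[1],       # string fields are kept as-is
    "service": p[2],
    "flag": p[3],
    "src_bytes": int(p[4]),
    "dst_bytes": int(p[5]),
}
print(record["protocol_type"], record["src_bytes"])  # tcp 181
```

Note that the `int(...)` casts run lazily, only when Spark actually materializes the RDD, so a malformed line would surface as an error at query time rather than at `map` time.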
Create Tables
#interactions_df = sqlContext.createDataFrame(row_data) -- deprecated
interactions_df = spark.createDataFrame(row_data)
#interactions_df.registerTempTable("interactions") -- deprecated
interactions_df.createOrReplaceTempView("interactions")
Run SQL Query
#tcp_interactions = sqlContext.sql(""" --- deprecated
tcp_interactions = spark.sql("""
SELECT duration, protocol_type, dst_bytes FROM interactions WHERE protocol_type = 'udp'
""")
tcp_interactions.show()
My problem is as follows.
With the 10% sample data, the query runs perfectly and gives the result, but with the full 490K-record datafile, the query hangs indefinitely. There is no error as such; only when I abort the command do I get the following traceback:
/usr/local/lib/python3.7/dist-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
492
493 if isinstance(truncate, bool) and truncate:
--> 494 print(self._jdf.showString(n, 20, vertical))
495 else:
496 try:
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1318 proto.END_COMMAND_PART
1319
-> 1320 answer = self.gateway_client.send_command(command)
1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in send_command(self, command, retry, binary)
1036 connection = self._get_connection()
1037 try:
-> 1038 response = connection.send_command(command)
1039 if binary:
1040 return response, self._create_connection_guard(connection)
/usr/local/lib/python3.7/dist-packages/py4j/clientserver.py in send_command(self, command)
473 try:
474 while True:
--> 475 answer = smart_decode(self.stream.readline()[:-1])
476 logger.debug("Answer received: {0}".format(answer))
477 # Happens when a the other end is dead. There might be an empty
/usr/lib/python3.7/socket.py in readinto(self, b)
587 while True:
588 try:
--> 589 return self._sock.recv_into(b)
590 except timeout:
591 self._timeout_occurred = True
KeyboardInterrupt:
I have observed the same problem with another dataset of similar size. What is interesting is that six months ago, while teaching the previous batch of students, this code worked perfectly with the full 490K records. So what is going wrong, and how can I fix it? I would be grateful for any help.
What is puzzling is that the following query works for both the full data and the sample data, but nothing else does:
#tcp_interactions = sqlContext.sql(""" --- deprecated
tcp_interactions = spark.sql("""
SELECT distinct(protocol_type) FROM interactions
""")
tcp_interactions.show()
I fixed the problem by falling back to an earlier version of Spark. Manually downloading and installing Spark 3.0.3, instead of letting pip install the latest version (3.2.1), solves the problem. To see the solution working, please visit https://github.com/Praxis-QR/BDSN/blob/main/SQL_Spark_with_OLD_version_JoseDianes_Intro.ipynb
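An alternative to downloading and installing Spark manually, assuming a matching wheel for that version is still published on PyPI, would be to pin the version directly in the original pip step:

```shell
# Pin PySpark to the earlier version instead of taking the latest (3.2.1)
pip install -q pyspark==3.0.3
```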