
PySpark count() can't process 684 GB .txt file

I am using PySpark to count how many times each timestamp appears in a very large data set, using groupBy() and count(). The data set comes from a 684 GB .txt file. However, the count takes a very long time and eventually just stops processing. My work computer has 16 GB of memory and 4 CPU cores. I am also using Jupyter Notebook in Anaconda.
Here is what I have so far:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('pyspark-shellTest2').getOrCreate()
spark

Output:

SparkSession - in-memory

SparkContext

Spark UI

Version: v3.3.0
Master: local[*]
AppName: pyspark-shellTest2
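
One thing worth knowing here: in local mode the driver JVM does all the work, and it defaults to a 1 GB heap regardless of how much RAM the machine has. A minimal sketch of requesting more, assuming the 16 GB machine described above (spark.driver.memory only takes effect on the very first getOrCreate() of the process, so restart the kernel before trying it):

from pyspark.sql import SparkSession

# Sketch, not the original setup: give the local driver most of the machine's RAM,
# leaving headroom for the OS and the Python process itself.
spark = (
    SparkSession.builder
    .appName('pyspark-shellTest2')
    .config('spark.driver.memory', '12g')
    .getOrCreate()
)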

Reading the .txt file and selecting the columns I want to keep:

raw_data = spark.read.options(delimiter="\t",header=True).csv("O:/Corbin/Canvas/requests_12_05_2022.txt")
Get_col = raw_data.select('timestamp_day', 'user_id', 'course_id')
Get_col.show(3)

Output:

+-------------+------------------+------------------+
|timestamp_day|           user_id|         course_id|
+-------------+------------------+------------------+
|   2022-09-15|425465600693903129|                \N|
|   2022-09-15|508873040735657962|193340000000014379|
|   2022-09-15|284347190388427414|193340000000014966|
+-------------+------------------+------------------+
only showing top 3 rows
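
Side note: the \N values in the sample output are read as literal strings. If they are meant to be nulls (an assumption about the file's null marker), a variant of the read using Spark's nullValue CSV option would turn them into real nulls:

# Sketch: map the \N marker to actual nulls at read time
raw_data = spark.read.options(delimiter="\t", header=True, nullValue="\\N").csv("O:/Corbin/Canvas/requests_12_05_2022.txt")
Get_col = raw_data.select('timestamp_day', 'user_id', 'course_id')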

Number of partitions (not sure if I need to change this with repartition()):

Get_col.rdd.getNumPartitions()

Output:

5480
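
For what it's worth, 5480 input partitions is simply the 684 GB file split into the default ~128 MB chunks, which is normally fine. For the groupBy below, the shuffle side usually matters more; it is controlled by spark.sql.shuffle.partitions (default 200). A sketch, where 2000 is an illustrative guess rather than a tuned value:

# Sketch: more, smaller reduce tasks so each one fits comfortably in memory
spark.conf.set('spark.sql.shuffle.partitions', '2000')

A plain repartition() call, by contrast, would add an extra full shuffle of the 684 GB, so it is usually not needed just for a groupBy/count.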

This is where it should produce output, but it takes a very long time and eventually stops processing:

Get_col.groupBy('timestamp_day').count().show()


Since you're not getting any errors yet, I would not immediately assume something is wrong. You might be running into data skew problems, for example, but it is a bit too early to conclude that.
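
One cheap way to get a feeling for possible skew without waiting for the full job: run the same aggregation on a small sample first. A sketch (the 0.1% fraction is an arbitrary choice):

from pyspark.sql import functions as F

# Sketch: estimate the key distribution from a 0.1% sample; one timestamp_day
# dwarfing all the others would hint at skew in the full groupBy
(Get_col.sample(fraction=0.001)
    .groupBy('timestamp_day')
    .count()
    .orderBy(F.desc('count'))
    .show(10))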

What you can do, however, is have a look at the Spark UI. This is an interactive UI, served to any browser, that gives you quite detailed information about your currently running Spark application. You'll be able to see which jobs/stages/tasks are currently running, and whether there might be data skew or other issues.

This service runs by default on port 4040 of the machine your driver is running on. Some examples (there is also a programmatic way to get the URL; see the snippet after this list):

  • If your application is running locally, that means your driver is running locally as well. Just visit localhost:4040 in a browser and you'll see the UI!
  • If your app is running on Kubernetes, you could port-forward the driver pod's port 4040 onto your machine like so: kubectl port-forward <driver-pod> 4040:4040. Then you should be able to visit localhost:4040 on your machine as well.
  • ...
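
If you're not sure which host to use, the driver also exposes the UI's address programmatically; a quick check from the same notebook:

# Prints something like http://<host>:4040 for the running application
print(spark.sparkContext.uiWebUrl)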

Hope this helps!

EDIT: Once you've got the Spark UI running, you can start debugging a bit. This article discusses some of the things you can do.
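
And once the aggregation does complete, writing the (small) result out means a notebook rerun doesn't repeat the full 684 GB scan; a sketch with a hypothetical output path:

# Sketch: persist the aggregated counts instead of only show()-ing them
(Get_col.groupBy('timestamp_day')
    .count()
    .write.mode('overwrite')
    .csv('O:/Corbin/Canvas/timestamp_counts', header=True))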
