
Datalab BigQuery data to Dataproc Hadoop word count

I currently have some reddit data in Google BigQuery that I want to run a word count on, covering all the comments from a selection of subreddits. The query result is about 90 GiB, so it isn't possible to load it into Datalab directly and turn it into a data frame. I've been advised to use a Hadoop or Spark job in Dataproc to produce the word count, and to set up a connector so that Dataproc can read the BigQuery data. How do I run this from Datalab?

Here is example PySpark code for a word count using the public BigQuery Shakespeare dataset:

#!/usr/bin/env python

"""BigQuery I/O PySpark example."""
import sys

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .master('yarn') \
  .appName('spark-bigquery-demo') \
  .getOrCreate()

# Use a Cloud Storage bucket for the temporary BigQuery export data used
# by the connector. The bucket name is passed as the first script argument.
bucket = sys.argv[1]
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform the word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

# Save the results back to BigQuery.
word_count.write.format('bigquery') \
  .option('table', 'wordcount_dataset.wordcount_output') \
  .save()

You can save the script locally or in a GCS bucket, then submit it to a Dataproc cluster with:

gcloud dataproc jobs submit pyspark --cluster=<cluster> <pyspark-script> -- <bucket>
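
As for running this from Datalab: Datalab notebooks are Jupyter-based, so one approach is to issue the same gcloud command from a notebook cell with the ! shell prefix. A minimal sketch, assuming the script has been uploaded to a GCS bucket and that gcloud inside the Datalab environment is already authorized for your project; the bucket, cluster, and region names below are placeholders:

!gcloud dataproc jobs submit pyspark \
    gs://my-bucket/wordcount.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- my-bucket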

Also see the Dataproc documentation on the Spark BigQuery connector for more information.
