
Datalab BigQuery data to Dataproc Hadoop word count

I currently have some reddit data in Google BigQuery that I want to run a word count on, covering all the comments from a selection of subreddits. The query result is about 90 GiB, so it isn't possible to load it into Datalab directly and turn it into a data frame. I've been advised to use a Hadoop or Spark job in Dataproc to produce the word count, and to set up a connector so that Dataproc can read the BigQuery data. How do I run this from Datalab?

Here is example PySpark code for a word count using the public BigQuery Shakespeare dataset:

#!/usr/bin/env python

"""BigQuery I/O PySpark example."""
import sys

from pyspark.sql import SparkSession

spark = SparkSession \
  .builder \
  .master('yarn') \
  .appName('spark-bigquery-demo') \
  .getOrCreate()

# Use a Cloud Storage bucket for the temporary BigQuery export data used
# by the connector. The bucket name is passed as the first script argument.
bucket = sys.argv[1]
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:samples.shakespeare') \
  .load()
words.createOrReplaceTempView('words')

# Perform the word count.
word_count = spark.sql(
    'SELECT word, SUM(word_count) AS word_count FROM words GROUP BY word')
word_count.show()
word_count.printSchema()

# Save the results back to BigQuery.
word_count.write.format('bigquery') \
  .option('table', 'wordcount_dataset.wordcount_output') \
  .save()

You can save the script locally or in a GCS bucket, then submit it to a Dataproc cluster with:

gcloud dataproc jobs submit pyspark --cluster=<cluster> <pyspark-script> -- <bucket>
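
As for running this from Datalab: Datalab notebooks are Jupyter-based, so one approach is to issue the same gcloud command from a notebook cell with the ! shell prefix. A minimal sketch, assuming the script has been uploaded to a GCS bucket and that gcloud inside the Datalab environment is already authorized for your project; the bucket, cluster, and region names below are placeholders:

!gcloud dataproc jobs submit pyspark \
    gs://my-bucket/wordcount.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- my-bucket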

Also see the Dataproc documentation on the Spark BigQuery connector for more information.
