
Spark job on Google Cloud Dataproc fails in the last stages

I work with a Spark cluster on Dataproc and my job fails at the end of processing.

My data source is text log files in CSV format on Google Cloud Storage (total volume 3.5TB, 5000 files).

The processing logic is the following:

  • read the files into a DataFrame (schema ["timestamp", "message"]);
  • group all messages into windows of 1 second;
  • apply a pipeline [Tokenizer -> HashingTF] to every grouped message to extract words and their frequencies and build feature vectors;
  • save the feature vectors together with their window timestamps to GCS.

The issue I'm having is that on a small subset of the data (like 10 files) processing works well, but when I run it on all the files it fails at the very end with an error like "Container killed by YARN for exceeding memory limits. 25.0 GB of 24 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead."

My cluster has 25 workers with n1-highmem-8 machines. I googled this error and increased the "spark.yarn.executor.memoryOverhead" parameter to 6500MB.
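(For context, a minimal sketch of how such a property can be applied; spark.yarn.executor.memoryOverhead has to be known before executors are requested, so it is usually passed at submit time or set on the SparkConf before the context is created. The value below is just the one from my attempt.)

import pyspark

# Sketch only: set the overhead before the SparkContext exists, or pass it via
# `--properties=spark.yarn.executor.memoryOverhead=6500` when submitting the
# Dataproc job.
conf = pyspark.SparkConf().set("spark.yarn.executor.memoryOverhead", "6500")
sc = pyspark.SparkContext(conf=conf)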

Now my Spark job still fails, but with the error "Job aborted due to stage failure: Total size of serialized results of 4293 tasks (1920.0 MB) is bigger than spark.driver.maxResultSize (1920.0 MB)".

I'm new to Spark and I believe that I'm doing something wrong either at the configuration level or in my code. If you can help me clean these things up, that would be great!

Here is my code for the Spark task:

import logging
import string
from datetime import datetime

import pyspark
import re
from pyspark.sql import SparkSession

from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml import Pipeline

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, TimestampType, ArrayType
from pyspark.sql import functions as F

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Constants
NOW = datetime.now().strftime("%Y%m%d%H%M%S")
START_DATE = '2016-01-01'
END_DATE = '2016-03-01'

sc = pyspark.SparkContext()
spark = SparkSession\
        .builder\
        .appName("LogsVectorizer")\
        .getOrCreate()
spark.conf.set('spark.sql.shuffle.partitions', 10000)

logger.info("Start log processing at {}...".format(NOW))

# Filenames to read/write locations
logs_fn = 'gs://databucket/csv/*'  
vectors_fn = 'gs://databucket/vectors_out_{}'.format(NOW)  
pipeline_fn = 'gs://databucket/pipeline_vectors_out_{}'.format(NOW)
model_fn = 'gs://databucket/model_vectors_out_{}'.format(NOW)


# CSV data schema to build DataFrame
schema = StructType([
    StructField("timestamp", StringType()),
    StructField("message", StringType())])

# Helpers to clean strings in log fields
def cleaning_string(s):
    try:
        # Replace bracketed ids with a tag (like: app[2352] -> appIDTAG)
        s = re.sub(r'\[.*\]', 'IDTAG', s)
        if s == '':
            s = 'EMPTY'
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def normalize_string(s):
    try:
        # Remove punctuation
        s = re.sub('[{}]'.format(re.escape(string.punctuation)), ' ', s)
        # Remove digits
        s = re.sub(r'\d*', '', s)
        # Remove extra spaces
        s = ' '.join(s.split())
    except Exception as e:
        print("Skip string with exception {}".format(e))
    return s

def line_splitter(line):
    line = line.split(',')
    timestamp = line[0]
    full_message = ' '.join(line[1:])
    full_message = normalize_string(cleaning_string(full_message))
    return [timestamp, full_message]

# Read line from csv, split to date|message
# Read CSV to DataFrame and clean its fields
logger.info("Read CSV to DF...")
logs_csv = sc.textFile(logs_fn)
logs_csv = logs_csv.map(lambda line: line_splitter(line)).toDF(schema)

# Keep only lines for our date interval
logger.info("Filter by dates...")
logs_csv = logs_csv.filter((logs_csv.timestamp>START_DATE) & (logs_csv.timestamp<END_DATE))
logs_csv = logs_csv.withColumn("timestamp", logs_csv.timestamp.cast("timestamp"))

# Helpers to join messages within a window and convert sparse vectors to dense lists
join_ = F.udf(lambda x: "| ".join(x), StringType())
asDense = F.udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

# Agg by time window
logger.info("Group log messages by time window...")
logs_csv = logs_csv.groupBy(F.window("timestamp", "1 second"))\
                       .agg(join_(F.collect_list("message")).alias("messages"))

# Turn message to hashTF
tokenizer = Tokenizer(inputCol="messages", outputCol="message_tokens")
hashingTF = HashingTF(inputCol="message_tokens", outputCol="tokens_counts", numFeatures=1000)

pipeline_tf = Pipeline(stages=[tokenizer, hashingTF])

logger.info("Fit-Transform ML Pipeline...")
model_tf = pipeline_tf.fit(logs_csv)
logs_csv = model_tf.transform(logs_csv)

logger.info("Spase vectors to Dense list...")
logs_csv = logs_csv.sort("window.start").select(["window.start", "tokens_counts"])\
                   .withColumn("tokens_counts", asDense(logs_csv.tokens_counts))

# Save to disk
# Save Pipeline and Model
logger.info("Save models...")
pipeline_tf.save(pipeline_fn)
model_tf.save(model_fn)

# Save to GCS
logger.info("Save results to GCS...")
logs_csv.write.parquet(vectors_fn)

spark.driver.maxResultSize is a limit tied to the memory of your driver, which in Dataproc runs on the master node.

By default, 1/4 of the master's memory is given to the driver, and 1/2 of that is set as spark.driver.maxResultSize (the largest total size of serialized results Spark will let you .collect()).
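A quick back-of-the-envelope check of those defaults lines up with the 1920 MB limit in your error, assuming (hypothetically) a master with about 15 GB of RAM such as an n1-standard-4:

# Back-of-the-envelope check of the Dataproc defaults (hypothetical master size).
master_memory_mb = 15 * 1024                # e.g. an n1-standard-4 master (assumption)
driver_memory_mb = master_memory_mb // 4    # ~1/4 of the master goes to the driver
max_result_size_mb = driver_memory_mb // 2  # ~1/2 of that becomes spark.driver.maxResultSize
print(driver_memory_mb, max_result_size_mb)  # 3840 1920 -- matches the 1920.0 MB in the error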

I'm guessing that Tokenizer or HashingTF is moving "metadata" the size of your keyspace through the driver. To increase the allowable size you can raise spark.driver.maxResultSize, but you might also want to increase spark.driver.memory and/or use a larger master. Spark's configuration guide has more information.
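A minimal sketch of what bumping those settings might look like (the values are placeholders, not tested recommendations). Note that spark.driver.memory generally has to be set at submit time, e.g. via `gcloud dataproc jobs submit pyspark --properties=spark.driver.memory=10g,...`, because the driver JVM is already running by the time your script executes; spark.driver.maxResultSize can also be set when the session is built:

from pyspark.sql import SparkSession

# Sketch only -- placeholder value, raise it to fit your serialized results.
spark = SparkSession.builder \
    .appName("LogsVectorizer") \
    .config("spark.driver.maxResultSize", "8g") \
    .getOrCreate()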
