Newest questions tagged [amazon-emr]

Is Spark good for automatically running a statistical analysis script on many nodes for a speedup?

I have a Python script that runs statistical analysis and trains deep learning models on input data. The data size is fairly small (~5Mb); however, the ...

Using the S3A client in EMR Serverless

I am using an S3-compatible object store (Cloudflare R2) and trying to get EMR Serverless to connect to it. R2 requires that you use the correct endpoi ...

Why do we need HDFS on EMR when we have S3?

At our company, we use AWS services for all our data infrastructure and service needs. Our Hive tables are external tables and the actual data files ar ...

Does saving a Spark DataFrame to an S3 path write data to an EBS volume first?

I am curious to know what happens behind the scenes when writing a Spark DataFrame as a Parquet file to an S3 location. Does it first store it locally on the loc ...

How to get the EMR cluster version from a running cluster

I have several accounts and they run different versions of EMR. I need to run a query to figure out what version (list-release-labels) they are runnin ...

EMR 6.7 configuration in EMR 6.9 gives error Classification 'spark-log4j' is not valid for parent classification 'null'

I was using EMR 6.7 with a software configuration, but for some reason, when I shifted to EMR 6.9, the website started throwing the error Cla ...

AWS EMR cluster - scale up didn't update dfs.replication value from 1 to 2

I provisioned an AWS EMR HBase cluster with 1 master and 1 core node (m5.xlarge). My cluster doesn't have any 'task' node as I plan to use this cluste ...

In an AWS EMR job flow, does each step receive the output from the previous step?

I am writing a MapReduce program in Java that has 4 steps. Each step operates on the output of the previous step. I ran those steps locally and m ...

AWS EMR CLI add-step with multiple files

I have an EMR environment that runs fine when I submit single Python (PySpark) files from a local shell script (myProgram.py was already copied up to ...

PySpark `monotonically_increasing_id()` returns 0 for each row

This code creates and prints a data frame where each id has the value 0. I am really confused, as this is the monotonically_increasing_id method descriptio ...

AWS EMR Jupyter error 403 Forbidden (Workspace is not attached to cluster)

I have a simple notebook in EMR. I have no running clusters. From the notebook open page itself I request a new cluster so my expectation is that all ...

Great Expectations installation on AWS EMR

I tried to use Great Expectations for data quality purposes. I am running my jobs in an AWS EMR cluster and I am trying to launch a Great Expectations job o ...

How to make sure that spark.write.parquet() writes the data frame to the file specified using a relative path and not to HDFS, on EMR?

My problem is as follows: a PySpark script that runs perfectly on a local machine and an EC2 instance is ported to EMR for scaling up. There's a config fi ...

How to read Postgres DB tables through an EMR JupyterLab notebook from an Amazon workspace

I'm trying to read a table from Postgres, but I'm facing the error below. Note: I am not able to refer to external files from local since it is a ...

SparkException: Exception thrown in awaitResult for EMR

I tried running my Spark application from EMR, which right now is just the pi calculation in the tutorial doc: https://docs.aws.amazon.com/emr/latest/ ...

How to encrypt Apache Hudi external table data in S3 that is synced into Hive tables through Spark jobs

Technical background: I am getting table data from Kafka and putting it into Hudi and Hive tables using Spark. I am using AWS EMR. I want to encrypt ...

Trying to run a Spark job on Linux to parse a large amount of tab-delimited data and index on column six. Getting permission denied

static void Main(string[] args) { DataTable datatable = new DataTable(); StreamReader streamreader = new StreamReader(@"/data/1/projects/data1 ...

What is the meaning of executionRoleArn in boto3 API for EMR Serverless?

According to the API for the function start_job_run, I need to give an executionRoleArn - what is this? I thought it is the name of the IAM role I created ...

How to run an existing EMR Serverless job with boto3?

From the boto3 docs for start_job_run, it seems like I have to create a job run every time I want to trigger a job. Does it really have to work that way? ...

How can I reuse a Spark SQL view/table across multiple AWS EMR steps?

I am submitting multiple steps (concurrency - 1) to an AWS EMR cluster with the command 'spark-submit --deploy-mode client --master yarn <>' one after ...