Everyone, I am new to Spark (and to programming, to be honest) and I need some help with the scenario below. My input file has data in the following format:
Portnumber-UserID “ GET \ ..”
Portnumber-UserID “ GET \ ..”
For each user we will have two rows of data. Each row contains a single string (spaces included) with no proper delimiter.
Example Inputs:
192.167.56.1-45195 “ GET \docsodb.sp.ip \..”
192.167.56.1-45195 “ GET \https://<url> \..”
238.142.23.5-24788 “ GET \docsodb.sp.ip \..”
238.142.23.5-24788 “ GET \ https://<url> \..”
30.169.77.213-16745 “ GET \docsodb.sp.ip \..”
30.169.77.213-16745 “ GET \ https://<url> \..”
For the above data I would require output in the format below, probably as a DataFrame.
Portnumber UserID URL division_string
192.167.56.1 45195 https://<url> docsodb.sp.ip
238.142.23.5 24788 https://<url> docsodb.sp.ip
30.169.77.213 16745 https://<url> docsodb.sp.ip
Can we achieve this through RDD transformations, or do we have to go with Spark SQL (through SQL queries)? Also, if this can be achieved either way, could you please explain which one is the better approach?
Let's prepare the data and run spark-shell:
cat <<-EOF >in
192.167.56.1-45195 “ GET \docsodb.sp.ip \..”
192.167.56.1-45195 “ GET \https://<url> \..”
238.142.23.5-24788 “ GET \docsodb.sp.ip \..”
238.142.23.5-24788 “ GET \ https://<url> \..”
30.169.77.213-16745 “ GET \docsodb.sp.ip \..”
30.169.77.213-16745 “ GET \ https://<url> \..”
EOF
spark-shell
Now inside spark-shell we'll load the data from the text file into a DataFrame, parse each line with regexp capture groups, and finally group by Portnumber and UserId to get both division_string and URL on a single line, all using the DataFrame API.
import spark.implicits._
import org.apache.spark.sql.functions._  // regexp_extract, max, when
// Load data
val df = spark.read.format("text").load("in")
// Regexp to parse input line
val re = """([\d\.]+)-(\d+) “ GET \\ ?([^\s]+)"""
// Transform
df.select(regexp_extract('value, re, 1).as("Portnumber"),
          regexp_extract('value, re, 2).as("UserId"),
          regexp_extract('value, re, 3).as("URL_or_div"))
  .groupBy('Portnumber, 'UserId)
  .agg(max(when('URL_or_div.like("https%"), 'URL_or_div)).as("URL"),
       max(when('URL_or_div.like("docsodb%"), 'URL_or_div)).as("division_string"))
  .show
+-------------+------+-------------+---------------+
| Portnumber|UserId| URL|division_string|
+-------------+------+-------------+---------------+
|30.169.77.213| 16745|https://<url>| docsodb.sp.ip|
| 192.167.56.1| 45195|https://<url>| docsodb.sp.ip|
| 238.142.23.5| 24788|https://<url>| docsodb.sp.ip|
+-------------+------+-------------+---------------+
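To answer the Spark SQL part of the question: the same aggregation can also be expressed as a plain SQL query over a temporary view. A minimal sketch, reusing the `df` and `re` values defined above:

```scala
// Register the parsed columns as a view, then aggregate with SQL.
df.select(regexp_extract('value, re, 1).as("Portnumber"),
          regexp_extract('value, re, 2).as("UserId"),
          regexp_extract('value, re, 3).as("URL_or_div"))
  .createOrReplaceTempView("logs")

spark.sql("""
  SELECT Portnumber, UserId,
         MAX(CASE WHEN URL_or_div LIKE 'https%'   THEN URL_or_div END) AS URL,
         MAX(CASE WHEN URL_or_div LIKE 'docsodb%' THEN URL_or_div END) AS division_string
  FROM logs
  GROUP BY Portnumber, UserId
""").show
```

Both forms compile to the same Catalyst plan, so the choice between DataFrame API and SQL strings is a matter of style, not performance.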
Answering your last question: the DataFrame API or Spark SQL is preferred over RDD operations, unless you need low-level control over the processing. See here for more info.
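For completeness, the same result can be obtained with RDD transformations. A sketch only (reusing the `re` pattern string from above and assuming the same input file "in"):

```scala
// RDD route: parse each line with the same regex, key by (Portnumber, UserId),
// then collapse the two rows per user into one record.
val pattern = re.r

val rowsRdd = spark.sparkContext.textFile("in")
  .flatMap { line =>
    pattern.findFirstMatchIn(line).map(m => ((m.group(1), m.group(2)), m.group(3)))
  }
  .groupByKey()  // fine here: at most two values per key
  .map { case ((port, user), vals) =>
    (port, user,
     vals.find(_.startsWith("https")).getOrElse(""),
     vals.find(_.startsWith("docsodb")).getOrElse(""))
  }

rowsRdd.toDF("Portnumber", "UserId", "URL", "division_string").show
```

This works, but you lose Catalyst optimization and have to hand-code the pivot logic that `when`/`max` expressed declaratively, which is why the DataFrame version is preferable here.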