
Convert a string into four different columns of a DataFrame in Spark

Everyone, I am new to Spark (and to programming, to be honest) and I need some help with the scenario below. My input file contains data in the format Portnumber-UserID “ GET \ ..”, one record per line.

For each user there are two rows of data. Each row is a single string (spaces included) with no proper delimiter.

Example Inputs:

192.167.56.1-45195 “ GET \docsodb.sp.ip \..”
192.167.56.1-45195 “ GET \https://<url> \..”
238.142.23.5-24788 “ GET \docsodb.sp.ip \..”
238.142.23.5-24788 “ GET \ https://<url>  \..”
30.169.77.213-16745 “ GET \docsodb.sp.ip \..”
30.169.77.213-16745 “ GET \ https://<url> \..”

For the above data I need output in the format below, ideally as a DataFrame.

Portnumber      UserID  URL             division_string
192.167.56.1    45195   https://<url>   docsodb.sp.ip
238.142.23.5    24788   https://<url>   docsodb.sp.ip
30.169.77.213   16745   https://<url>   docsodb.sp.ip

Can this be achieved through RDD transformations, or do we have to go with Spark SQL (through SQL queries)? If it can be achieved either way, could you please explain which approach is better?

Let's prepare the data and run spark-shell

cat <<-EOF >in
192.167.56.1-45195 “ GET \docsodb.sp.ip \..”
192.167.56.1-45195 “ GET \https://<url> \..”
238.142.23.5-24788 “ GET \docsodb.sp.ip \..”
238.142.23.5-24788 “ GET \ https://<url>  \..”
30.169.77.213-16745 “ GET \docsodb.sp.ip \..”
30.169.77.213-16745 “ GET \ https://<url> \..”
EOF

spark-shell

Now, inside spark-shell, we'll load the data from the text file into a DataFrame, parse each line with regular-expression capture groups, and finally group by Portnumber and UserId to get both division_string and URL on a single line, all using the DataFrame API.

import spark.implicits._
import org.apache.spark.sql.functions._

// Load data
val df = spark.read.format("text").load("in")

// Regexp to parse input line
val re = """([\d\.]+)-(\d+) “ GET \\ ?([^\s]+)"""

// Transform
df.select(regexp_extract('value, re, 1).as("Portnumber"),
          regexp_extract('value, re, 2).as("UserId"),
          regexp_extract('value, re, 3).as("URL_or_div"))
  .groupBy('Portnumber, 'UserId)
  .agg(max(when('URL_or_div.like("https%"), 'URL_or_div)).as("URL"),
       max(when('URL_or_div.like("docsodb%"), 'URL_or_div)).as("division_string"))
  .show

+-------------+------+-------------+---------------+
|   Portnumber|UserId|          URL|division_string|
+-------------+------+-------------+---------------+
|30.169.77.213| 16745|https://<url>|  docsodb.sp.ip|
| 192.167.56.1| 45195|https://<url>|  docsodb.sp.ip|
| 238.142.23.5| 24788|https://<url>|  docsodb.sp.ip|
+-------------+------+-------------+---------------+
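Since you also asked about Spark SQL: the same transformation can be expressed as a plain SQL query over a temporary view. This is a sketch assuming the same spark-shell session, `in` file, and regular expression as above (the view name `requests` is just an illustrative choice):

```scala
import org.apache.spark.sql.functions.regexp_extract

// Same regexp as above
val re = """([\d\.]+)-(\d+) “ GET \\ ?([^\s]+)"""

// Parse the raw lines once, then register the result as a view
// so the grouping can be done in SQL.
spark.read.format("text").load("in")
  .select(regexp_extract($"value", re, 1).as("Portnumber"),
          regexp_extract($"value", re, 2).as("UserId"),
          regexp_extract($"value", re, 3).as("URL_or_div"))
  .createOrReplaceTempView("requests")

spark.sql("""
  SELECT Portnumber, UserId,
         MAX(CASE WHEN URL_or_div LIKE 'https%'   THEN URL_or_div END) AS URL,
         MAX(CASE WHEN URL_or_div LIKE 'docsodb%' THEN URL_or_div END) AS division_string
  FROM requests
  GROUP BY Portnumber, UserId
""").show
```

Under the hood this produces the same plan as the DataFrame version, so choosing between the two is mostly a matter of taste.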

Answering your last question: the DataFrame API or Spark SQL is preferred over RDD operations, unless you need low-level control over the processing.
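For completeness, the same logic can be written with pure RDD transformations. This is only a sketch, reusing the regular expression from above; note how much of the work (grouping, conditional aggregation) has to be spelled out by hand, and that the result is an RDD of tuples rather than a DataFrame:

```scala
// Same regexp as above, compiled to a Scala Regex
val re = """([\d\.]+)-(\d+) “ GET \\ ?([^\s]+)""".r

val result = spark.sparkContext.textFile("in")
  .flatMap { line =>
    // Keep only lines the regexp matches; key by (Portnumber, UserId)
    re.findFirstMatchIn(line).map { m =>
      ((m.group(1), m.group(2)), m.group(3))
    }
  }
  .groupByKey()
  .mapValues { toks =>
    // Pick the URL and the division string out of the grouped tokens
    val url = toks.find(_.startsWith("https")).getOrElse("")
    val div = toks.find(_.startsWith("docsodb")).getOrElse("")
    (url, div)
  }

result.collect.foreach { case ((port, user), (url, div)) =>
  println(s"$port\t$user\t$url\t$div")
}
```

Besides being more verbose, this version gives the optimizer nothing to work with (RDD lambdas are opaque to Catalyst), which is the main reason the DataFrame/SQL route is preferred.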
