
groupByKey in Spark

New to Spark here, and I'm trying to read a pipe-delimited file in Spark. My file looks like this:

user1|acct01|A|Fairfax|VA
user1|acct02|B|Gettysburg|PA
user1|acct03|C|York|PA
user2|acct21|A|Reston|VA
user2|acct42|C|Fairfax|VA
user3|acct66|A|Reston|VA

and I do the following in Scala:

scala> case class Accounts (usr: String, acct: String, prodCd: String, city: String, state: String)
defined class Accounts

scala> val accts = sc.textFile("accts.csv").map(_.split("|")).map(
     | a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4)))
     | )

I then try to group the key-value pairs by key, but I'm not sure if I'm doing this right... is this how I do it?

scala> accts.groupByKey(2)
res0: org.apache.spark.rdd.RDD[(String, Iterable[Accounts])] = ShuffledRDD[4] at groupByKey at <console>:26

I thought the (2) was supposed to give me the first two results back, but I don't seem to get anything back at the console...

If I run a distinct... I get this too:

scala> accts.distinct(1).collect(1)
<console>:26: error: type mismatch;
 found   : Int(1)
 required: PartialFunction[(String, Accounts),?]
              accts.distinct(1).collect(1)

EDIT: Essentially I'm trying to get to a nested key-value mapping. For example, user1 would look like this:

user1 | {'acct01': {prdCd: 'A', city: 'Fairfax', state: 'VA'}, 'acct02': {prdCd: 'B', city: 'Gettysburg', state: 'PA'}, 'acct03': {prdCd: 'C', city: 'York', state: 'PA'}}

I'm trying to learn this step by step, so I thought I'd break it down into chunks to understand...

Since you've already gone through the process of defining a schema, I think you might have better luck if you put your data into a DataFrame. First off, you need to modify the split call to use single quotes (see this question). Also, you can get rid of the a(0) key at the beginning. Then, converting to a DataFrame is trivial. (Note that DataFrames are available in Spark 1.3+.)

val accts = sc.textFile("/tmp/accts.csv").map(_.split('|')).map(a => Accounts(a(0), a(1), a(2), a(3), a(4)))
val df = accts.toDF()

Now df.show produces:

+-----+------+------+----------+-----+
|  usr|  acct|prodCd|      city|state|
+-----+------+------+----------+-----+
|user1|acct01|     A|   Fairfax|   VA|
|user1|acct02|     B|Gettysburg|   PA|
|user1|acct03|     C|      York|   PA|
|user2|acct21|     A|    Reston|   VA|
|user2|acct42|     C|   Fairfax|   VA|
|user3|acct66|     A|    Reston|   VA|
+-----+------+------+----------+-----+

It should be easier for you to work with the data. For example, to get a list of the unique users:

df.select("usr").distinct.collect()

produces

res42: Array[org.apache.spark.sql.Row] = Array([user1], [user2], [user3])
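
Since your end goal is a per-user grouping, here is a minimal sketch of how you could keep going with the DataFrame API (the groupBy/count call below is just one illustration of grouping by user, not the nested structure from your edit):

// Count how many accounts each user has; groupBy and count are part of
// the DataFrame API in Spark 1.3+.
df.groupBy("usr").count().show()
// From the sample data above, this should report user1 -> 3, user2 -> 2, user3 -> 1.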

For more details, check out the docs.

3 observations that may help you understand the problem:

1) groupByKey(2) does not return the first 2 results; the parameter 2 is used as the number of partitions for the resulting RDD. Also, groupByKey is a lazy transformation, so nothing is printed until you run an action such as collect or take. See the docs.

2) collect does not take an Int parameter; use take(n) if you only want the first n elements. See the docs.

3) split takes 2 types of parameters, Char or String. The String version uses a regex, so "|" needs escaping (or use the Char overload '|') if it is intended as a literal, as in the sketch below.
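
Putting those observations together, here is a minimal sketch of the RDD route you started with (reusing your Accounts case class; the Map built in mapValues is just one way to get the nested per-account structure from your edit):

// Split on a literal '|' (the Char overload, no regex involved),
// key each record by user, then group and build a nested map keyed by account id.
val accts = sc.textFile("accts.csv")
  .map(_.split('|'))
  .map(a => (a(0), Accounts(a(0), a(1), a(2), a(3), a(4))))

val byUser = accts.groupByKey()   // RDD[(String, Iterable[Accounts])]
  .mapValues(_.map(acc => acc.acct -> (acc.prodCd, acc.city, acc.state)).toMap)

// groupByKey is lazy, so use an action to see results; take(n) returns the first n elements.
byUser.take(2).foreach(println)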
