I just started with Scala two days ago.
I have a DataFrame and a list. The DataFrame contains two columns, paragraphs and authors, and the list contains words (strings). For each word in the list, I need the number of paragraphs that contain that word, grouped by author.
So far my idea was to loop over the list, query the DataFrame with rlike for each word, and build a new DataFrame from the results, but even if that would work, I don't know how to write it. Any help is appreciated!
Edit: Adding example data and expected output
// Example df and list
val df = Seq(("auth1", "some text word1"), ("auth2", "some text word2"), ("auth3", "more text word1")).toDF("a", "t")
df.show
+-------+---------------+
|      a|              t|
+-------+---------------+
|auth1  |some text word1|
|auth2  |some text word2|
|auth1  |more text word1|
+-------+---------------+
val list = List("word1", "word2")
// Expected output
newDF.show
+-------+-----+----------+
|   word|    a|text count|
+-------+-----+----------+
|word1  |auth1|         2|
|word2  |auth2|         1|
+-------+-----+----------+
You can filter and aggregate once for each word in the list, then combine the resulting DataFrames with unionAll:
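import org.apache.spark.sql.functions.{count, lit}

// For each word: keep rows whose text contains it as a whole word,
// count the matches per author, then union the per-word results.
val result = list.map(word =>
  df.filter(df("t").rlike(s"\\b${word}\\b"))
    .groupBy("a")
    .agg(lit(word).as("word"), count(lit(1)).as("text count"))
).reduce(_ unionAll _)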
result.show
+-----+-----+----------+
|    a| word|text count|
+-----+-----+----------+
|auth3|word1|         1|
|auth1|word1|         1|
|auth2|word2|         1|
+-----+-----+----------+
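If the list grows, running one filter and aggregation per word means one pass over the data per word. A possible alternative, sketched below under some assumptions, attaches every word to every row with explode and then filters and groups once. Note that this is a different technique from the one above: it splits the text on whitespace and looks for an exact token match instead of using rlike with word boundaries, so a word followed directly by punctuation would not match. The local SparkSession setup and the app name are only there to make the sketch self-contained; in spark-shell you would skip them.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, array_contains, col, count, explode, lit, split}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("word-counts").getOrCreate()
import spark.implicits._

// Same example data as in the question's code.
val df = Seq(
  ("auth1", "some text word1"),
  ("auth2", "some text word2"),
  ("auth3", "more text word1")
).toDF("a", "t")
val list = List("word1", "word2")

val result = df
  // One output row per (input row, word) pair.
  .withColumn("word", explode(array(list.map(lit): _*)))
  // Keep pairs where the word is one of the whitespace-separated tokens of the text.
  .filter(array_contains(split(col("t"), "\\s+"), col("word")))
  // Count matching paragraphs per word and author.
  .groupBy("word", "a")
  .agg(count(lit(1)).as("text count"))

result.show()
```

The columns come out in the order word, a, text count, which matches the layout of the expected output. Whether this is actually faster than the union of per-word results depends on the data size and the number of words, so it is worth measuring on your own data.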