
Spark Notebook: How can I filter rows based on a column value where each column cell is an array of strings?

I have a huge DataFrame where a column "categories" holds various attributes of a business, i.e. whether it is a Restaurant, Laundry Service, Discotheque, etc. What I need is to be able to .filter the DataFrame so that only rows whose categories contain "Restaurants" are shown. The problem is that "categories" is an array of strings, where a cell may look like: "Restaurants, Food, Nightlife". Any ideas? (Scala [2.10.6] Spark [2.0.1] Hadoop [2.7.2])
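For reference, here is a minimal sketch of what such a DataFrame might look like; the column names and sample values are illustrative assumptions, not the actual dataset:

// Hypothetical sample data: "categories" is an array<string> column.
// Assumes a notebook/shell where a SparkSession named `spark` is predefined.
import spark.implicits._

val dfBusiness = Seq(
  ("NV", 4.5, Seq("Restaurants", "Food", "Nightlife")),
  ("AZ", 3.0, Seq("Laundry Service")),
  ("NV", 4.0, Seq("Nightlife", "Discotheque"))
).toDF("state", "stars", "categories")

dfBusiness.printSchema()
// root
//  |-- state: string (nullable = true)
//  |-- stars: double (nullable = false)
//  |-- categories: array (nullable = true)
//  |    |-- element: string (containsNull = true)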

I have tried SQL-style queries like:

val countResult = sqlContext.sql(
  """SELECT business.neighborhood, business.state, business.stars, business.categories
     FROM business WHERE business.categories == Restaurants GROUP BY business.state"""
).collect()
display(countResult)

and

dfBusiness.filter($"categories" == "Restaurants").show()

and

dfBusiness.filter($"categories" == ["Restaurants"]).show() 

I think I might need to iterate over each cell, but I have no idea how to do that.

Any ideas?

The org.apache.spark.sql.functions library is very helpful for processing columns in a DataFrame. In this case, array_contains should provide what you need:

import org.apache.spark.sql.functions.array_contains
dfBusiness.filter(array_contains($"categories", "Restaurants"))

This filters out any rows that don't have a "Restaurants" element in the array in the categories column.
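If you prefer the SQL-style query from the question, array_contains is also available in Spark SQL. A minimal sketch, assuming the DataFrame has been registered as a temporary view named business:

// Assumption: dfBusiness is registered as a temp view called "business".
dfBusiness.createOrReplaceTempView("business")

val restaurants = sqlContext.sql(
  """SELECT neighborhood, state, stars, categories
     FROM business
     WHERE array_contains(categories, 'Restaurants')"""
)
restaurants.show()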
