
Regex in Apache Spark

I have a text file that reads like this:

This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.

I want to extract the text between the words Ingredients and Method.
I have written a regex (?s)(?<=\\bIngredients\\b).*?(?=\\bMethod\\b)
to extract it, and it works fine.
But when I try to do the same in spark-shell, as follows, it doesn't give me
anything.

val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)

Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!

What you need to do is:

  1. Read the file using wholeTextFiles, so that it is not split into lines and each file arrives as a single (path, content) record.
  2. Write a function that takes a string and returns the part matched by that regex. In Python it may look like this:

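import re

def getWhatIneed(s):
    # raw string, so \b is a regex word boundary rather than a backspace
    m = re.search(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)", s)
    return m.group(0) if m else None

b = sc.wholeTextFiles(...)
# each record is a (path, content) pair; apply the function to the content
c = b.map(lambda kv: getWhatIneed(kv[1]))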

Now c is also an RDD. You need to collect it before you print it. The output of collect is a normal list.

print(c.collect())
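
Separately from how the file is read, the pattern in the question's spark-shell snippet uses a single backslash inside an ordinary string literal. In both Scala and Python, "\b" in such a literal is a backspace character, so the regex engine never sees a word boundary. The following is a minimal sketch of that effect in plain Python, with no Spark involved; the names text, broken and fixed and the shortened sample recipe are made up for illustration.

```python
import re

# Shortened stand-in for the recipe file (illustrative data only).
text = "Yum\nIngredients\n1/4 teaspoon salt\n1 large egg\nMethod\n1 Preheat the oven.\n"

# Ordinary literal: every "\b" is turned into a backspace character (U+0008)
# before the regex engine sees the pattern.
broken = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)"

# Raw literal: "\b" reaches the regex engine unchanged and means "word boundary".
fixed = r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)"

print("\x08" in broken)                          # the pattern contains backspaces
print(re.search(broken, text))                   # no match is found
print(re.search(fixed, text).group(0).strip())   # the ingredient lines
```

In Scala the equivalent fix is to double the backslashes ("\\b") or to use a triple-quoted string, which does not process escapes.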
