I have a text file that reads like this:
This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.
I want to extract the data between the words Ingredients and Method.
I have written a regex (?s)(?<=\\bIngredients\\b).*?(?=\\bMethod\\b)
to extract the data and it's working fine.
But when I try to do that in spark-shell as follows, it doesn't give me
anything.
val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)
Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!
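sc.textFile returns an RDD with one record per line, so a regex that spans several lines never sees the whole text, and findAllIn expects a string, not an RDD. What you need to do is read each file as a single record with wholeTextFiles and apply the regex inside a function that you map over the RDD: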
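import re

def getWhatIneed(pair):
    # wholeTextFiles yields (path, content) pairs; apply the regex to the content
    m = re.search(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)", pair[1])
    return m.group(0) if m else None

b = sc.wholeTextFiles(...)
c = b.map(getWhatIneed)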
Now c is also an RDD. You need to collect it before you print it; the output of collect is a normal array/list.
print(c.collect())
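To see why matching per line finds nothing while matching the whole content works, here is a minimal sketch in plain Python with no Spark involved. The names PATTERN and extract and the shortened sample text are assumptions made for illustration; the regex is the one from the question, written as a raw string so that \b stays a word boundary instead of turning into a backspace character.

```python
import re

# Raw string: \b is a word boundary here. In a normal string literal,
# "\b" would be a backspace character and the pattern would never match.
PATTERN = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

# Shortened, hypothetical stand-in for the contents of the recipe file.
text = "Yum\nIngredients\n1 large egg\n1/4 teaspoon salt\nMethod\n1 Preheat the oven.\n"


def extract(content):
    """Return the text between Ingredients and Method, or None if absent."""
    match = PATTERN.search(content)
    return match.group(0).strip() if match else None


# One record per line, which is the shape an RDD from textFile has:
# no single line contains both markers, so nothing matches.
per_line = [extract(line) for line in text.splitlines()]

# The whole file as one string, which is the value wholeTextFiles pairs
# with each path: the regex can now span the lines in between.
whole = extract(text)

print(per_line)
print(whole)
```

The same reasoning applies in spark-shell: the pattern only has a chance once it is applied to the full file content rather than to individual lines.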