Regular expressions in Apache Spark

Question

I have a text file that reads like this:

This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.

I want to extract the data between the words Ingredients and Method.
I have written a regex (?s)(?<=\\bIngredients\\b).*?(?=\\bMethod\\b)
to extract the data, and it works fine.
But when I try to do that in spark-shell as follows, it doesn't give me
anything.

val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)

Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!

Answer 1

What you need to do is:

Read the file using wholeTextFiles (so the file is not split into lines and you read its entire contents as one string).
Write a function that takes a string and returns a string using that regex. It may look like this (in Python):

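import re

# Lazy match of the text between the words Ingredients and Method.
pattern = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

def getWhatIneed(s):
    match = pattern.search(s)
    return match.group(0) if match else None  # None when a marker is missing

# wholeTextFiles yields (path, content) pairs, so map over the contents only.
b = sc.wholeTextFiles(...)
c = b.values().map(getWhatIneed)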
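
To see why reading the whole file at once matters, here is a minimal sketch in plain Python, without Spark. The `text` value is a shortened stand-in for the recipe file and the variable names are made up for illustration. A line-oriented read hands the regex one line at a time, so no single line ever contains both Ingredients and Method, and the lookbehind and lookahead can never both succeed.

```python
import re

# Shortened stand-in for the recipe file (an assumption, not the full file).
text = """Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1 large egg
Method
1 Preheat the oven."""

pattern = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

# What a line-oriented read sees: each line is matched on its own.
per_line = [m for line in text.splitlines() for m in pattern.findall(line)]

# What a whole-file read sees: one string containing both markers.
whole = pattern.findall(text)

print(per_line)
print(whole)
```

The per-line list comes back empty, while the whole-text search returns the ingredient lines as one string, including the newlines around them.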

Now, c is also an RDD. You need to collect it before you print it. The output of collect is a normal array/list.

print(c.collect())
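
One more thing worth checking when you write the pattern as a string literal: in a normal Python string, `\b` is the backspace character, not the regex word boundary. Use a raw string or double the backslashes. As far as I know, ordinary Scala string literals treat `\b` the same way, so this may also affect the spark-shell snippet in the question. A small sketch, with variable names chosen only for illustration:

```python
import re

plain = "\bMethod\b"   # "\b" here is the backspace character (0x08)
raw = r"\bMethod\b"    # "\b" here reaches the regex engine as a word boundary

# The plain literal is shorter because each "\b" collapsed to one character.
print(len(plain), len(raw))

# The backspace version cannot match ordinary text; the raw version can.
print(re.search(plain, "Method") is None)
print(re.search(raw, "Method") is not None)
```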

Solution 1
1 2015-06-03 06:02:50