
Regex in Apache Spark

I have a text file that reads like this:

This recipe can be made either with a stand mixer, or by hand with a bowl, a
wooden spoon, and strong arms. If you use salted butter, please omit the
added salt in this recipe.
Yum
Ingredients
1 1/4 cups all-purpose flour (160 g)
1/4 teaspoon salt
1/2 teaspoon baking powder
1/2 cup unsalted butter (1 stick, or 8 Tbsp, or 112g) at room temperature
1/2 cup white sugar (90 g)
1/2 cup dark brown sugar, packed (85 g)
1 large egg
1 teaspoon vanilla extract
1/2 teaspoon instant coffee granules or instant espresso powder
1/2 cup chopped macadamia nuts (3 1/2 ounces, or 100 g)
1/2 cup white chocolate chips
Method
1 Preheat the oven to 350°F (175°C). Vigorously whisk together the flour,
and baking powder in a bowl and set aside.

I want to extract the text between the words Ingredients and Method.
I have written the regex (?s)(?<=\\bIngredients\\b).*?(?=\\bMethod\\b)
to extract it, and it works fine.
But when I try to do the same in spark-shell as follows, it returns
nothing.

val b = sc.textFile("/home/akshat/file.txt")
val regex = "(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)".r
regex.findAllIn(b).foreach(println)

Please tell me where I am going wrong and what steps I should take to
correct this.
Thanks in advance!

What you need to do is:

  1. Read the file using wholeTextFiles, so that it is not split into lines and the entire content is read as one string. Note that this returns an RDD of (path, content) pairs.
  2. Write a function that takes a string and returns the part matched by that regex. It may look like this (in Python):


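import re

def getWhatIneed(s):
    # s is the full content of one file as a single string
    match = re.search(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)", s)
    return match.group(0) if match else ""

b = sc.wholeTextFiles(...)
# wholeTextFiles yields (path, content) pairs; apply the function to the content
c = b.map(lambda kv: getWhatIneed(kv[1]))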

Now, c is also an RDD. You need to collect it before you print it. The output of collect is a normal array/list.

print(c.collect())
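
To see why reading the whole file matters, here is a minimal sketch in plain Python, without Spark. The `sample` text is a shortened, hypothetical stand-in for the recipe file, and `extract_ingredients` is an illustrative name, not something from the question. It applies the question's pattern once per line (roughly what an RDD from textFile would give you) and once to the whole text (the value part of what wholeTextFiles gives you), and it also shows what happens when the backslashes in `\b` are not preserved.

```python
import re

# The pattern from the question, written as a raw string so \b stays a word boundary.
PATTERN = re.compile(r"(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)")

def extract_ingredients(text):
    """Return the text between 'Ingredients' and 'Method', or '' if not found."""
    match = PATTERN.search(text)
    return match.group(0).strip() if match else ""

# Hypothetical, shortened stand-in for the recipe file.
sample = "\n".join([
    "Yum",
    "Ingredients",
    "1/4 teaspoon salt",
    "1 large egg",
    "Method",
    "1 Preheat the oven to 350°F (175°C).",
])

# Line by line: no single line contains both markers, so nothing matches.
per_line = [extract_ingredients(line) for line in sample.splitlines()]

# Whole file as one string: the pattern can span the line breaks.
whole = extract_ingredients(sample)

# In a non-raw string literal, "\b" is a backspace character, not a word boundary,
# so this pattern looks for literal backspaces and finds nothing.
unescaped = re.search("(?s)(?<=\bIngredients\b).*?(?=\bMethod\b)", sample)

print(per_line)   # a list of empty strings, one per line
print(whole)      # the two ingredient lines
print(unescaped)  # None
```

The last case is likely relevant to the spark-shell snippet in the question as well: in an ordinary double-quoted Scala string, `\b` is also a backspace escape, so the pattern would need `\\b` or a triple-quoted string for the word boundaries to reach the regex engine.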

