Data Extraction using Regex

I have data in a text file, "file.txt":

Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,

I want to extract the data between Ingredients and Preparation.
I wrote the following regex for it:

(?s).*?Ingredients(.*?)Preparation.*

But it extracts the data between the Ingredients in italics on the 3rd line of
file.txt and Preparation, rather than the data between the Ingredients that heads the ingredient list and Preparation.
What changes should I make to my regex to resolve this problem?
Thanks in advance!

You can keep the lazy quantifier .*? for the second .* (the capture group) and leave the first .* greedy, so that the match starts at the last Ingredients before Preparation:

(?s).*\bIngredients\b(.*?)\bPreparation\b
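
As a rough illustration, here is how that pattern might be used from Python's `re` module. The language is an assumption (the question does not name one), and `text` is a shortened, hypothetical stand-in for file.txt:

```python
import re

# Hypothetical, shortened stand-in for file.txt: "Ingredients" appears twice.
text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# The greedy first .* runs to the end of the input and backtracks to the last
# "Ingredients" that is still followed by "Preparation"; (.*?) captures lazily.
pattern = re.compile(r"(?s).*\bIngredients\b(.*?)\bPreparation\b")

match = pattern.search(text)
ingredients = match.group(1).strip().splitlines() if match else []
print(ingredients)
```

With this sample, group 1 should hold only the two ingredient lines.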

Or you can use a tempered greedy token, in which case you do not need the first .*:

(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b
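
A minimal Python sketch of the tempered greedy token follows. Two things here are assumptions rather than part of the answer: the choice of Python, and the extra capture group wrapped around the tempered part so the ingredient lines can be read back. The sample `text` is hypothetical.

```python
import re

# Hypothetical, shortened stand-in for file.txt: "Ingredients" appears twice.
text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# (?:(?!...).)* consumes one character at a time, but only while neither
# keyword starts at the current position. The outer (...) is an added
# capture group (an assumption, not in the pattern above).
pattern = re.compile(
    r"(?s)\bIngredients\b((?:(?!\b(?:Ingredients|Preparation)\b).)*)\bPreparation\b"
)

match = pattern.search(text)
ingredients = match.group(1).strip().splitlines() if match else []
print(ingredients)
```

An attempt starting at the first Ingredients fails as soon as the second one is reached, so the overall match begins at the second Ingredients.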

(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*

[*]{2} tells the regex engine to match one of the characters in the class (here only *) exactly twice ({2}).

I prefer using character classes to escaping; I find them more readable than this:

(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*

and depending on the language you're using, you may have to escape the backslash too.
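
To make the escaping point concrete, here is a small Python sketch (Python is an assumption; other languages have different string-literal rules). It also assumes that the headings of interest are wrapped in literal `**` markers in the file. It compares the character-class form, the escaped form in a raw string, and the escaped form in a plain string where each backslash has to be doubled:

```python
import re

# Assumption: the wanted headings carry literal "**" markers in the file.
text = (
    "Ingredients\n"
    "Lentil and Brown Rice Soup\n"
    "**Ingredients**\n"
    "5 cups chicken broth\n"
    "**Preparation**\n"
    "In a heavy kettle combine the broth\n"
)

char_class = r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*"
escaped_raw = r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*"
# Without a raw string, every backslash in the pattern is written twice.
escaped_plain = "(?s).*?\\*{2}Ingredients\\*{2}(.*?)\\*{2}Preparation\\*{2}.*"

results = [
    re.match(p, text).group(1).strip()
    for p in (char_class, escaped_raw, escaped_plain)
]
print(results)
```

All three patterns should capture the same text; the two escaped variants are the same string once Python has processed the literals.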

You can use a lookahead that checks that each line is not Ingredients. In this way you limit the number of tests to only the start of lines (instead of testing at each character):

(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$ 

pattern details:

(?m)             # switch on the multiline mode (^ and $ match the limit of the line)
^Ingredients\R   # "Ingredients" at the start of the line followed by a new line
(   # capture group 1
    (?:          # open a non-capturing group
        (?!Ingredients$) # negative lookahead to check that the line is not "Ingredients"
        .*\R             # the line
    )+? # repeat until "Preparation"
)
Preparation$

Note: since you didn't say which regex engine you use, it is possible that \R is not supported. In this case, replace it with \r?\n.
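
Python's built-in `re` module is one engine without \R, so a sketch there has to use the \r?\n replacement. The choice of Python is an assumption, and the sample `text` is a hypothetical, shortened stand-in for file.txt that uses plain \n line endings:

```python
import re

# Hypothetical, shortened stand-in for file.txt with "\n" line endings.
text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Same pattern as above with \R replaced by \r?\n. The lookahead is tested
# once per line, at the start of the line, not at every character.
pattern = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

match = pattern.search(text)
ingredients = match.group(1).splitlines() if match else []
print(ingredients)
```

The attempt that starts at the first Ingredients line is abandoned when the second Ingredients line is reached, so group 1 should contain only the lines between the second Ingredients and Preparation.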

Try making your first .* greedy. It will eat all Ingredients up until the last one before Preparation:

(?s).*Ingredients(.*?)Preparation.*

Demo: https://regex101.com/r/mQ5eK5/1
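
To see the difference between the two quantifiers side by side, here is a small Python comparison of the pattern from the question and the greedy variant. Python and the shortened sample `text` are assumptions made for illustration:

```python
import re

# Hypothetical, shortened stand-in for file.txt: "Ingredients" appears twice.
text = """Recipes & Menus
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

lazy_first = r"(?s).*?Ingredients(.*?)Preparation.*"   # pattern from the question
greedy_first = r"(?s).*Ingredients(.*?)Preparation.*"  # first .* made greedy

lazy_lines = re.match(lazy_first, text).group(1).strip().splitlines()
greedy_lines = re.match(greedy_first, text).group(1).strip().splitlines()

print(len(lazy_lines), lazy_lines[0])      # capture starts after the first "Ingredients"
print(len(greedy_lines), greedy_lines[0])  # capture starts after the last one
```

The lazy version stops at the first Ingredients it can find, so its capture includes the navigation lines and the second Ingredients heading; the greedy version skips past them.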
