Data Extraction using Regex
I have data in a text file "file.txt":
Recipes & Menus
Expert Advice
Ingredients
Holidays & Events
Community
Video
SUMMER COOKING
Lentil and Brown Rice Soup
Gourmet January 1991
3.5/4
reviews (83)
90%
make it again
Some soups genuinely do inspire a devotion akin to love, and this is one of
them. In the cold of winter, when Gourmet editors ponder the matter of what soup
Cook
Reviews (83)
YIELD: Makes about 14 cups, serving 6 to 8
Ingredients
5 cups chicken broth
1 1/2 cups lentils, picked over and rinsed
1 cup brown rice
a 32- to 35-ounce can tomatoes, drained, reserving the juice, and chopped
3 carrots, halved lengthwise and cut crosswise into 1/4-inch pieces
1 onion, chopped
1 stalk of celery, chopped
3 garlic cloves, minced
1/2 teaspoon crumbled dried basil
1/2 teaspoon crumbled dried orégano
1/4 teaspoon crumbled dried thyme
1 bay leaf
1/2 cup minced fresh parsley leaves
2 tablespoons cider vinegar, or to taste
Preparation
In a heavy kettle combine the broth, 3 cups water, the lentils, the rice, the tomatoes with the reserved juice,
I want to extract the data between Ingredients and Preparation.
I wrote the following regex for it:
(?s).*?Ingredients(.*?)Preparation.*
But it extracts the data between the Ingredients in italics on the 3rd line of file.txt and Preparation, not the data between the later Ingredients heading and Preparation.
What changes should I make to my regex to resolve this problem?
Thanks in advance!
Make the first .* greedy and keep the lazy quantifier .*? only for the second .*:
(?s).*\bIngredients\b(.*?)\bPreparation\b
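To see the difference between the lazy and the greedy first quantifier, here is a small Python sketch. The sample string is a made-up, shortened stand-in for file.txt, and the variable names are only illustrative.

```python
import re

# Hypothetical, shortened stand-in for file.txt: a menu entry "Ingredients"
# appears before the real "Ingredients" heading.
text = """Expert Advice
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Pattern from the question: the lazy first .*? stops at the FIRST "Ingredients".
lazy = re.search(r"(?s).*?Ingredients(.*?)Preparation.*", text)

# Pattern from this answer: the greedy first .* backtracks to the LAST
# "Ingredients" that is still followed by "Preparation".
greedy = re.search(r"(?s).*\bIngredients\b(.*?)\bPreparation\b", text)

lazy_block = lazy.group(1).strip()
greedy_block = greedy.group(1).strip()

print(greedy_block)
```

With this sample, the lazy version also captures the menu lines and the second Ingredients, while the greedy version captures only the two ingredient lines.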
Or you can make use of a tempered greedy token, and then you do not need the first .*:
(?s)\bIngredients\b(?:(?!\b(?:Ingredients|Preparation)\b).)*\bPreparation\b
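The tempered greedy token can be tried out in Python as well. Note that the pattern above has no capture group, so this sketch wraps the token in one; that group, the sample text, and the variable names are additions for illustration, not part of the answer's pattern.

```python
import re

# Hypothetical, shortened stand-in for file.txt.
text = """Expert Advice
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Same pattern as above, with an added capture group around the tempered token.
# The token consumes one character at a time, but only while neither
# "Ingredients" nor "Preparation" starts at the current position.
pattern = re.compile(
    r"(?s)\bIngredients\b"
    r"((?:(?!\b(?:Ingredients|Preparation)\b).)*)"
    r"\bPreparation\b"
)

match = pattern.search(text)
block = match.group(1).strip()
print(block)
```

Starting at the first Ingredients fails because the token refuses to cross the second one, so the engine ends up matching from the second Ingredients.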
(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*
[*]{2} tells the regex that you want one of the characters in the list (here a single *) exactly twice ({2}).
I prefer character classes to escaping; I find them more readable than this:
(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*
Depending on the language you're using, you may have to escape the backslash too.
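As a rough illustration in Python, the three spellings below should behave identically. The sketch assumes the target headings are literally wrapped in double asterisks in the file (which is what these patterns look for); the sample text and variable names are made up.

```python
import re

# Assumed sample: only the real headings are wrapped in literal "**".
text = (
    "Ingredients\n"
    "Holidays & Events\n"
    "**Ingredients**\n"
    "1 onion, chopped\n"
    "**Preparation**\n"
    "In a heavy kettle combine the broth\n"
)

# Character-class version: no backslashes, so no string-escaping concerns.
char_class = r"(?s).*?[*]{2}Ingredients[*]{2}(.*?)[*]{2}Preparation[*]{2}.*"

# Escaped version in a raw string: backslashes are passed through as-is.
escaped_raw = r"(?s).*?\*{2}Ingredients\*{2}(.*?)\*{2}Preparation\*{2}.*"

# Escaped version in a normal string: each backslash has to be doubled.
escaped_plain = "(?s).*?\\*{2}Ingredients\\*{2}(.*?)\\*{2}Preparation\\*{2}.*"

results = [
    re.match(p, text).group(1).strip()
    for p in (char_class, escaped_raw, escaped_plain)
]
print(results)
```

In languages without raw string literals, the doubled-backslash form is the one you would typically need.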
You can use a lookahead that checks that each line is not Ingredients. This way the test runs only at the start of each line (instead of at every character):
(?m)^Ingredients\R((?:(?!Ingredients$).*\R)+?)Preparation$
Pattern details:
(?m)                      # switch on multiline mode (^ and $ match at line boundaries)
^Ingredients\R            # "Ingredients" at the start of a line, followed by a newline
(                         # capture group 1
    (?:                   # open a non-capturing group
        (?!Ingredients$)  # negative lookahead: the line is not "Ingredients"
        .*\R              # the line
    )+?                   # repeat lazily until "Preparation"
)
Preparation$
Note: since you didn't say which regex engine you use, it is possible that \R is not supported. In this case, replace it with \r?\n.
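For example, Python's built-in re module does not accept \R as far as I know, so a sketch there would use the \r?\n fallback. The sample text (with plain \n line endings) and the variable names below are assumptions for illustration.

```python
import re

# Hypothetical, shortened stand-in for file.txt, using "\n" line endings.
text = """Expert Advice
Ingredients
Holidays & Events
Lentil and Brown Rice Soup
Ingredients
5 cups chicken broth
1 cup brown rice
Preparation
In a heavy kettle combine the broth
"""

# Line-based pattern with \R replaced by \r?\n.
# The lookahead is tested once per line, not once per character.
pattern = re.compile(
    r"(?m)^Ingredients\r?\n((?:(?!Ingredients$).*\r?\n)+?)Preparation$"
)

match = pattern.search(text)
ingredients = match.group(1).splitlines()
print(ingredients)
```

The attempt starting at the menu entry fails when it reaches the second Ingredients line, so the match begins at the real heading.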
Try making your first .* greedy. It will eat every Ingredients up to the last one before Preparation:
(?s).*Ingredients(.*?)Preparation.*
Demo: https://regex101.com/r/mQ5eK5/1