
Regex to find all words after a specific word?

I have a string like the one below:

Features:  -Includes hanging accessories.  -Artist: William-Adolphe Bouguereau.  -Made with 100pct cotton canvas.  -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks.  -100pct Hand-made and inspected in the U.S.A.  -Orientation: Horizontal.  **Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both.  Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above.  Style: -Fine art.  Color: -Blue.  Country of Manufacture: -United States.  Product Type: -Print of painting.  Region: -Europe.  Primary Art Material: -Canvas. Dimensions:  -8'' H x 12'' W x 0.75'' D: 0.72 lb.  -12'' H x 18'' W x 0.75'' D: 1.14 lbs.  -12'' H x 18'' W x 1.5'' D: 2.45 lbs.  -18'' H x 26'' W x 0.75'' D: 1.44 lbs.  Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys

I have to find the words that follow a particular word.

I want to extract the words after the word "Subject" in the example above.

The output should look like this:

Subject: -Figures/Nautical and beach.

I tried the regex below:

re.compile('(?<=subject)(.{30}(?:\s|.))',re.I)

But there is no fixed number of words after the "Subject" keyword, so I can't specify an exact count.

How do I stop at a period or a space? There is no specific stopping criterion.

Your (?<=subject)(.{30}(?:\s|.)) regex asserts the position right after subject, then grabs exactly 30 characters other than a line break, and then matches either a whitespace character or any character other than a line break. This does not fit your requirements, because the substring can be of any length.
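To make the fixed-length behaviour concrete, here is a minimal sketch that runs the pattern from the question against a shortened stand-in for the product description (the `text` value is an assumption made for brevity, not the full string from the question):

```python
import re

# Shortened stand-in for the long product description (an assumption for brevity).
text = "-Orientation: Horizontal.  **Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both."

# The pattern from the question: a lookbehind, exactly 30 characters, then one more.
fixed = re.compile(r'(?<=subject)(.{30}(?:\s|.))', re.I)

m = fixed.search(text)
if m:
    print(repr(m.group(1)))
    # The captured length is always 30 + 1, wherever the value really ends.
    print(len(m.group(1)))
```

Because the length is hard-coded, a shorter value would run into the next field and a longer one would be cut off.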

You may use an alternation-based regex with a capturing group:

subject:\s*([^.]+|\S+)

See the regex demo.

Details:

  • subject: - the literal string subject:
  • \s* - zero or more whitespace characters
  • ([^.]+|\S+) - Group 1, capturing either one or more characters other than a period, or one or more non-whitespace characters

Note: the order of the alternatives matters here, since [^.]+ matches spaces and \S+ does not. If the substring after \s* starts with a dot, [^.]+ cannot match, so \S+ matches that substring up to the next whitespace.

Python demo:

import re
p = re.compile(r'subject:\s*([^.]+|\S+)', re.IGNORECASE)
s = "Features:  -Includes hanging accessories.  -Artist: William-Adolphe Bouguereau.  -Made with 100pct cotton canvas.  -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks.  -100pct Hand-made and inspected in the U.S.A.  -Orientation: Horizontal.  **Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both.  Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above.  Style: -Fine art.  Color: -Blue.  Country of Manufacture: -United States.  Product Type: -Print of painting.  Region: -Europe.  Primary Art Material: -Canvas. Dimensions:  -8'' H x 12'' W x 0.75'' D: 0.72 lb.  -12'' H x 18'' W x 0.75'' D: 1.14 lbs.  -12'' H x 18'' W x 1.5'' D: 2.45 lbs.  -18'' H x 26'' W x 0.75'' D: 1.44 lbs.  Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys"
m = p.search(s)
if m:
    print(m.group())    # this includes Subject: 
    print(m.group(1))   # this does not include Subject: 
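The note about the order of the alternatives can be checked with a short sketch. The `sample` string is a shortened stand-in, and the `swapped` and `with_period` patterns are hypothetical variants written for this illustration, not part of the answer above:

```python
import re

# Shortened stand-in for the product description (an assumption for brevity).
sample = "**Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both."

# Pattern from the answer: [^.]+ is tried first, \S+ is the fallback.
answer = re.compile(r'subject:\s*([^.]+|\S+)', re.IGNORECASE)
# Hypothetical variant with the alternatives swapped.
swapped = re.compile(r'subject:\s*(\S+|[^.]+)', re.IGNORECASE)
# Hypothetical variant that also consumes the trailing period, if present.
with_period = re.compile(r'subject:\s*[^.]+\.?', re.IGNORECASE)

in_order = answer.search(sample).group(1)
reordered = swapped.search(sample).group(1)
dot_first = answer.search("Subject: .hidden value").group(1)
full = with_period.search(sample).group()

print(in_order)   # -Figures/Nautical and beach
print(reordered)  # -Figures/Nautical
print(dot_first)  # .hidden
print(full)       # Subject: -Figures/Nautical and beach.
```

With the alternatives swapped, \S+ wins and the capture stops at the first space. The `with_period` variant is one possible way to get the trailing period that the expected output in the question includes.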

Try:

re.compile('Subject: [^*]+')

Demo
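A quick sketch of how the `Subject: [^*]+` pattern behaves, using a shortened stand-in for the input. Both test strings are assumptions for illustration. The second one shows what happens when the value is not followed by an asterisk:

```python
import re

p = re.compile('Subject: [^*]+')

# Shortened stand-in where the field is wrapped in ** markers, as in the question.
marked = "-Orientation: Horizontal.  **Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both."
# Hypothetical input without any asterisks after the field.
unmarked = "Subject: -Fine art.  Color: -Blue."

with_marker = p.search(marked).group()
without_marker = p.search(unmarked).group()

print(with_marker)     # Subject: -Figures/Nautical and beach.
print(without_marker)  # Subject: -Fine art.  Color: -Blue.
```

The pattern relies on the closing `**` to end the match. Without an asterisk it runs to the end of the string. It is also case-sensitive and expects exactly one space after the colon.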

Regex:

(Subject:.+)\*\*

This matches Subject: and everything after it up to the '**' marker; the text before the marker is captured in group 1.

Code:

import re

# Double quotes keep the '' inch marks intact; inside a single-quoted literal, Python
# reads each '' as the end of one string and the start of the next, and drops them.
# The variable is named s so that it does not shadow the built-in str.
s = "Features:  -Includes hanging accessories.  -Artist: William-Adolphe Bouguereau.  -Made with 100pct cotton canvas.  -100pct Anti-shrink pine wood bars and Epson anti-fade ultra chrome inks.  -100pct Hand-made and inspected in the U.S.A.  -Orientation: Horizontal.  **Subject: -Figures/Nautical and beach.**  Gender: -Unisex/Both.  Size: -Mini 17'' and under/Small 18''-24''/Medium 25''-32''/Large 33''-40''/Oversized 41'' and above.  Style: -Fine art.  Color: -Blue.  Country of Manufacture: -United States.  Product Type: -Print of painting.  Region: -Europe.  Primary Art Material: -Canvas. Dimensions:  -8'' H x 12'' W x 0.75'' D: 0.72 lb.  -12'' H x 18'' W x 0.75'' D: 1.14 lbs.  -12'' H x 18'' W x 1.5'' D: 2.45 lbs.  -18'' H x 26'' W x 0.75'' D: 1.44 lbs.  Paintings Prints Tori White Wildon Photography Photos Posters Abstract Black D cor Designs Framed Hazelwood Hokku Home Landscape Oil Accent 075 12 15 18 26 40 60 8 D H W x 1 1017 1824 2532 holidays, christmas gift gifts for girls boys"

a = re.search(r'(Subject:.+)\*\*', s)
if a:  # search() returns None when nothing matches
    print(a.group(1))
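One thing to be aware of with the pattern above: `.+` is greedy, so it runs to the last `**` on the line. The sketch below uses a hypothetical input with two marked fields to show the difference. The lazy `.+?` variant is a suggested alternative, not part of the original answer:

```python
import re

# Hypothetical input in which more than one field is wrapped in ** markers.
text = "**Subject: -Figures/Nautical and beach.**  **Gender: -Unisex/Both.**"

# Greedy: .+ extends as far as possible, so the match ends at the last **.
greedy = re.search(r'(Subject:.+)\*\*', text).group(1)
# Lazy: .+? stops at the first ** that follows the field.
lazy = re.search(r'(Subject:.+?)\*\*', text).group(1)

print(greedy)  # Subject: -Figures/Nautical and beach.**  **Gender: -Unisex/Both.
print(lazy)    # Subject: -Figures/Nautical and beach.
```

For the string in the question the two behave the same, since it contains only one `**` after Subject.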

