
regex match if line matches or starts and ends with angle brackets

I'm processing a file line by line with ruby(pcre regex) and the idea is to count how many lines are used excluding page markers, empty lines and markup tags

 1. [==| Page 4 |==]
 2.
 3. 上側
 4.
 5. 勉州爛 夜 菌
 6.
 7. 洲⑪蝿 香n
 8.
 9. 本聘
10.
11. [==| Page 5 |==]
12.
13. <IMAGE
14. <IMAGE>
15. IMAGE>
16.
17. [==| Page 6 |==]
18.
19. 欝輛蓼 \縄《卿⑪儡

I know how to ignore headings and empty lines with this regex /^(?!\[==\| Page \d+ \|==\]).+$/

but I'm not quite sure how to also ignore tags. The regex I'm using to match these tags is as simple as /^<.*>$/ , and I'm not sure how to invert it.

the result after scanning should be ["上側", "勉州爛 夜 菌", "洲⑪蝿 香n", "本聘", "<IMAGE", "IMAGE>", "欝輛蓼 \縄《卿⑪儡"].length #=> 7

Chaining Inverted Matches

You have a number of ways to invert matches in Ruby, including Enumerable#grep_v and Enumerable#reject. While you could certainly do it as a complex regular expression, that makes your code much less testable and harder to read. Instead, leverage some of the core methods to build up your logic and/or regex patterns, rather than using one complex regular expression.

For example, assuming you've slurped your file into a file variable:

page_marker = /\[==\| Page \d+ \|==\]/
tag_markers = /^<.*?>$/

file.lines.map(&:chomp).
  grep_v(page_marker).
  grep_v(tag_markers).
  reject { |line| line.empty? }.
  count

#=> 7

Granted that there are many other ways to express this, the chained method approach has the key benefits of:

  1. Being fairly readable.
  2. Clearly communicating the step-wise intent of the code.
  3. Being composable, and therefore easy to modify or extend.
  4. Allowing you to see the results of each step in the method chain in irb if you need to debug it.
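
To illustrate the last point, here is a minimal sketch that splits the chain into named steps so each intermediate result can be inspected in irb. The `lines` array and the `step1`..`step3` names are assumptions made for illustration; the two patterns are the ones from the answer above.

```ruby
# Hypothetical input: a small excerpt shaped like the question's sample.
lines = ["[==| Page 5 |==]", "", "<IMAGE", "<IMAGE>", "IMAGE>", "", "本聘"]

page_marker = /\[==\| Page \d+ \|==\]/
tag_markers = /^<.*?>$/

step1 = lines.grep_v(page_marker)  # drops the page marker line
step2 = step1.grep_v(tag_markers)  # drops "<IMAGE>" only
step3 = step2.reject(&:empty?)     # drops the blank lines

p step1.length, step2.length, step3.length
p step3
```

Each step is an ordinary array, so a surprising count can be traced to the filter that produced it.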

Other answers may steer you towards negative or positive lookahead/lookbehind assertions, but for maintainability and testability I'd strongly encourage a more composable approach.
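
As one possible way to keep the patterns composable while still filtering in a single pass, the individual regexes can be merged with Regexp.union. This is only a sketch: the `blank_line` pattern, the `ignored` and `kept` names, and the stricter `\A`/`\z` anchoring are assumptions that are not part of the code above.

```ruby
# Hypothetical sample text; in practice this would come from File.read.
text = <<~TEXT
  [==| Page 4 |==]

  上側

  勉州爛 夜 菌

  [==| Page 5 |==]

  <IMAGE
  <IMAGE>
  IMAGE>
TEXT

page_marker = /\A\[==\| Page \d+ \|==\]\z/
tag_marker  = /\A<.*>\z/   # a whole line that starts with < and ends with >
blank_line  = /\A\s*\z/    # assumption: whitespace-only lines count as empty

# Regexp.union builds one pattern that matches when any of its parts match.
ignored = Regexp.union(page_marker, tag_marker, blank_line)

kept = text.lines.map(&:chomp).grep_v(ignored)
p kept
puts kept.length
```

Each small pattern can still be tested on its own, and adding another kind of line to skip means adding one more argument to Regexp.union.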

You can use alternation (|) inside the negative lookahead to exclude the other kinds of lines too:

^(?!\[==\| Page \d+ \|==\]|$|<.*>).*$
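
A minimal sketch of how this pattern might be used from Ruby with String#scan, assuming the group opens with a negative lookahead, `(?!`. The `text`, `countable`, and `lines` names are hypothetical and the input is a shortened excerpt of the question's sample.

```ruby
# Hypothetical input: one page marker, blanks, the three IMAGE variants, one text line.
text = "[==| Page 5 |==]\n\n<IMAGE\n<IMAGE>\nIMAGE>\n\n本聘\n"

# In Ruby, ^ and $ always anchor at line boundaries, so no extra flag is needed.
countable = /^(?!\[==\| Page \d+ \|==\]|$|<.*>).*$/

# The pattern has no capture groups, so scan returns the matched lines as strings.
lines = text.scan(countable)
p lines
puts lines.length
```

Note that `<.*>` inside the lookahead is not anchored at the end, so a line such as `<b> text` would also be skipped; writing `<.*>$` there would restrict the exclusion to lines that both start and end with angle brackets.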

