I'm processing a file line by line with Ruby (PCRE-style regex), and the idea is to count how many lines are used, excluding page markers, empty lines, and markup tags:
[==| Page 4 |==]

上側

勉州爛 夜 菌

洲⑪蝿 香n

本聘

[==| Page 5 |==]

<IMAGE
<IMAGE>
IMAGE>

[==| Page 6 |==]

欝輛蓼 \縄《卿⑪儡
I know how to ignore headings and empty lines with this regex: /^(?!\[==\| Page \d+ \|==\]).+$/
but I'm not quite sure how to also ignore tags. The regex I'm using to match these tags is as simple as /^<.*>$/, and I'm not sure how to invert it.
The result after scanning should be ["上側", "勉州爛 夜 菌", "洲⑪蝿 香n", "本聘", "<IMAGE", "IMAGE>", "欝輛蓼 \縄《卿⑪儡"].length #=> 7
You have a number of ways to invert matches in Ruby, including Enumerable#grep_v and Enumerable#reject. While you could certainly do it with one complex regular expression, that makes your code much less testable and harder to read. Instead, leverage some of the core methods to build up your logic from small, simple patterns.
For example, assuming you've slurped your file into a file variable:
# Assumes `file` is a String holding the whole file's contents.
page_marker = /\[==\| Page \d+ \|==\]/
tag_markers = /^<.*?>$/
file.lines.map(&:chomp).
  grep_v(page_marker).
  grep_v(tag_markers).
  reject { |line| line.empty? }.
  count
#=> 7
Granted, there are many other ways to express this, but the chained-method approach keeps each filter readable and testable on its own.
Other answers may steer you towards negative or positive lookahead/lookbehind assertions, but for maintainability and testability I'd strongly encourage a more composable approach.
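To illustrate the testability point, here is a minimal sketch that wraps the same kind of chain in a method so it can be checked against a small fixture. The names PAGE_MARKER, TAG_MARKER, and content_lines are hypothetical and not part of the original answer, and the patterns are anchored with \A and \z on the assumption that each marker or tag occupies a whole line.

```ruby
# Hypothetical names; each pattern is small enough to test in isolation.
PAGE_MARKER = /\A\[==\| Page \d+ \|==\]\z/
TAG_MARKER  = /\A<.*>\z/

# Returns the lines of `text` that are not page markers, complete tags,
# or empty lines.
def content_lines(text)
  text.lines.map(&:chomp)
      .grep_v(PAGE_MARKER)
      .grep_v(TAG_MARKER)
      .reject(&:empty?)
end

# A shortened fixture based on the sample in the question.
sample = [
  "[==| Page 4 |==]", "", "上側", "", "[==| Page 5 |==]", "",
  "<IMAGE", "<IMAGE>", "IMAGE>", "", "[==| Page 6 |==]"
].join("\n")

kept = content_lines(sample)
puts kept.length
```

Because each filter is a separate step, a failing case can be narrowed down by removing one grep_v call at a time.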
You can use alternation (|) inside the lookahead expression to exclude other lines too:
^(?!\[==\| Page \d+ \|==\]|$|<.*>).*$
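As a rough sketch of how a negative-lookahead pattern of this kind might be applied, the example below runs it over a whole string with String#scan. The constant CONTENT_LINE and the sample text are hypothetical. The tag alternative is anchored with $ here, on the assumption that only lines consisting entirely of a tag should be skipped.

```ruby
# At the start of each line, the negative lookahead refuses to match if the
# line is a page marker, is empty, or is a complete <...> tag.
# In Ruby, ^ and $ match at the start and end of every line.
CONTENT_LINE = /^(?!\[==\| Page \d+ \|==\]$|$|<.*>$).+$/

# Hypothetical sample, shortened from the question's data.
text = "[==| Page 4 |==]\n\n上側\n\n<IMAGE\n<IMAGE>\nIMAGE>\n"

# With no capture groups, scan returns an array of the matched lines.
matches = text.scan(CONTENT_LINE)
puts matches.length
```

This does the whole job in one pattern, at the cost of packing all three exclusion rules into a single expression.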