How to use grep with regex and patterns from a file together?

Suppose there is a file that contains a lot of patterns:

.com
.re
.net
...

And there is a file that contains a lot of data:

www.recent
www.remix3d.com
www.verisign.net

The outcome I want is:

www.remix3d.com
www.verisign.net

I use the command grep -f pattern_file data_file, but the outcome is:

www.recent
www.remix3d.com
www.verisign.net

This happens because the pattern .re matches www'.re'cent.

How can I make the patterns in the file work together with general regex syntax? For example, I want to grep for data that ends with specific patterns, where the patterns come from the pattern file.

The pattern file must contain real regex patterns (i.e., with special characters properly escaped). I suggest modifying your pattern file like this:

\.com$
\.re$
\.net$
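
As a quick check of that suggestion, here is a minimal sketch that recreates the sample files from the question in a temporary directory (the directory and the variable names are only illustrative) and runs grep -f against the anchored patterns:

```shell
#!/bin/sh
# Illustrative only: rebuild the question's sample files in a temp directory.
workdir=$(mktemp -d)
printf '%s\n' '\.com$' '\.re$' '\.net$' > "$workdir/pattern_file"
printf '%s\n' 'www.recent' 'www.remix3d.com' 'www.verisign.net' > "$workdir/data_file"

# Each line of pattern_file is a regex: an escaped dot, the suffix,
# and an end-of-line anchor.
matches=$(grep -f "$workdir/pattern_file" "$workdir/data_file")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

With these inputs, www.recent is no longer printed, because .re has to appear at the very end of the line.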

If you don't want to change the pattern file, then you must do the escaping externally. For example:

> cat pattern
.com
.re
.net
> cat pattern_data
www.recent
www.remix3d.com
www.verisign.net
> grep "$(sed 's/$/$/g;s/^/\\/g' pattern | tr '\n' '|' | sed 's/.$//g;s/|/\\|/g')" pattern_data
www.remix3d.com
www.verisign.net
>

Note that tools already exist for this kind of matching on domain names, namely those that process the public suffix list. Many libraries are available for processing it, and some of them are heavily optimized and will be much faster than a list of regular expressions if the list of suffixes is large.

It sounds like your criteria are actually:

  • The pattern file is actually a list of STRINGS rather than a list of regular expressions (in which a dot (.) matches any single character),
  • The patterns are intended to be matched only at the ENDS of strings (so there's an implicit $ at the end of each line in the pattern file).

To meet the first criterion, you can use grep's -F option:

$ grep -F -f pattern_file data_file

But this doesn't help with .re, which still matches mid-way through one of the lines (www.recent). If you can modify your pattern file, changing the lines to look like:

\.com$
\.re$
\.net$

would turn them into the regular expressions you want. Otherwise, you might have to use something to PARSE that pattern file in order to create the regex you're looking for. For example, using a bash array and some Parameter Expansion:

$ mapfile -t a < pattern_file
$ declare -p a
declare -a a=([0]=".com" [1]=".re" [2]=".net")
$ printf -v new_re '|%s' "${a[@]}"
$ new_re="${new_re//./\\.}"         # escape dots within regex
$ new_re="(${new_re:1})\$"          # strip leading or-bar
$ echo "$new_re"
(\.com|\.re|\.net)$
$ grep -E "$new_re" data_file
www.remix3d.com
www.verisign.net

Or if you don't mind relying on one more tool to reduce the line count:

$ grep -f <(sed 's/\./\\./g;s/$/$/' pattern_file) data_file
www.remix3d.com
www.verisign.net
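
If the suffixes really are plain strings, the two criteria above can also be checked without building a regex at all. The following is a minimal sketch using awk, which is a swapped-in technique rather than anything from the commands above. It compares the tail of each data line against every suffix as a literal string; the temporary directory and variable names are only illustrative.

```shell
#!/bin/sh
# Illustrative only: rebuild the question's sample files in a temp directory.
workdir=$(mktemp -d)
printf '%s\n' '.com' '.re' '.net' > "$workdir/pattern_file"
printf '%s\n' 'www.recent' 'www.remix3d.com' 'www.verisign.net' > "$workdir/data_file"

# First file: remember every non-empty line as a suffix.
# Second file: print a line if its tail equals any suffix (no regex involved).
matches=$(awk '
    NR == FNR { if ($0 != "") suffix[$0]; next }
    {
        for (s in suffix) {
            n = length(s)
            if (length($0) >= n && substr($0, length($0) - n + 1) == s) {
                print
                break
            }
        }
    }
' "$workdir/pattern_file" "$workdir/data_file")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Because nothing is interpreted as a regex, no escaping is needed, at the cost of a loop over all suffixes for every data line. Note that the NR == FNR idiom assumes the pattern file is not empty.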

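You can use grep -f with sed in a process substitution that converts each extension in pattern_file into a proper regex. The sed command escapes only the first character of each line, which is enough here because every entry starts with its only dot: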

grep -f <(sed 's/.*/\\&$/' pattern_file) data_file

www.remix3d.com
www.verisign.net

The output of the sed command is:

sed 's/.*/\\&$/' pattern_file

\.com$
\.re$
\.net$
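
If an entry could contain regex metacharacters after its first character, a slightly more defensive variant is to escape every metacharacter of a basic regular expression before appending the anchor. This is a sketch under assumptions: .co.uk is a made-up extra suffix that is not in the question, and regex_file is a hypothetical intermediate file used so the script also runs in shells without process substitution.

```shell
#!/bin/sh
workdir=$(mktemp -d)
# '.co.uk' is a made-up suffix with a second dot in the middle.
printf '%s\n' '.co.uk' '.re' > "$workdir/pattern_file"
printf '%s\n' 'www.example.co.uk' 'www.example.coXuk' 'www.recent' > "$workdir/data_file"

# Escape every BRE metacharacter ( ] [ \ . * ^ $ ), then append the
# end-of-line anchor so the anchor itself is not escaped.
escaped=$(sed 's/[][\.*^$]/\\&/g; s/$/$/' "$workdir/pattern_file")
printf '%s\n' "$escaped" > "$workdir/regex_file"

matches=$(grep -f "$workdir/regex_file" "$workdir/data_file")
printf '%s\n' "$matches"

rm -rf "$workdir"
```

Here the second dot in .co.uk is escaped too, so www.example.coXuk is not treated as a match.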
