A regular expression to grep specific paragraphs of a file

Hi, I am working on a shell script. Suppose this is the data my shell script runs on:

      Ownership
               o Australian Owned
   ?
   Ads for Mining Engineers
   232 results for
mining engineers in All States
   filtered by Mining Engineers [x] category
     * [ ]
                    [34]get directions
       Category:
       [35]Mining Engineers
       [36]Arrow Electrical Services in Wollongong, NSW under Mining
       Engineers logo
            [37]email
            [38]send to mobile
            [39]info
            Compare (0)
     * [ ]
       . [40]Firefly International
       Designers & Manufacturers. Service, Repair & Hire.
       We are the provider of mining engineers in Mt Thorley, NSW.
       25 Thrift Cl, Mt Thorley NSW 2330
       ph: (02) 6574 6660
            [41]http://www.fireflyint.com.au
            [42]get directions
       Category:
       [43]Mining Engineers
       [44]Firefly International in Mt Thorley, NSW under Mining Engineers
       logo
            [45]email
            [46]send to mobile
            [47]info
            Compare (0)
     * [ ]
       [48]Materials Solutions
       Materials Research & Development, Slurry Rheology & Piping Design.
       We are a well established company servicing the mining industry &
       associated manufacturing industries in all areas.
       Thornlie WA 6108
       ph: (08) 6468 4118
            [49]www.materialssolutions.com.au
       Category:
       [50]Mining Engineers
       [51]Materials Solutions in Thornlie, WA under Mining Engineers logo
            [52]email
            [53]send to mobile
            [54]info
            Compare (0)
     * [ ]
       . [55]ATC Williams Pty Ltd
       Our services are available from concept to completion of the works.
       Today, as the rebranded ATC Williams, we continue to expand our
       operations across Australia and in locations around the world.
       Unit 1, 21 Teddington Rd, Burswood WA 6100
       ph: (08) 9355 1383
            [56]www.atcwilliams.com.au
            [57]get directions
       Category:
       [58]Mining Engineers
       [59]ATC Williams Pty Ltd in Burswood, WA under Mining Engineers
       logo
            [60]email
            [61]send to mobile
            [62]info
            Compare (0)

and I need to grab the address blocks that look like this:

 * [ ]
       . [55]ATC Williams Pty Ltd
       Our services are available from concept to completion of the works.
       Today, as the rebranded ATC Williams, we continue to expand our
       operations across Australia and in locations around the world.
       Unit 1, 21 Teddington Rd, Burswood WA 6100
       ph: (08) 9355 1383
            [56]www.atcwilliams.com.au

So what do I do? I've been working on regular expressions like

^*(.?[\\w\\W?\\s?]*)+(.com.au)$

but that's not helping. It matches when the input file contains only the block I want, but when I run it against the bulk data it doesn't work. Can somebody help me out?

I see some issues with your regex

^*(.?[\w\W?\s?]*)+(.com.au)$
 ^ ^           ^ ^ ^   ^
 1 1           2 2 1   1
  1. Special characters that need escaping (the "*" and the dots).

  2. Greedy quantifiers that match everything up to the last ".com.au". Add a ? after the quantifier to make it lazy (non-greedy), so that it matches as little as possible, that is, only up to the first ".com.au" found at the end of a line.

    ==> This is your main problem

  3. You nest quantifiers, *)+ ; you don't need that.

  4. In your example there is whitespace between the "*" and the ".", so either match the whitespace or drop the dot altogether; your character class will match it anyway.

  5. There is also whitespace between the start of the line and the "*".
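
Point 1 is easy to check. The sketch below uses Python's `re` module purely for illustration (the question itself is about a shell script), and the string `telecom_au` is a made-up test value, not something from the data.

```python
import re

# Unescaped: each "." matches ANY character, not just a literal dot.
unescaped = re.compile(r".com.au$")
# Escaped: "\." matches only a literal dot.
escaped = re.compile(r"\.com\.au$")

real_url = "www.atcwilliams.com.au"   # taken from the sample data
lookalike = "telecom_au"              # made-up string, not a domain

unescaped_hits = [s for s in (real_url, lookalike) if unescaped.search(s)]
escaped_hits = [s for s in (real_url, lookalike) if escaped.search(s)]

print(unescaped_hits)  # both strings match
print(escaped_hits)    # only the real URL matches
```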

So, try this

    ^\s*\*([\w\W?\s?]*?)(\.com\.au)$

See it here on Regexr
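
To see the effect of point 2, here is a small sketch in Python (chosen only for illustration; `re.MULTILINE` is needed so that `^` and `$` match at line boundaries). `SAMPLE` is a trimmed copy of the listing from the question, and the greedy variant is simply the pattern above with the `?` after `*` removed.

```python
import re

# Trimmed copy of the question's data: one entry without a website,
# followed by two entries that end in a ".com.au" line.
SAMPLE = """\
     * [ ]
       [36]Arrow Electrical Services in Wollongong, NSW under Mining
            [37]email
     * [ ]
       . [40]Firefly International
       25 Thrift Cl, Mt Thorley NSW 2330
       ph: (02) 6574 6660
            [41]http://www.fireflyint.com.au
            [42]get directions
     * [ ]
       . [55]ATC Williams Pty Ltd
       Unit 1, 21 Teddington Rd, Burswood WA 6100
       ph: (08) 9355 1383
            [56]www.atcwilliams.com.au
            [57]get directions
"""

greedy = re.compile(r"^\s*\*([\w\W?\s?]*)(\.com\.au)$", re.MULTILINE)
lazy = re.compile(r"^\s*\*([\w\W?\s?]*?)(\.com\.au)$", re.MULTILINE)

greedy_matches = [m.group(0) for m in greedy.finditer(SAMPLE)]
lazy_matches = [m.group(0) for m in lazy.finditer(SAMPLE)]

print(len(greedy_matches), "greedy match(es)")
print(len(lazy_matches), "lazy match(es)")
for block in lazy_matches:
    print(block)
    print("-----")
```

On this sample the greedy pattern produces a single match that runs to the last ".com.au", while the lazy one produces two. The first lazy match still starts at the Arrow Electrical entry, which has no website, because nothing in the pattern stops it from running past the next "*".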

Try this

^\s*\*\s*\[ \][^\*]+?[.]com[.]au$

explanation

^        # Assert position at the beginning of a line (at beginning of the string or after a line break character)
\s       # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   *        # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\*       # Match the character “*” literally
\s       # Match a single character that is a “whitespace character” (spaces, tabs, and line breaks)
   *        # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\[       # Match the character “[” literally
\        # Match the character “ ” literally
\]       # Match the character “]” literally
[^\*]    # Match any character that is NOT a * character
   +?       # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
[.]      # Match the character “.”
com      # Match the characters “com” literally
[.]      # Match the character “.”
au       # Match the characters “au” literally
$        # Assert position at the end of a line (at the end of the string or before a line break character)
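
A minimal sketch of how this pattern behaves, written in Python for illustration only (the `re.MULTILINE` flag and the trimmed `SAMPLE` text are assumptions made for the demo, not part of the answer). Because `[^\*]` cannot cross a "*", each match stays inside a single "* [ ]" entry, and entries without a ".com.au" line are skipped.

```python
import re

# Trimmed copy of the question's data: the first entry has no website.
SAMPLE = """\
     * [ ]
       [36]Arrow Electrical Services in Wollongong, NSW under Mining
            [37]email
     * [ ]
       . [40]Firefly International
       25 Thrift Cl, Mt Thorley NSW 2330
       ph: (02) 6574 6660
            [41]http://www.fireflyint.com.au
            [42]get directions
     * [ ]
       . [55]ATC Williams Pty Ltd
       Unit 1, 21 Teddington Rd, Burswood WA 6100
       ph: (08) 9355 1383
            [56]www.atcwilliams.com.au
            [57]get directions
"""

# MULTILINE makes ^ and $ match at the start and end of every line.
entry = re.compile(r"^\s*\*\s*\[ \][^\*]+?[.]com[.]au$", re.MULTILINE)

entries = [m.group(0) for m in entry.finditer(SAMPLE)]

for block in entries:
    print(block)
    print("-----")
```

One caveat with this approach: an entry whose description itself contains a "*" would not match, since the negated class stops at the first asterisk.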
