
Java Regex to extract specific words

I am trying to extract every occurrence of 'and', 'a', 'the', 'an' and '&' from a block of text, along with every number it contains.

I tried several regexes for that purpose but failed to get an accurate result.

The digits are extracted fine, but I am unable to fetch the aforementioned strings with a regex.

My basic regex was

 Pattern p = Pattern.compile("^[0-9]");

Then I tried different combinations like

 Pattern p = Pattern.compile("^[0-9](&)");
 Pattern p = Pattern.compile("^[0-9]+[&]");

to get the aforementioned strings, but to no avail.

Example text:

System requirements: iOS 6.0 and Android (varies) &
Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)

Expected Result

 6.0,and,&,2.2.4,13.1.2

You are nowhere even close with your "attempts" and I almost feel bad for just handing you the solution, but if you really are "keen to learn new things" (as you say in your SO profile), have a look at a regex tutorial.

A basic use of alternation, grouping, quantifiers and anchors (word boundaries) will solve your problem.

(\b(?:a|an|and|the)\b|&|\d+(?:\.\d+)*)

Explanation:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      a                        'a'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      an                       'an'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      and                      'and'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      the                      'the'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    \b                       the boundary between a word char (\w)
                             and something that is not a word char
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    &                        '&'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      \.                       '.'
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1
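
To see why the word boundaries matter, here is a minimal sketch that runs the alternation with and without `\b` on part of the sample text. The class name `BoundaryDemo` and the helper `find` are made up for this illustration, and the patterns are written as Java string literals, so each backslash is doubled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class BoundaryDemo {
    // Collects every match of the given regex in the text, in order.
    static List<String> find(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "iOS 6.0 and Android (varies)";

        // Without boundaries the first alternative 'a' wins inside "and"
        // and also matches the 'a' inside "varies".
        System.out.println(find("a|an|and|the", text));           // [a, a]

        // With \b on both sides only the whole word "and" matches.
        System.out.println(find("\\b(?:a|an|and|the)\\b", text)); // [and]
    }
}
```

Matching is case-sensitive by default, which is why the capital 'A' in "Android" is not picked up in either run.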

For use in Java, you would have to double every backslash inside the string literal (\b becomes \\b):

(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)
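
As a minimal sketch of how this pattern could be applied to the example text from the question: the class name `ExtractTokens` and the method `extract` are assumptions made for this illustration; `Pattern`, `Matcher.find()` and `Matcher.group(int)` are the standard `java.util.regex` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class, not part of the original answer.
class ExtractTokens {
    // The regex from above, compiled once and reused.
    private static final Pattern TOKENS =
            Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");

    // Returns every match in the order it appears in the text.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = TOKENS.matcher(text);
        while (m.find()) {       // find() scans forward; no ^ anchor needed
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "System requirements: iOS 6.0 and Android (varies) &\n"
                + "Version used in this guide: 2.2.4 (iPhone), 13.1.2 (Android)";
        System.out.println(String.join(",", extract(text)));
        // prints 6.0,and,&,2.2.4,13.1.2
    }
}
```

The loop over `find()` is what collects all matches; the `^` anchor in the original attempts restricts the match to the very start of the input, which is why those attempts could not find anything further into the text.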

You can use the following regex:

(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)
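
This shorter pattern gives the same result as the explicit alternation on the example text, but the two are not equivalent in general. The sketch below runs both on a made-up probe string to show where they diverge; the class name `PatternComparison`, the helper `matches` and the probe input are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, not part of either answer.
class PatternComparison {
    // Explicit alternation with a strict number format.
    static final Pattern EXPLICIT =
            Pattern.compile("(\\b(?:a|an|and|the)\\b|&|\\d+(?:\\.\\d+)*)");
    // Compact variant with optional letters and a character class.
    static final Pattern COMPACT =
            Pattern.compile("(\\ban?d?\\b|\\bthe\\b|\\B&\\B|[\\d.]+)");

    // Collects every match of the pattern in the text, in order.
    static List<String> matches(Pattern p, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String probe = "an ad for R&D, v2."; // made-up input

        System.out.println(matches(EXPLICIT, probe)); // [an, &, 2]
        System.out.println(matches(COMPACT, probe));  // [an, ad, 2.]
    }
}
```

In this probe, `an?d?` also accepts "ad", `\B&\B` skips an ampersand that touches word characters, and `[\d.]+` keeps the trailing period, so which pattern fits depends on how the real input looks.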

See DEMO
