I need to remove ~50 color names from a vehicle description field in a large data file using Java/RegEx without arrays or loops

I'm using a data integration tool that allows Java to help transform data. The problem is that I can't create arrays or use loops because they are not supported.

I have fields that have String values like:

04 Blue Honda Accord
12 Inferno Red Chevrolet Tahoe
10 Purple Ford Taurus

I just need the colors stripped out so they'd be:

04 Honda Accord
12 Chevrolet Tahoe
10 Ford Taurus

The only thing I can think of is creating a regular expression with the 49 color names, but I'm pretty bad at regex.

UPDATED to eliminate the extra space in the output (changed (.+) to [SPACE](.+) ).

Assuming each car is on a separate line, this works:

^(\d+ )(?:Blue|Red|Purple|Inferno Red) (.+)$

Use

sNumColorCarLine.replaceAll(sTheRegex, "$1$2");

or

sNumColorCarLine.replaceFirst(sTheRegex, "$1$2");

to make the replacement. To be efficient though, especially if there are a lot of data lines, use the following, which avoids re-compiling the pattern (and re-creating the matcher) for every line:

import  java.util.regex.Pattern;
import  java.util.regex.Matcher;

/**
   <P>{@code java RemoveColorFromCarLinesNoLoops}</P>
 **/
public class RemoveColorFromCarLinesNoLoops  {
   public static final void main(String[] igno_red)  {

      //Add colors as necessary
      String sColorsNonCaptureOr = "(?:Blue|Red|Purple|Inferno Red)";
      String sRegex = "" +
         "^(\\d+ )" +            //one-or-more digits, then one space
         sColorsNonCaptureOr +   //color
         " (.+)$";               //Everything after the color (space uncaptured)

      String sRplcWith = "$1$2";

         //"": Unused search-string, so matcher can be reused.
      Matcher m = Pattern.compile(sRegex).matcher("");

      String sColorRemoved1 = removeColorFromCarLine(m, "04 Blue Honda Accord", sRplcWith);
      String sColorRemoved2 = removeColorFromCarLine(m, "12 Inferno Red Chevrolet Tahoe", sRplcWith);
      String sColorRemoved3 = removeColorFromCarLine(m, "10 Purple Ford Taurus", sRplcWith);
   }
   private static final String removeColorFromCarLine(Matcher m_m, String s_origCarLine, String s_rplcWith)  {
      m_m.reset(s_origCarLine);
      if(!m_m.matches())  {
         throw  new IllegalArgumentException("Does not match: \"" + s_origCarLine + "\", pattern=[" + m_m.pattern() + "]");
      }

      //Since it matches(s), this is equivalent to "replace the entire line, as a whole"
      String s = m_m.replaceFirst(s_rplcWith);
      System.out.println(s_origCarLine + "  -->  " + s);
      return  s;
   }
}

Output

[C:\java_code\]java RemoveColorFromCarLinesNoLoops
04 Blue Honda Accord  -->  04 Honda Accord
12 Inferno Red Chevrolet Tahoe  -->  12 Chevrolet Tahoe
10 Purple Ford Taurus  -->  10 Ford Taurus
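
If the integration tool only accepts a single expression per field, the same pattern can probably be applied inline with String.replaceFirst. The following is a minimal sketch under that assumption: the class name InlineColorStrip and the method stripColor are hypothetical, and the color list is shortened to the four names used above. Unlike the class above, a value that does not match is returned unchanged instead of raising an exception.

```java
public class InlineColorStrip {
    // Shortened, illustrative color list; extend the alternation as needed.
    private static final String REGEX =
        "^(\\d+ )(?:Inferno Red|Blue|Red|Purple) (.+)$";

    // Hypothetical helper: removes one leading color name after the digits.
    static String stripColor(String field) {
        // replaceFirst returns the input untouched when nothing matches.
        return field.replaceFirst(REGEX, "$1$2");
    }

    public static void main(String[] args) {
        System.out.println(stripColor("04 Blue Honda Accord"));
        System.out.println(stripColor("12 Inferno Red Chevrolet Tahoe"));
        System.out.println(stripColor("09 Toyota Camry")); // no color: unchanged
    }
}
```

String.replaceFirst compiles the pattern on every call, so for very large files the reusable Matcher shown above is likely to be faster.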

Assuming that each field holds a single value in the form digits, color, description, you can try something like:

text = text.replaceFirst("(?i)(?<=\\d)\\s(red|green|blue)\\b","");
  • (?i) makes the regex case-insensitive, so "red" also matches "Red", "RED", "ReD" and so on
  • (?<=\\d) is a look-behind that checks that the character just before the match is a digit
  • \\s matches a whitespace character
  • (red|green|blue) means red OR green OR blue
  • \\b is a word boundary, so that a color is not matched when it is only the start of a longer word in the description, such as greenhouses
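
The pieces listed above can be tried together in a small sketch. The class name LookbehindColorStrip, the method stripColor, and the extended color list (including "inferno red" and "purple") are assumptions added for illustration, not part of the original answer.

```java
public class LookbehindColorStrip {
    // Illustrative color list; "inferno red" is placed before "red"
    // so the longer name is tried first.
    private static final String REGEX =
        "(?i)(?<=\\d)\\s(inferno red|red|green|blue|purple)\\b";

    // Hypothetical helper: removes the whitespace and color that follow a digit.
    static String stripColor(String text) {
        return text.replaceFirst(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(stripColor("04 BLUE Honda Accord"));           // case-insensitive match
        System.out.println(stripColor("12 Inferno Red Chevrolet Tahoe")); // two-word color
        System.out.println(stripColor("05 Greenhouse Van"));              // \b blocks "Green"
    }
}
```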

BTW, if some of your color names contain other color names, like gray and gray-red, be sure to put the most specific one first, as in gray-red|gray . This matters because the regex engine tries the alternatives in order from left to right. If your text contains 12 gray-red Mercedes and you use replaceAll("gray|gray-red","") , the result is 12 -red Mercedes , because gray could be (and was) matched before gray-red was ever tried.
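
The effect of alternation order can be checked with a short sketch. The class and method names here are hypothetical; the two patterns are the ones discussed in the paragraph above.

```java
public class AlternationOrder {
    // Shorter alternative first: "gray" matches and "-red" is left behind.
    static String shortFirst(String text) {
        return text.replaceAll("gray|gray-red", "");
    }

    // Most specific alternative first: the whole "gray-red" is removed.
    static String longFirst(String text) {
        return text.replaceAll("gray-red|gray", "");
    }

    public static void main(String[] args) {
        System.out.println("[" + shortFirst("12 gray-red Mercedes") + "]");
        System.out.println("[" + longFirst("12 gray-red Mercedes") + "]");
    }
}
```

Note that these bare patterns do not consume the surrounding whitespace, so the second call leaves two spaces between 12 and Mercedes; the \\s in the earlier pattern takes care of that.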

This way you don't need to list the colors at all, and it also works when the color name has more than one word:

String field = "12 Inferno Red Chevrolet Tahoe";
// The \\s before the last group stops the greedy .* from eating into the make
field = field.replaceFirst("^(\\d+).*\\s(\\w+\\s\\w+)$", "$1 $2");
System.out.println(field);

Prints:

12 Chevrolet Tahoe

The regex keeps only the number and the last two words of the input, so it assumes the make and the model are one word each.
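
A sketch of the last-two-words idea on a few more inputs, using an anchored pattern that requires whitespace before the captured pair. The class name LastTwoWords and the method keepNumberAndLastTwoWords are hypothetical, and the third input is made up to show where the approach breaks down.

```java
public class LastTwoWords {
    // Leading digits, anything in between, then exactly two trailing words.
    private static final String REGEX = "^(\\d+).*\\s(\\w+\\s\\w+)$";

    // Hypothetical helper: keeps the number and the last two words only.
    static String keepNumberAndLastTwoWords(String field) {
        return field.replaceFirst(REGEX, "$1 $2");
    }

    public static void main(String[] args) {
        System.out.println(keepNumberAndLastTwoWords("12 Inferno Red Chevrolet Tahoe"));
        System.out.println(keepNumberAndLastTwoWords("04 Honda Accord")); // no color: unchanged
        // A two-word make loses its first word, because only two words are kept.
        System.out.println(keepNumberAndLastTwoWords("07 Silver Land Rover Discovery"));
    }
}
```

If the data contains makes or models with more than one word, listing the colors explicitly is probably the safer choice.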
