
What's the simplest way to remove extraneous leading numbers?

I have data that is reliably in this format:

    1. New York Times - USA
    2. Guardian - UK
    3. Le Monde - France

I'm using this code to parse out the newspaper and country values:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    if (hyphenIndex > -1)
    {
        newspaper = unparsedText.substring(0, hyphenIndex);
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

But this produces newspaper values of:

    1. New York Times
    2. Guardian
    3. Le Monde

What's the simplest change to make to end up with newspaper values of:

    New York Times
    Guardian
    Le Monde

Here is a regex-based solution. Strings are immutable, so keep the value that replaceAll returns:

    String newspapers = input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");

The regex works in multiline mode (?m) and deletes:

  • Leading digit(s) at the start of a line, followed by a dot and any amount of whitespace.
  • Optional whitespace, a hyphen, and everything after it up to the end of the line.

I'm assuming there are no hyphens in the newspaper name.
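As a rough sketch of how that replacement behaves on the sample data from the question, assuming all three lines sit in one string separated by newlines (the class name StripWithRegex and the method newspaperNames are made up for illustration):

```java
final class StripWithRegex {
    // Deletes the "N. " prefix and the " - Country" suffix on every line,
    // leaving only the newspaper names (one per line).
    static String newspaperNames(String input) {
        return input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");
    }

    public static void main(String[] args) {
        String input = "1. New York Times - USA\n"
                     + "2. Guardian - UK\n"
                     + "3. Le Monde - France";
        System.out.println(newspaperNames(input));
    }
}
```

Note that this only recovers the newspaper names; the country values are discarded by the second alternative, so they would still need to be extracted separately.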


Surely just find the index of the first '.' and use substring(from,to) to get the bit in the middle.

Something like:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    int dotIndex = unparsedText.indexOf(".");
    if (hyphenIndex > -1)
    {
        // take the text between the '.' and the '-', then drop the surrounding spaces
        newspaper = unparsedText.substring(dotIndex + 1, hyphenIndex).trim();
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

If it really is reliably in that format, the easiest (and likely most efficient) way to do this is to find the first '.' character and take the substring starting at dotIndex + 1. In fact, you could combine this with your current substring operation (based on the position of the dash) to extract the newspaper name in one go.

If the format is a little less reliable, you could use a regex to match digits followed by a separator character followed by whitespace, and remove that. But in this case, that seems like overkill.
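For the less reliable case, a minimal sketch of the regex approach might look like the following. It assumes the prefix is a run of digits, then exactly one non-word separator character (such as '.', ')' or ':'), then optional whitespace; the class and method names are hypothetical.

```java
final class LeadingNumberStripper {
    // Removes optional leading whitespace, a run of digits, one non-word
    // separator character, and any whitespace after it. Text without such
    // a prefix is returned unchanged.
    static String stripLeadingNumber(String text) {
        return text.replaceFirst("^\\s*\\d+\\W\\s*", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingNumber("1. New York Times"));
        System.out.println(stripLeadingNumber("2) Guardian"));
        System.out.println(stripLeadingNumber("10:   Le Monde"));
    }
}
```

This would be applied to the newspaper value after the existing hyphen-based substring, so the rest of the parsing code stays as it is.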

If the entries all follow the format you gave, you could look for the full stop after the number, e.g.

    int dotIndex = unparsedText.indexOf(".");

and then

    newspaper = unparsedText.substring(dotIndex + 2, hyphenIndex - 1);

Note that you want to start 2 characters after the '.' and stop 1 character before the '-' so that the surrounding spaces are excluded. Alternatively, take substring(dotIndex + 1, hyphenIndex) and call trim() on the result.

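    // Pattern has no public constructor, so use Pattern.compile; requiring a leading
    // letter stops the pattern from matching the empty string before the "1"
    java.util.regex.Matcher m = java.util.regex.Pattern.compile("[a-zA-Z][a-zA-Z ]*").matcher(unparsedText);
    if (m.find())
    {
        System.err.println(m.group().trim());
    }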

Note #1: assuming newspaper cannot contain numbers.

Note #2: haven't tested.

String#split(String regex) would work if you split on '.' and '-', for example with the regex "[.-]":

[0] => "1"
[1] => " New York Times "
[2] => " USA"

Then just trim the results you want.
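A short sketch of the split approach, assuming every line contains exactly one '.' before the name and one '-' before the country, as in the sample data (a line with fewer parts would throw ArrayIndexOutOfBoundsException here). The class name SplitOnDotAndHyphen and the method parse are illustrative only.

```java
final class SplitOnDotAndHyphen {
    // Returns { newspaper, country } for a line such as "1. New York Times - USA".
    static String[] parse(String unparsedText) {
        // "[.-]" is a character class matching either a dot or a hyphen
        String[] parts = unparsedText.split("[.-]");
        // parts[0] is the number; parts[1] and parts[2] still carry spaces
        return new String[] { parts[1].trim(), parts[2].trim() };
    }

    public static void main(String[] args) {
        String[] result = parse("1. New York Times - USA");
        System.out.println("Newspaper: " + result[0] + " Country: " + result[1]);
    }
}
```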

This regex should work:

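    // the dot after the number is escaped so it matches a literal '.'
    Pattern pattern = Pattern.compile("\\d+\\.\\s(.*)\\s-.*");
    Matcher matcher = pattern.matcher("1. New York Times - USA");
    // a match must be attempted before group() can be read
    Assert.assertTrue(matcher.matches());
    String newspaper = matcher.group(1);
    Assert.assertEquals("New York Times", newspaper);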

I would do it like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Application
{
    public static void main ( final String[] args )
    {
        final String[] lines = new String[] { "1. New York Times - USA", "2. Guardian - UK", "3. Le Monde - France" };

        final Pattern p = Pattern.compile ( "\\.\\s+(.*?)\\s+-\\s+(.*)" );

        for ( final String unparsedText : lines )
        {
            String newspaper;
            String country;

            final Matcher m = p.matcher ( unparsedText );

            if ( m.find () )
            {
                newspaper = m.group ( 1 );
                country = m.group ( 2 );

                System.out.println ( "Newspaper: " + newspaper + " Country: " + country );
            }
        }
    }
}
