What's the simplest way to remove extraneous leading numbers?

Question

I have data that is reliably in this format:

    1. New York Times - USA
    2. Guardian - UK
    3. Le Monde - France

I'm using this code to parse out the newspaper and country values:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    if (hyphenIndex > -1)
    {
        newspaper = unparsedText.substring(0, hyphenIndex);
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

But this produces newspaper values of:

    1. New York Times
    2. Guardian
    3. Le Monde

What's the simplest change to make to end up with newspaper values of:

    New York Times
    Guardian
    Le Monde

Answer 1

Here is a regex based solution:

    String result = input.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");

The regex works in multiline mode (?m) and deletes:

Leading digit(s) followed by a dot and any amount of whitespace.
A hyphen, with any surrounding whitespace, followed by everything up to the end of the line.

I'm assuming there are no hyphens in the newspaper name.
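
As a minimal sketch of how that call behaves on the sample data, here it is applied one line at a time. The class name StripWithReplaceAll and the helper newspaperOf are made up for illustration and are not part of the original answer.

```java
public class StripWithReplaceAll {
    // Hypothetical helper: removes the leading "N. " prefix and the trailing
    // " - Country" part, leaving only the newspaper name.
    // Assumes, as stated above, that the newspaper name contains no hyphen.
    static String newspaperOf(String line) {
        return line.replaceAll("(?m)^\\d+\\.\\s*|\\s*-\\s*.*?$", "");
    }

    public static void main(String[] args) {
        String[] lines = { "1. New York Times - USA", "2. Guardian - UK", "3. Le Monde - France" };
        for (String line : lines) {
            System.out.println(newspaperOf(line));
        }
    }
}
```

Because String is immutable, replaceAll returns a new string, so the result has to be assigned or used directly.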


Answer 2

Surely just find the index of the first '.' and use substring(from,to) to get the bit in the middle.

Something like:

    String newspaper = "";
    String country = "";
    int hyphenIndex = unparsedText.indexOf("-");
    int dotIndex = unparsedText.indexOf(".");
    if (hyphenIndex > -1)
    {
        // trim() removes the space after the dot and the space before the hyphen
        newspaper = unparsedText.substring(dotIndex + 1, hyphenIndex).trim();
    }
    country = unparsedText.substring(hyphenIndex + 1, unparsedText.length());
    country = country.trim();

Answer 3

If it really is reliably in that format, the easiest (and likely most efficient) way to do this would be to find the first instance of the . character, and then take a substring starting from dotIndex + 1. In fact, you could combine this with your current substring operation (based on the position of the dash) to extract the newspaper name in one go.

If the format is a little less reliable, you could use a regex to match digits followed by a separator character followed by whitespace, and remove that. But in this case, that seems like overkill.
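
For the less reliable case, a rough sketch of the regex idea might look like the following. The set of separator characters ('.', ')' and ':') is an assumption made for this example, as are the class and method names.

```java
public class StripLeadingNumber {
    // Hypothetical helper: drops optional leading whitespace, a run of digits,
    // one separator character (assumed here to be '.', ')' or ':'),
    // and any whitespace that follows. Text without such a prefix is unchanged.
    static String stripLeadingNumber(String text) {
        return text.replaceFirst("^\\s*\\d+[.):]\\s*", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingNumber("1. New York Times"));
        System.out.println(stripLeadingNumber("12) Guardian"));
        System.out.println(stripLeadingNumber("Le Monde"));
    }
}
```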

Answer 4

If the entries all follow the format you gave, you could look for the full stop after the number, e.g.

    int dotIndex = unparsedText.indexOf(".");

and then

    newspaper = unparsedText.substring(dotIndex + 2, hyphenIndex - 1);

Note that you want to start 2 characters after the . and exclude the 1 space before the -, or else use trim().

Answer 5

    // Pattern has no public constructor, so use Pattern.compile; "+" avoids an empty match at index 0
    java.util.regex.Matcher m = java.util.regex.Pattern.compile("[a-zA-Z ]+").matcher(unparsedText);
    if (m.find())
    {
        System.err.println(m.group().trim());
    }

Note #1: assuming newspaper cannot contain numbers.

Note #2: haven't tested.

Answer 6

String#split(String regex) would work if you split on . and -.

[0] => "1"
[1] => " New York Times "
[2] => " USA"

Then just trim the results you want.
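
A small sketch of that approach, assuming neither the newspaper nor the country contains a . or - of its own. The class name SplitOnDotAndHyphen and the parse helper are illustrative only.

```java
public class SplitOnDotAndHyphen {
    // Hypothetical helper: splits "N. Newspaper - Country" on '.' and '-'
    // and returns {newspaper, country} with surrounding whitespace trimmed.
    // Assumes the input has exactly one '.' and one '-' acting as separators.
    static String[] parse(String unparsedText) {
        String[] parts = unparsedText.split("[.-]");
        return new String[] { parts[1].trim(), parts[2].trim() };
    }

    public static void main(String[] args) {
        String[] result = parse("1. New York Times - USA");
        System.out.println("Newspaper: " + result[0] + " Country: " + result[1]);
    }
}
```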

Answer 7

This regex should work:

    Pattern pattern = Pattern.compile("\\d+\\.\\s(.*)\\s-.*");
    Matcher matcher = pattern.matcher("1. New York Times - USA");
    // a match must be attempted before any group can be read
    Assert.assertTrue(matcher.matches());
    String newspaper = matcher.group(1);
    Assert.assertEquals("New York Times", newspaper);

Answer 8

I would do it like this:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Application
    {
        public static void main ( final String[] args )
        {
            final String[] lines = new String[] { "1. New York Times - USA", "2. Guardian - UK", "3. Le Monde - France" };

            final Pattern p = Pattern.compile ( "\\.\\s+(.*?)\\s+-\\s+(.*)" );

            for ( final String unparsedText : lines )
            {
                String newspaper;
                String country;

                final Matcher m = p.matcher ( unparsedText );

                if ( m.find () )
                {
                    newspaper = m.group ( 1 );
                    country = m.group ( 2 );

                    System.out.println ( "Newspaper: " + newspaper + " Country: " + country );
                }
            }
        }
    }

8 answers

solution1
4 ACCEPTED 2010-11-11 15:24:54

solution2
2 2010-11-11 15:12:37

solution3
1 2010-11-11 15:13:04

solution4
1 2010-11-11 15:14:21

solution5
1 2010-11-11 15:16:49

solution6
1 2010-11-11 15:22:12

solution7
1 2010-11-11 15:27:47

solution8
1 2010-11-11 15:30:02
