Java regex to find all words in a split URL path

I have a URL path that I split on "/". For example, the complete URL is https://www.uni.it/it/ateneo-org_plot-pesc/organ/organi-amm/rettore-o_0-rect and the split path looks like this:

it
ateneo-org_plot-
organ
organi-amm
rettore-o_0-rect

The output I want is:

it
ateneo
org
plot
organ
organi
amm
rettore
o
0
rect

I tried something like this:

public static List<String> extractAllWordsFromUrlPath(String link) {
    List<String> splittedUrlPath = splitLinkPath(link);
    List<String> urlWords = new ArrayList<String>();
    if(splittedUrlPath!=null && splittedUrlPath.size()>0) {
        Pattern linkWordsPattern = Pattern.compile("[-_]?[a-z]+[-_]?");
        for(String sPath: splittedUrlPath) {
            Matcher lwpm = linkWordsPattern.matcher(sPath);
            while(lwpm.find()) {
                urlWords.add(lwpm.group());
            }
        }
    }

    return urlWords;
}

One approach is to drop everything before the path (the scheme and host), then split the remaining path on the character class [/_-]:

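String url = "https://www.uni.it/it/ateneo-org_plot-pesc/organ/organi-amm/rettore-o_0-rect";
URL theURL = new URL(url);   // java.net.URL; the constructor throws MalformedURLException
String path = theURL.getPath();
String[] parts = path.split("[/_-]");

for (String part : parts) {
    // the path starts with "/", so the first element is an empty string; skip it
    if (!part.isEmpty()) {
        System.out.print(part + " ");
    }
}

it ateneo org plot pesc organ organi amm rettore o 0 rect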

Note that I used java.net.URL to extract the path from the input URL. We could also try extracting the path with a regex, but that might be error-prone or fail to cover all possible kinds of URLs.
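Since the question asks for a regex, here is a minimal sketch of that variant for comparison. It still relies on the standard library for the path (java.net.URI here, to avoid the checked exception) and uses a Matcher only to pick out the words. The class name UrlPathWords, the method name extractWords, and the pattern [A-Za-z0-9]+ are assumptions made for illustration; widen the character class if your paths contain other word characters.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlPathWords {

    // Assumed definition of a "word": one or more ASCII letters or digits.
    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Hypothetical helper: returns every word found in the path part of the link.
    public static List<String> extractWords(String link) {
        String path = URI.create(link).getPath();
        List<String> words = new ArrayList<>();
        if (path == null) {
            return words; // opaque URIs (e.g. mailto:) have no path
        }
        Matcher m = WORD.matcher(path);
        while (m.find()) {
            words.add(m.group()); // the match contains no delimiters
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords(
            "https://www.uni.it/it/ateneo-org_plot-pesc/organ/organi-amm/rettore-o_0-rect"));
    }
}
```

Unlike the pattern in the question, the delimiters are not part of the match, so group() returns bare words, and digits such as 0 are matched as well.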

Here is my answer, with the emphasis on making minimal changes to your code. Note that this code isn't production-ready and needs rethinking in several places, including the use of a static method and the exception handling, but it should serve as a prototype (which I presume your snippet is as well). It is also written so that you can easily step through it in a debugger.

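import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public static List<String> extractAllWordsFromUrlPath(String link) throws MalformedURLException {

    String path = new URL(link).getPath();
    String regex = "[/_-]";  // could come from a config file or a method argument
    String[] extractedWords = path.split(regex);
    List<String> result = Arrays.asList(extractedWords);

    // drop the empty strings produced by the leading "/" and by consecutive delimiters
    return result.stream().filter(w -> (w != null && w.length() > 0)).collect(Collectors.toList());
}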

The method returns a List to match your original signature. Note that streams are a Java 8 feature, and parts of this code may feel over-engineered, for example the filter that guards against null strings (String.split never produces null elements, so only the empty-string check does real work). Also keep in mind that Arrays.asList() returns a fixed-size list backed by the array: elements can be replaced with set(), but add() and remove() throw UnsupportedOperationException. That matters if you ever use it to convert an array to a list in other parts of your code.
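To make the Arrays.asList() caveat concrete: the returned list is fixed-size rather than fully immutable. The following is a small sketch of that behavior; the class name AsListDemo and the method addIsRejected are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsListDemo {

    // Returns true if add() on a list from Arrays.asList is rejected.
    public static boolean addIsRejected() {
        List<String> fixed = Arrays.asList("it", "ateneo");
        try {
            fixed.add("org"); // structural change on a fixed-size list
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String[] words = {"it", "ateneo"};
        List<String> view = Arrays.asList(words);

        // Replacing an element is allowed and writes through to the array.
        view.set(0, "IT");
        System.out.println(words[0]);

        // Adding an element is not allowed.
        System.out.println(addIsRejected());

        // Copy into an ArrayList when a resizable list is needed.
        List<String> copy = new ArrayList<>(view);
        copy.add("org");
        System.out.println(copy);
    }
}
```

The list produced by Collectors.toList() in the method above is a separate list, so this restriction only applies where the Arrays.asList() result is used directly.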

You can verify this code by calling it from another method and printing the result with a for(String word : parsedWords) loop. It builds on @Tim Bergenstein's solution, which I also upvoted: his answer gives a great basis, and I just expanded it to filter out empty strings and null values, declare the exception, and follow some naming conventions:

// code for another method (e.g. main in your main class), just for testing
List<String> parsedWords = extractAllWordsFromUrlPath("http://www.google.com/asd/asd/dfg/kjg");
for(String word: parsedWords) {
    System.out.println(word + " ");
}


 