
Removing POS tags from a string

I have a string that looks like:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

I want to extract only the raw text and discard the POS tags. What regex can I use to do this? I know I can split on /, but I also need to remove the tags. Should I use a regex to identify the tags? The result I want is:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced "no evidence" that any irregularities took place .

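You can use String#replaceAll() with the pattern /.*?(?:\\s|$) to remove the POS tags. Each match covers the slash, the tag, and the whitespace (or end of string) that follows it, so the replacement is a single space. The following code should get you pretty close to what you want.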

String input = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
input = input.replaceAll("/.*?(?:\\s|$)", " ");
System.out.println(input);

Output:

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced  no evidence '' that any irregularities took place . 
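As a possible refinement of the approach above, here is a minimal sketch that matches only the slash and the non-whitespace tag after it (`/\S+`), so the original spacing is left alone and no extra or trailing spaces are introduced. The class name `PosTagRegex` and the method `strip` are hypothetical names chosen for illustration. It assumes every tag is a run of non-whitespace characters directly after the slash; a lone "/" followed by a space would be left in place.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class PosTagRegex {
    // Compiled once so the pattern can be reused across many lines.
    private static final Pattern TAG = Pattern.compile("/\\S+");

    // Removes every "/tag" suffix and leaves the whitespace untouched.
    static String strip(String line) {
        return TAG.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        String input = "The/at jury/nn said/vbd ``/`` no/at evidence/nn ''/'' ./.";
        // prints: The jury said `` no evidence '' .
        System.out.println(strip(input));
    }
}
```

Because the whitespace is never consumed, the output keeps exactly one space between words whenever the input did.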

This is what I quickly wrote to extract the required string. Are there any better or more efficient approaches? I need to run this over a large amount of data.

public class RemovePosTags {

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();

        String str = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.";
        // Tokens are separated by single spaces; each has the form word/tag.
        String[] tokens = str.split(" ");
        for (String token : tokens) {
            int index = token.indexOf('/');
            // Keep a token without a tag as-is; substring(0, -1) would throw.
            String word = (index >= 0) ? token.substring(0, index) : token;
            sb.append(word);
            sb.append(" ");
        }
        System.out.println(sb);
    }
}
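For large inputs, one alternative worth measuring is a single pass over the characters, which avoids both the regex engine and the intermediate array created by split. The sketch below is an illustration under the assumption that a tag starts at the first '/' of a token and runs to the next whitespace character; the class name `PosTagScanner` and the method `stripTags` are hypothetical. Whether it is actually faster on your data is something to benchmark rather than assume.

```java
// Hypothetical helper illustrating a single-pass scan; not from the original post.
final class PosTagScanner {

    // Copies each token up to its first '/' and skips the tag that follows.
    static String stripTags(String line) {
        StringBuilder out = new StringBuilder(line.length());
        boolean inTag = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                inTag = false;   // whitespace ends the current token
                out.append(c);   // keep the original separator
            } else if (c == '/') {
                inTag = true;    // everything up to the next whitespace is the tag
            } else if (!inTag) {
                out.append(c);   // part of the word itself
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String str = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl ./.";
        // prints: The Fulton County Grand Jury .
        System.out.println(stripTags(str));
    }
}
```

Unlike the split-based loop, this version leaves no trailing space, and a token without any slash is copied through unchanged.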

