
How to train Stanford CRF NER from a tsv file

I want to train my own model. For example, I need to run this string through my trained model: "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q"

The tsv file looks like this:

Toyota	PERS
Land	PERS

When I run it through the programme:

import java.util.List;

import edu.stanford.nlp.ie.NERClassifierCombiner;
import edu.stanford.nlp.ling.CoreAnnotations.AnswerAnnotation;
import edu.stanford.nlp.ling.CoreLabel;

public class NerDemo { // class name is arbitrary
    public static void main(String[] args) {
        // Path to the model trained from the tsv file
        String serializedClassifier2 = "C:/standford-ner/ner-model.ser.gz";
        try {
            // No numeric classifiers, no SUTime; load only the custom model
            NERClassifierCombiner classifier = new NERClassifierCombiner(false, false,
                    serializedClassifier2);
            String ss = "Book of 49 Magazine Articles on Toyota Land Cruiser 1956-1987 Gold Portfolio http://t.co/EqxmY1VmLg http://t.co/F0Vefuoj9Q";
            System.out.println("---");
            List<List<CoreLabel>> out = classifier.classify(ss);
            for (List<CoreLabel> sentence : out) {
                for (CoreLabel word : sentence) {
                    // Print each token with the label the classifier assigned
                    System.out.print(word.word() + '/' + word.get(AnswerAnnotation.class) + ' ');
                }
                System.out.println();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Here is the output I am getting:

Book/PERS of/PERS 49/O Magazine/PERS Articles/PERS on/O Toyota/PERS Land/PERS Cruiser/O 1956-1987/PERS Gold/PERS Portfolio/PERS http://t.co/EqxmY1VmLg/PERS http://t.co/F0Vefuoj9Q/PERS 

To me, this output is wrong: I expect Book/O of/O. I am not sure where these labels come from, since e.g. "Book" is not mentioned in my tsv file. Words that I have not listed in the tsv file should come out as O. This tsv file is just the beginning; I have many more words to add.

You've given the classifier training data where 100% of the data is one class: PERS. Since 100% of your training data is that class, it's going to give you back 100% of assignments to that class.

For the algorithm, O is simply another class. You've given it no examples of O, so it will classify nothing as O.

The Stanford NER CRF FAQ gives an example of training data:

CHAPTER	O
I	O
Emma	PERS
Woodhouse	PERS
,	O
handsome	O
,	O
clever	O
,	O
and	O
rich	O
,	O
with	O
a	O
comfortable	O
home	O
...

As you can see, every token is tagged in situ, with plenty of examples of the O class. I'm not entirely familiar with the inner workings of the CRF classifier, but I suspect you need to give it real text, tagged appropriately, not just a list of members of your target classes.
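As a rough sketch of what "tag every token" means in practice, the snippet below turns a raw sentence into token-per-line rows and gives every token a label, falling back to O. The class TsvTrainingDataSketch and the method toTsvRows are hypothetical names, not part of Stanford NER, and splitting on whitespace is an assumption; the real tokenizer splits differently, and labels produced from a word list should still be checked by hand before training.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Stanford NER: builds token-per-line
// training rows from a raw sentence so that every token carries a label.
class TsvTrainingDataSketch {

    // Tokens present in entityLabels get that label; all others get "O".
    // Assumption: whitespace tokenization is good enough for a first draft.
    static List<String> toTsvRows(String sentence, Map<String, String> entityLabels) {
        List<String> rows = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // an empty input yields one empty token; skip it
            }
            String label = entityLabels.getOrDefault(token, "O");
            rows.add(token + "\t" + label);
        }
        return rows;
    }

    public static void main(String[] args) {
        Map<String, String> entityLabels = Map.of("Toyota", "PERS", "Land", "PERS");
        String sentence = "Book of 49 Magazine Articles on Toyota Land Cruiser";
        for (String row : toTsvRows(sentence, entityLabels)) {
            System.out.println(row);
        }
    }
}
```

Rows such as Book, of and 49 now appear with the label O, which gives the classifier examples of the O class in context.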

That raises another question, though: if you simply want to match strings for this task, why use NER at all? Why not just match strings? If that's your goal, skipping the sophisticated NLP will save you serious headaches. You'll get results faster, and they will be easier to tweak by hand.
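If plain string matching really is the goal, a minimal sketch could look like the following. DictionaryTaggerSketch and tag are hypothetical names, and the exact, case-sensitive, whitespace-token lookup is an assumption chosen to keep the example short; no Stanford NER classes are involved.

```java
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical dictionary tagger: labels a token only if it appears in the
// lookup table, and labels everything else "O". No model is trained.
class DictionaryTaggerSketch {

    // Returns the text as space-separated token/label pairs.
    static String tag(String text, Map<String, String> labels) {
        StringJoiner out = new StringJoiner(" ");
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input produces a single empty token
            }
            // Exact, case-sensitive match; anything unknown falls back to "O"
            out.add(token + "/" + labels.getOrDefault(token, "O"));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> labels = Map.of("Toyota", "PERS", "Land", "PERS");
        System.out.println(tag("Book of 49 Magazine Articles on Toyota Land Cruiser", labels));
    }
}
```

For the sample sentence this prints Book/O of/O 49/O Magazine/O Articles/O on/O Toyota/PERS Land/PERS Cruiser/O, and adding more words is just a matter of adding entries to the map.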
