How to convert custom annotations to UIMA CAS structures and serialize them to XMI

I am having a problem converting custom annotated documents to UIMA CASes and then serializing them to XMI in order to view the annotations through the UIMA annotation viewer GUI.

I am using uimaFIT to construct my components because it is easier to control, test, and debug. The pipeline consists of three components:

  • CollectionReader component reading files with raw text.
  • Annotator component for converting annotations from the custom documents to UIMA annotations
  • CasConsumer component which serializes the CASes to XMI

My pipeline runs and outputs XMI files at the end, but they do not contain the annotations. I do not clearly understand how the CAS objects are passed between the components. The annotator makes RESTful calls to certain endpoints, and I am trying to convert the annotation models returned through the client SDK that the service provides. The conversion logic of the Annotator component looks like this:

public class CustomDocumentToUimaCasConverter implements UimaCasConverter {
    private TypeSystemDescription tsd;

    private AnnotatedDocument startDocument;

    private ArrayFS annotationFeatureStructures;

    private int featureStructureArrayCapacity;

    public AnnotatedDocument getStartDocument() {
        return startDocument;
    }

    public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
        try {
            this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
        } catch (ResourceInitializationException e) {
            LOG.error("Error when creating default type system", e);
        }
        this.startDocument = startDocument;
    }


    public TypeSystemDescription getTypeSystemDescription() {
        return this.tsd;
    }

    @Override
    public void convertAnnotations(CAS cas) {
        Map<String, List<Annotation>> entities = this.startDocument.entities;
        int featureStructureArrayIndex = 0;

        inferCasTypeSystem(entities.keySet());
        try {
            /*
             * This is a hack allowing the CAS object to have an updated type system.
             * We are creating a new CAS by passing the new TypeSystemDescription which actually
             * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
             * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
             * is calling it internally in its implementation. The problem consists in the fact that now the
             * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
             * nothing changes!
             */
            LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);
        } catch (ResourceInitializationException e) {
            LOG.info("Error creating new CAS!", e);
        }

        TypeSystem typeSystem = cas.getTypeSystem();
        this.featureStructureArrayCapacity = entities.size();
        this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

        for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
            String annotationName = entityEntry.getKey();
            annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
            Type type = typeSystem.getType(annotationName);

            List<Annotation> annotations = entityEntry.getValue();
            LOG.info("Get Type -> " + type);
            for (Annotation ann : annotations) {
                AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                cas.addFsToIndexes(afs);
                if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                    resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                }
                annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
            }
        }
        cas.removeFsFromIndexes(annotationFeatureStructures);
        cas.addFsToIndexes(annotationFeatureStructures);
    }

    @Override
    public void inferCasTypeSystem(Iterable<String> originalTypes) {
        for (String typeName : originalTypes) {
            //UIMA Annotations are not allowed to contain dashes
            typeName = removeDashes(typeName);
            tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                    "Automatically generated type for " + typeName, "uima.tcas.Annotation");
            LOG.info("Inserted new type -> " + typeName);
        }
    }

    /**
     * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
     *
     * @param typeName the annotation name of the current annotation of the source document
     * @return the transformed annotation name suited for the UIMA typesystem
     */
    private String removeDashes(String typeName) {
        if (typeName.contains("-")) {
            typeName = typeName.replaceAll("-", "_");
        }
        return typeName;
    }

    @Override
    public void setSourceDocumentText(CAS cas) {
        cas.setSofaDataString(startDocument.text, "text/plain");
    }

    private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
        ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
        biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
        this.annotationFeatureStructures = biggerArrayFS;
        this.featureStructureArrayCapacity = annotationFeatureStructures.size();
    }
}

If anyone has dealt with converting annotations to UIMA types, I would appreciate some help.

I think your understanding of CASes and Annotations may be wrong:

From

* This is a hack allowing the CAS object to have an updated type system.

and

 LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);

I gather that you are trying to create a new CAS in your annotator's process() method (I assume that the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Typically, the CollectionReader ingests raw data and fills a CAS for you in its getNext() method. This same CAS is passed down the whole UIMA pipeline, and all you need to do is add UIMA annotations to it.
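
To see why annotations added to a replacement CAS would not show up downstream, here is a minimal plain-Java sketch. It does not use UIMA at all: `FakeCas` is a hypothetical stand-in for a CAS, and both method names are made up for illustration. It only shows that reassigning a method parameter leaves the caller's object untouched, which is what `cas = CasCreationUtils.createCas(...)` does if `convertAnnotations` receives the pipeline's CAS.

```java
import java.util.ArrayList;
import java.util.List;

class ParameterReassignmentDemo {
    /** Hypothetical stand-in for a CAS: just a list of annotation labels. */
    static class FakeCas {
        final List<String> annotations = new ArrayList<>();
    }

    /** Mirrors the pattern in the question: replace the parameter, then annotate the replacement. */
    static void annotateReplacement(FakeCas cas) {
        cas = new FakeCas();            // only the local variable now points to the new object
        cas.annotations.add("Person");  // added to an object the caller never sees
    }

    /** Annotates the object that was actually passed in. */
    static void annotateInPlace(FakeCas cas) {
        cas.annotations.add("Person");
    }

    public static void main(String[] args) {
        FakeCas pipelineCas = new FakeCas();

        annotateReplacement(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 0

        annotateInPlace(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 1
    }
}
```

In the real pipeline the consumer serializes the CAS it was handed, so under this assumption only annotations added in place would reach the XMI file.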

The type of every annotation that you want to add must be declared in a type system known to UIMA. If you use JCasGen and the code it generates, this should not be a problem. Make sure that your types can be autodetected as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531
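
For reference, uimaFIT's type auto-detection reads a classpath resource named META-INF/org.apache.uima.fit/types.txt, which lists one location pattern per line. A minimal sketch of that file's content, assuming your type system descriptors live under a hypothetical desc/types directory on the classpath (adjust the path to wherever your XML descriptors actually are):

```
classpath*:desc/types/**/*.xml
```

With such a file in place, TypeSystemDescriptionFactory.createTypeSystemDescription() should pick up the declared types without any type system being built at runtime.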

This allows you to instantiate annotations as Java objects instead of using low-level FS (feature structure) calls. The following snippet adds an annotation over the whole document text. It should be straightforward to add logic that iterates over the tokens in the text and their ingested (non-UIMA) annotations from your web service.

@Override
public void process(JCas aJCas) throws AnalysisEngineProcessException {
    String text = aJCas.getDocumentText();
    SomeAnnotation a = new SomeAnnotation(aJCas);
    // set the annotation properties;
    // JCasGen should have generated a setter for each one
    a.setSomePropertyValue(someValue);
    // make the annotation span the whole document text
    a.setBegin(0);
    a.setEnd(text.length());
    // add your annotation to the indexes
    a.addToIndexes(aJCas);
}

To avoid juggling start and end String indexes yourself, I suggest you use a Token annotation (from DKPro Core, for example: https://dkpro.github.io/dkpro-core/ ) as the anchor point for your custom annotations.
