
How to convert custom annotations to UIMA CAS structures and serialize them to XMI

I am having trouble converting custom annotated documents to UIMA CASes and then serializing them to XMI so that I can view the annotations in the UIMA annotation viewer GUI.

I am using uimaFIT to construct my components because it is easier to control, test and debug. The pipeline consists of three components:

  • A CollectionReader component that reads files containing raw text.
  • An Annotator component that converts the annotations in the custom documents to UIMA annotations.
  • A CasConsumer component that serializes the CASes to XMI.

My pipeline runs and outputs XMI files at the end, but without the annotations. I do not understand very clearly how the CAS objects get passed between the components. The annotator makes RESTful calls to certain endpoints, and I am trying to convert the annotation models it receives by using the client SDK provided by the service. The conversion logic of the Annotator component looks like this:

public class CustomDocumentToUimaCasConverter implements UimaCasConverter {
    private TypeSystemDescription tsd;

    private AnnotatedDocument startDocument;

    private ArrayFS annotationFeatureStructures;

    private int featureStructureArrayCapacity;

    public AnnotatedDocument getStartDocument() {
        return startDocument;
    }

    public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
        try {
            this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
        } catch (ResourceInitializationException e) {
            LOG.error("Error when creating default type system", e);
        }
        this.startDocument = startDocument;
    }


    public TypeSystemDescription getTypeSystemDescription() {
        return this.tsd;
    }

    @Override
    public void convertAnnotations(CAS cas) {
        Map<String, List<Annotation>> entities = this.startDocument.entities;
        int featureStructureArrayIndex = 0;

        inferCasTypeSystem(entities.keySet());
        try {
            /*
             * This is a hack allowing the CAS object to have an updated type system.
             * We are creating a new CAS by passing the new TypeSystemDescription which actually
             * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
             * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
             * is calling it internally in its implementation. The problem consists in the fact that now the
             * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
             * nothing changes!
             */
            LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);
        } catch (ResourceInitializationException e) {
            LOG.info("Error creating new CAS!", e);
        }

        TypeSystem typeSystem = cas.getTypeSystem();
        this.featureStructureArrayCapacity = entities.size();
        this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

        for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
            String annotationName = entityEntry.getKey();
            annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
            Type type = typeSystem.getType(annotationName);

            List<Annotation> annotations = entityEntry.getValue();
            LOG.info("Get Type -> " + type);
            for (Annotation ann : annotations) {
                AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                cas.addFsToIndexes(afs);
                if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                    resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                }
                annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
            }
        }
        cas.removeFsFromIndexes(annotationFeatureStructures);
        cas.addFsToIndexes(annotationFeatureStructures);
    }

    @Override
    public void inferCasTypeSystem(Iterable<String> originalTypes) {
        for (String typeName : originalTypes) {
            //UIMA Annotations are not allowed to contain dashes
            typeName = removeDashes(typeName);
            tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                    "Automatically generated type for " + typeName, "uima.tcas.Annotation");
            LOG.info("Inserted new type -> " + typeName);
        }
    }

    /**
     * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
     *
     * @param typeName the annotation name of the current annotation of the source document
     * @return the transformed annotation name suited for the UIMA typesystem
     */
    private String removeDashes(String typeName) {
        if (typeName.contains("-")) {
            typeName = typeName.replaceAll("-", "_");
        }
        return typeName;
    }

    @Override
    public void setSourceDocumentText(CAS cas) {
        cas.setSofaDataString(startDocument.text, "text/plain");
    }

    private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
        ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
        biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
        this.annotationFeatureStructures = biggerArrayFS;
        this.featureStructureArrayCapacity = annotationFeatureStructures.size();
    }
}

If someone has dealt with annotation conversions to UIMA types, I would appreciate some help.

I think your understanding of CASes and Annotations may be wrong:

From

* This is a hack allowing the CAS object to have an updated type system.

and

 LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);

I gather that you are trying to create a new CAS in your Annotator's process() method (I assume that the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Typically, the CollectionReader ingests raw data and fills a CAS for you in its getNext() method. That same CAS is passed down the whole UIMA pipeline, and all you need to do is add UIMA annotations to it.
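One likely reason the XMI files come out without annotations is plain Java semantics: assigning a new object to the `cas` parameter only rebinds the local variable, so everything added afterwards lands in an object the rest of the pipeline never sees. The sketch below illustrates this with a toy stand-in. `ToyCas` and `ReassignDemo` are hypothetical names for illustration and are not UIMA classes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a CAS: just a holder for annotation labels (not a UIMA class).
class ToyCas {
    final List<String> annotations = new ArrayList<>();
}

class ReassignDemo {
    // Mirrors the pattern in the question: the parameter is rebound to a fresh
    // object, so the annotation goes into an object the caller never sees.
    static void annotateByReplacing(ToyCas cas) {
        cas = new ToyCas();
        cas.annotations.add("Person");
    }

    // Mirrors the recommended pattern: mutate the object that was passed in.
    static void annotateInPlace(ToyCas cas) {
        cas.annotations.add("Person");
    }

    public static void main(String[] args) {
        ToyCas first = new ToyCas();
        annotateByReplacing(first);
        System.out.println(first.annotations.size()); // prints 0

        ToyCas second = new ToyCas();
        annotateInPlace(second);
        System.out.println(second.annotations.size()); // prints 1
    }
}
```

The same reasoning would apply to a real CAS: the consumer at the end of the pipeline serializes the object it was handed, not a replacement created inside an upstream method.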

For each annotation that you want to add, its type must be declared in a type system known to UIMA. If you use JCasGen and the code it generates, this should not be a problem. Make sure that your types can be autodetected as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531

This allows you to instantiate annotations as Java objects instead of using low-level FS calls. The following snippet adds an annotation over the whole document text. It should be straightforward to add logic that iterates over the tokens in the text and their ingested (non-UIMA) annotations (obtained from your web service).

@Override
public void process(JCas aJCas) throws AnalysisEngineProcessException {
    String text = aJCas.getDocumentText();
    SomeAnnotation a = new SomeAnnotation(aJCas);
    // Set the annotation properties; JCasGen should have
    // generated a setter for each feature.
    a.setSomePropertyValue(someValue);
    // Span the whole document text.
    a.setBegin(0);
    a.setEnd(text.length());
    // Add the annotation to the indexes.
    a.addToIndexes(aJCas);
}
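If the type names are derived from external entity labels, as the question's inferCasTypeSystem does, replacing dashes alone may not be enough. The sketch below is one way to normalize a label, assuming that a UIMA type name component must start with a letter and may otherwise contain only letters, digits and underscores. `TypeNames`, `toUimaTypeName` and the `org.example.types.` prefix are hypothetical; the value of the question's UIMA_ANNOTATION_TYPES_PACKAGE is not shown.

```java
class TypeNames {
    // Hypothetical package prefix standing in for UIMA_ANNOTATION_TYPES_PACKAGE.
    static final String PACKAGE_PREFIX = "org.example.types.";

    static String toUimaTypeName(String label) {
        // Replace every character that is not an ASCII letter, digit or underscore.
        String id = label.replaceAll("[^A-Za-z0-9_]", "_");
        // Assumed rule: the identifier must start with a letter.
        if (id.isEmpty() || !Character.isLetter(id.charAt(0))) {
            id = "T" + id;
        }
        return PACKAGE_PREFIX + id;
    }

    public static void main(String[] args) {
        System.out.println(toUimaTypeName("first-name")); // prints org.example.types.first_name
        System.out.println(toUimaTypeName("2nd place"));  // prints org.example.types.T2nd_place
    }
}
```

Whatever mapping is used, the same function should produce the name both when the type is declared and when it is looked up, so the two never drift apart.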

To avoid juggling begin and end string offsets yourself, I suggest you use a Token annotation (from DKPro Core, for example: https://dkpro.github.io/dkpro-core/ ) as the anchor point for your custom annotations.
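The anchoring idea can be sketched in plain Java: expand the offsets reported by the external service to the boundaries of the tokens they overlap. This is only an illustration under assumptions. The whitespace tokenizer stands in for real Token annotations, and `TokenAnchoring`, `tokenize` and `snapToTokens` are hypothetical names, not part of UIMA or DKPro Core.

```java
import java.util.ArrayList;
import java.util.List;

class TokenAnchoring {
    // Whitespace tokenizer returning {begin, end} offsets (end exclusive).
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int begin = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > begin) {
                spans.add(new int[] {begin, i});
            }
        }
        return spans;
    }

    // Expands [begin, end) to cover every token it overlaps; returns null if it overlaps none.
    static int[] snapToTokens(List<int[]> tokens, int begin, int end) {
        int snappedBegin = -1;
        int snappedEnd = -1;
        for (int[] token : tokens) {
            boolean overlaps = token[0] < end && begin < token[1];
            if (overlaps) {
                if (snappedBegin < 0) {
                    snappedBegin = token[0];
                }
                snappedEnd = token[1];
            }
        }
        return snappedBegin < 0 ? null : new int[] {snappedBegin, snappedEnd};
    }

    public static void main(String[] args) {
        String text = "Barack Obama visited Paris";
        int[] span = snapToTokens(tokenize(text), 2, 10);
        System.out.println(text.substring(span[0], span[1])); // prints Barack Obama
    }
}
```

In a real pipeline the token spans would come from Token annotations already in the CAS, and the snapped offsets would be passed to the custom annotation's setBegin and setEnd.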
