How to convert custom annotations to UIMA CAS structures and serialize them to XMI

Question

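I'm having trouble converting custom annotated documents to UIMA CASes and then serializing them to XMI so that I can view the annotations in the UIMA Annotation Viewer GUI.

I'm using uimaFIT to build my components because it makes them easier to control, test and debug. The pipeline consists of 3 components:

A CollectionReader component that reads files containing raw text.
An Annotator component that converts the annotations from the custom documents into UIMA annotations.
A CasConsumer component that serializes the CASes to XMI.

My pipeline runs to the end and outputs XMI files, but they contain no annotations. I don't clearly understand how the CAS objects are passed between the components. The annotator logic makes RESTful calls to certain endpoints, and I'm trying to convert the annotation model using the client SDK provided by the service. The conversion logic of the Annotator component looks like this: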

public class CustomDocumentToUimaCasConverter implements UimaCasConverter {
    private TypeSystemDescription tsd;

    private AnnotatedDocument startDocument;

    private ArrayFS annotationFeatureStructures;

    private int featureStructureArrayCapacity;

    public AnnotatedDocument getStartDocument() {
        return startDocument;
    }

    public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
        try {
            this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
        } catch (ResourceInitializationException e) {
            LOG.error("Error when creating default type system", e);
        }
        this.startDocument = startDocument;
    }


    public TypeSystemDescription getTypeSystemDescription() {
        return this.tsd;
    }

    @Override
    public void convertAnnotations(CAS cas) {
        Map<String, List<Annotation>> entities = this.startDocument.entities;
        int featureStructureArrayIndex = 0;

        inferCasTypeSystem(entities.keySet());
        try {
            /*
             * This is a hack allowing the CAS object to have an updated type system.
             * We are creating a new CAS by passing the new TypeSystemDescription which actually
             * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
             * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
             * is calling it internally in its implementation. The problem consists in the fact that now the
             * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
             * nothing changes!
             */
            LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);
        } catch (ResourceInitializationException e) {
            LOG.info("Error creating new CAS!", e);
        }

        TypeSystem typeSystem = cas.getTypeSystem();
        this.featureStructureArrayCapacity = entities.size();
        this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

        for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
            String annotationName = entityEntry.getKey();
            annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
            Type type = typeSystem.getType(annotationName);

            List<Annotation> annotations = entityEntry.getValue();
            LOG.info("Get Type -> " + type);
            for (Annotation ann : annotations) {
                AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                cas.addFsToIndexes(afs);
                if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                    resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                }
                annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
            }
        }
        cas.removeFsFromIndexes(annotationFeatureStructures);
        cas.addFsToIndexes(annotationFeatureStructures);
    }

    @Override
    public void inferCasTypeSystem(Iterable<String> originalTypes) {
        for (String typeName : originalTypes) {
            //UIMA Annotations are not allowed to contain dashes
            typeName = removeDashes(typeName);
            tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                    "Automatically generated type for " + typeName, "uima.tcas.Annotation");
            LOG.info("Inserted new type -> " + typeName);
        }
    }

    /**
     * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
     *
     * @param typeName the annotation name of the current annotation of the source document
     * @return the transformed annotation name suited for the UIMA typesystem
     */
    private String removeDashes(String typeName) {
        if (typeName.contains("-")) {
            typeName = typeName.replaceAll("-", "_");
        }
        return typeName;
    }

    @Override
    public void setSourceDocumentText(CAS cas) {
        cas.setSofaDataString(startDocument.text, "text/plain");
    }

    private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
        ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
        biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
        this.annotationFeatureStructures = biggerArrayFS;
        this.featureStructureArrayCapacity = annotationFeatureStructures.size();
    }
}

I'd appreciate help from anyone who has dealt with converting annotations to UIMA types.

Answer 1

I think your understanding of CASes and annotations might be wrong:

From

* This is a hack allowing the CAS object to have an updated type system.

and

 LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);

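I guess you are trying to create a new CAS inside the process() method of an Annotator (I assume the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Typically, the collection reader ingests the raw data and creates the CAS for you in its getNext() method. This CAS is passed down the whole UIMA pipeline, and all you need to do is add UIMA annotations to it.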
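
The empty XMI is consistent with plain Java semantics: assigning a new object to the `cas` parameter only rebinds the local variable, so anything added afterwards ends up in an object the pipeline never sees. Below is a minimal sketch of that effect. `FakeCas` is a hypothetical stand-in (not a UIMA class) so the example runs without UIMA on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

public class ParameterRebindingSketch {

    /** Hypothetical stand-in for a CAS: it only holds annotation labels. */
    public static class FakeCas {
        public final List<String> annotations = new ArrayList<>();
    }

    /** Mirrors the pattern in the question: rebind the parameter, then annotate. */
    public static void convertIntoNewCas(FakeCas cas) {
        cas = new FakeCas();            // the caller's object is no longer referenced here
        cas.annotations.add("Person");  // lands in the throwaway object
    }

    /** Annotates the object the caller actually passed in. */
    public static void convertInPlace(FakeCas cas) {
        cas.annotations.add("Person");
    }

    public static void main(String[] args) {
        FakeCas pipelineCas = new FakeCas();

        convertIntoNewCas(pipelineCas);
        System.out.println("after rebinding: " + pipelineCas.annotations.size()); // 0

        convertInPlace(pipelineCas);
        System.out.println("after in-place:  " + pipelineCas.annotations.size()); // 1
    }
}
```

The same reasoning applies to a real CAS: the consumer at the end of the pipeline serializes the CAS that the reader created, not one created locally inside a component.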

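For each annotation you want to add, UIMA needs to know its type, i.e. the type must be part of the type system. If you use JCasGen and the code it generates, this should not be a problem. Make sure your types can be auto-detected as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531.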
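
For orientation, uimaFIT's automatic type detection is driven by a small text file on the classpath, `META-INF/org.apache.uima.fit/types.txt`, which lists patterns pointing at type system descriptor XML files. A sketch of its content follows; the `desc/types/` location is an assumption for illustration and should be adjusted to wherever your descriptors actually live:

```
classpath*:desc/types/**/*.xml
```

In a Maven layout this file would typically sit under `src/main/resources/`, so that it ends up on the classpath at runtime.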

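This allows you to instantiate annotations as Java objects instead of using low-level FS calls. The following snippet adds an annotation over the whole document text. Adding iteration logic over the tokens in the text and the (non-UIMA) annotations ingested from your web service should be trivial.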

@Override
public void process(JCas aJCas) throws AnalysisEngineProcessException {
    String text = aJCas.getDocumentText();
    SomeAnnotation a = new SomeAnnotation(aJCas);
    // set the annotation properties; for each property,
    // JCasGen should have generated a setter
    a.setSomePropertyValue(someValue);
    // span the whole document text
    a.setBegin(0);
    a.setEnd(text.length());
    // add your annotation to the indexes
    a.addToIndexes(aJCas);
}

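To avoid messing with begin and end String indexes, I suggest you use some Token annotations (from DKPro Core, for instance: https://dkpro.github.io/dkpro-core/) that you can use as anchor points for your custom annotations.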
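
The anchoring idea can be sketched in plain Java: widen a raw character span from the external service to the boundaries of the tokens it overlaps. The `Span` class and `snapToTokens` helper are hypothetical names introduced for illustration. With DKPro Core, the token spans would come from its Token annotations rather than from a hand-built list.

```java
import java.util.Arrays;
import java.util.List;

public class TokenAnchorSketch {

    /** Half-open character span [begin, end). Hypothetical helper type. */
    public static final class Span {
        public final int begin;
        public final int end;

        public Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /**
     * Widens a raw span to the outer boundaries of every token it overlaps.
     * Returns null when the span overlaps no token at all.
     */
    public static Span snapToTokens(Span raw, List<Span> tokens) {
        int begin = Integer.MAX_VALUE;
        int end = Integer.MIN_VALUE;
        for (Span t : tokens) {
            boolean overlaps = t.begin < raw.end && raw.begin < t.end;
            if (overlaps) {
                begin = Math.min(begin, t.begin);
                end = Math.max(end, t.end);
            }
        }
        return begin == Integer.MAX_VALUE ? null : new Span(begin, end);
    }

    public static void main(String[] args) {
        String text = "Barack Obama spoke";
        // Token spans as a tokenizer might produce them for the text above.
        List<Span> tokens = Arrays.asList(new Span(0, 6), new Span(7, 12), new Span(13, 18));

        // A sloppy span from an external service that cuts through two tokens.
        Span snapped = snapToTokens(new Span(2, 10), tokens);
        System.out.println(text.substring(snapped.begin, snapped.end)); // Barack Obama
    }
}
```

The snapped begin and end values are what would then be passed to setBegin and setEnd on the custom annotation.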

Solution 1
0 2015-03-06 07:55:33