How to convert custom annotations to UIMA CAS structures and serialize them to XMI

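I am having trouble converting custom annotated documents to UIMA CASes and serializing them to XMI so that I can view the annotations in the UIMA annotation viewer GUI.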

I am using uimaFIT to build my components because it is easier to control, test, and debug. The pipeline consists of 3 components:

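  • A CollectionReader component that reads files containing the raw text.
  • An Annotator component that converts the annotations of the custom documents into UIMA annotations.
  • A CasConsumer component that serializes the CASes to XMI.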

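My pipeline runs and outputs the XMI files at the end, but without any annotations. I do not clearly understand how the CAS objects are passed between the components. The annotator makes RESTful calls to certain endpoints, and I am trying to convert the annotation models returned by the client SDK that the service provides. The conversion logic of the Annotator component looks like this: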

public class CustomDocumentToUimaCasConverter implements UimaCasConverter {
    private TypeSystemDescription tsd;

    private AnnotatedDocument startDocument;

    private ArrayFS annotationFeatureStructures;

    private int featureStructureArrayCapacity;

    public AnnotatedDocument getStartDocument() {
        return startDocument;
    }

    public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
        try {
            this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
        } catch (ResourceInitializationException e) {
            LOG.error("Error when creating default type system", e);
        }
        this.startDocument = startDocument;
    }


    public TypeSystemDescription getTypeSystemDescription() {
        return this.tsd;
    }

    @Override
    public void convertAnnotations(CAS cas) {
        Map<String, List<Annotation>> entities = this.startDocument.entities;
        int featureStructureArrayIndex = 0;

        inferCasTypeSystem(entities.keySet());
        try {
            /*
             * This is a hack allowing the CAS object to have an updated type system.
             * We are creating a new CAS by passing the new TypeSystemDescription which actually
             * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
             * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
             * is calling it internally in its implementation. The problem consists in the fact that now the
             * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
             * nothing changes!
             */
            LOG.info("Creating new CAS with updated typesystem...");
            cas = CasCreationUtils.createCas(tsd, null, null);
        } catch (ResourceInitializationException e) {
            LOG.info("Error creating new CAS!", e);
        }

        TypeSystem typeSystem = cas.getTypeSystem();
        this.featureStructureArrayCapacity = entities.size();
        this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

        for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
            String annotationName = entityEntry.getKey();
            annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
            Type type = typeSystem.getType(annotationName);

            List<Annotation> annotations = entityEntry.getValue();
            LOG.info("Get Type -> " + type);
            for (Annotation ann : annotations) {
                AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                cas.addFsToIndexes(afs);
                if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                    resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                }
                annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
            }
        }
        cas.removeFsFromIndexes(annotationFeatureStructures);
        cas.addFsToIndexes(annotationFeatureStructures);
    }

    @Override
    public void inferCasTypeSystem(Iterable<String> originalTypes) {
        for (String typeName : originalTypes) {
            //UIMA Annotations are not allowed to contain dashes
            typeName = removeDashes(typeName);
            tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                    "Automatically generated type for " + typeName, "uima.tcas.Annotation");
            LOG.info("Inserted new type -> " + typeName);
        }
    }

    /**
     * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
     *
     * @param typeName the annotation name of the current annotation of the source document
     * @return the transformed annotation name suited for the UIMA typesystem
     */
    private String removeDashes(String typeName) {
        if (typeName.contains("-")) {
            typeName = typeName.replaceAll("-", "_");
        }
        return typeName;
    }

    @Override
    public void setSourceDocumentText(CAS cas) {
        cas.setSofaDataString(startDocument.text, "text/plain");
    }

    private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
        ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
        biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
        this.annotationFeatureStructures = biggerArrayFS;
        this.featureStructureArrayCapacity = annotationFeatureStructures.size();
    }
}

If anyone has dealt with converting annotations to UIMA types, I would appreciate any help.

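I think your understanding of CASes and annotations might be wrong. From this comment in your code:

* This is a hack allowing the CAS object to have an updated type system.

and from these lines:

LOG.info("Creating new CAS with updated typesystem...");
cas = CasCreationUtils.createCas(tsd, null, null);

I gather that you are trying to create a new CAS inside your Annotator's process() method (I assume the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Because the new CAS exists only inside that method, the annotations added to it never reach the CasConsumer, which serializes the original, still unannotated CAS. Typically, the CollectionReader ingests the raw data and creates a CAS for you in its getNext() method. That CAS is passed down the entire UIMA pipeline, and all you need to do is add UIMA annotations to it.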
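The effect can be reproduced without UIMA at all, because Java passes object references by value: assigning a new object to a method parameter only rebinds the local variable. The sketch below is plain Java and uses a hypothetical stand-in class, FakeCas (not a UIMA class), to contrast the hack with mutating the object that was passed in.

```java
import java.util.ArrayList;
import java.util.List;

final class ReassignDemo {
    // Hypothetical stand-in for a CAS: just a list of annotation labels.
    // This is NOT a UIMA class; it only illustrates the reference semantics.
    static final class FakeCas {
        final List<String> annotations = new ArrayList<>();
    }

    // Mirrors the hack: the parameter is rebound to a new object,
    // so the caller's object is never modified.
    static void convertIntoNewCas(FakeCas cas) {
        cas = new FakeCas();
        cas.annotations.add("Person[0,5]");
    }

    // Mirrors the recommended approach: mutate the object that was passed in.
    static void convertIntoGivenCas(FakeCas cas) {
        cas.annotations.add("Person[0,5]");
    }

    public static void main(String[] args) {
        FakeCas pipelineCas = new FakeCas();

        convertIntoNewCas(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 0

        convertIntoGivenCas(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 1
    }
}
```

The first call corresponds to what your converter does today: the consumer downstream still holds the original object, which is why the XMI comes out without annotations.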

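For each annotation that you want to add, its type has to be known to UIMA through the type system. If you use JCasGen and the code it generates, this should not be a problem. Make sure that your types can be auto-detected as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531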
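For reference, uimaFIT's auto-detection is driven by a small text file on the classpath, META-INF/org.apache.uima.fit/types.txt, that lists where the type system descriptors live. A minimal sketch of its content, assuming your descriptors are stored under a desc/types/ directory on the classpath (that directory name is only an illustration, use your own layout):

```
classpath*:desc/types/**/*.xml
```

With that file in place, the no-argument TypeSystemDescriptionFactory.createTypeSystemDescription() call that you already use should pick up every descriptor matching the pattern.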

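This allows you to instantiate annotations as Java objects instead of using low-level FS calls. The following snippet adds an annotation over the whole document text. It should be straightforward to add logic that iterates over the tokens in the text and over the ingested (non-UIMA) annotations coming from your web service.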

@Override
public void process(JCas aJCas) throws AnalysisEngineProcessException {
    String text = aJCas.getDocumentText();
    // SomeAnnotation stands for one of your JCasGen-generated types
    SomeAnnotation a = new SomeAnnotation(aJCas);
    // set the annotation properties; for each property,
    // JCasGen should have generated a setter
    a.setSomePropertyValue(someValue);
    // let the annotation span the whole document text
    a.setBegin(0);
    a.setEnd(text.length());
    // add your annotation to the indexes
    a.addToIndexes(aJCas);
}
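If your type names are only known at run time, as in your converter, the same idea also works with the plain CAS API. The following is a rough sketch, not a drop-in replacement: it assumes the UIMA core library is on the classpath, the class SpanAdder and the int[][] span layout are made up for illustration, and the built-in type uima.tcas.Annotation stands in for your own types. The CAS created in main is only a stand-in for the one the CollectionReader would hand to your annotator.

```java
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.metadata.impl.TypeSystemDescription_impl;
import org.apache.uima.util.CasCreationUtils;

final class SpanAdder {
    // Adds one annotation of the named type per {begin, end} pair to the
    // CAS that was passed in, and returns how many were added.
    // Hypothetical helper: neither the name nor the span layout comes from UIMA.
    static int addSpans(CAS cas, String typeName, int[][] spans) {
        Type type = cas.getTypeSystem().getType(typeName);
        if (type == null) {
            // The type must already be part of the type system of this CAS.
            throw new IllegalArgumentException("Type not in CAS type system: " + typeName);
        }
        int added = 0;
        for (int[] span : spans) {
            AnnotationFS afs = cas.createAnnotation(type, span[0], span[1]);
            cas.addFsToIndexes(afs);
            added++;
        }
        return added;
    }

    public static void main(String[] args) throws Exception {
        // Demo only: in a pipeline this CAS comes from the CollectionReader.
        CAS cas = CasCreationUtils.createCas(new TypeSystemDescription_impl(), null, null);
        cas.setDocumentText("Barack Obama spoke");

        int before = cas.getAnnotationIndex().size();
        addSpans(cas, "uima.tcas.Annotation", new int[][] {{0, 12}, {13, 18}});
        System.out.println(cas.getAnnotationIndex().size() - before); // prints 2
    }
}
```

Note that the lookup fails loudly when the type is missing: types cannot be added to the type system of a CAS that already exists, so they have to be declared before the pipeline creates it.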

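To avoid messing around with begin and end String indexes, I suggest you use some Token annotation (from DKPro Core, for instance: https://dkpro.github.io/dkpro-core/) as an anchor point for your custom annotations.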
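The anchoring idea can be sketched in plain Java, without any UIMA or DKPro Core API: widen the offsets reported by the service to the boundaries of every token they overlap. The class TokenSnap and the int[] span layout are hypothetical; in a real pipeline the token offsets would come from the Token annotations already present in the CAS.

```java
import java.util.Arrays;

final class TokenSnap {
    // Widens the raw span [begin, end) to the boundaries of every token it
    // overlaps. Each token is a {begin, end} pair with an exclusive end.
    // If the span overlaps no token, it is returned unchanged.
    static int[] snapToTokens(int begin, int end, int[][] tokens) {
        int snappedBegin = begin;
        int snappedEnd = end;
        for (int[] token : tokens) {
            boolean overlaps = token[0] < end && begin < token[1];
            if (overlaps) {
                snappedBegin = Math.min(snappedBegin, token[0]);
                snappedEnd = Math.max(snappedEnd, token[1]);
            }
        }
        return new int[] {snappedBegin, snappedEnd};
    }

    public static void main(String[] args) {
        // Token offsets for the text "Barack Obama spoke".
        int[][] tokens = {{0, 6}, {7, 12}, {13, 18}};
        // Offsets from the service that cut into both name tokens.
        int[] snapped = snapToTokens(2, 10, tokens);
        System.out.println(Arrays.toString(snapped)); // prints [0, 12]
    }
}
```

The snapped offsets can then be used as the begin and end values of your custom annotation.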
