How to convert custom annotations to UIMA CAS structures and serialize them to XMI
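I am having trouble converting custom annotated documents to UIMA CASes and then serializing them to XMI so that I can view the annotations in the UIMA Annotation Viewer GUI.
I am using uimaFIT to build my components because it is easier to control, test, and debug. The pipeline consists of three components:
CollectionReader: reads files containing the raw text.
Annotator: converts the annotations from the custom documents into UIMA annotations.
CasConsumer: serializes the CASes to XMI.
My pipeline runs to the end and outputs XMI files, but they contain no annotations. I do not clearly understand how the CAS object is passed between the components. The Annotator logic makes RESTful calls to certain endpoints, and I am trying to convert the annotation model returned by the client SDK that the service provides. The conversion logic of the Annotator component looks like this: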
    public class CustomDocumentToUimaCasConverter implements UimaCasConverter {

        private TypeSystemDescription tsd;
        private AnnotatedDocument startDocument;
        private ArrayFS annotationFeatureStructures;
        private int featureStructureArrayCapacity;

        public AnnotatedDocument getStartDocument() {
            return startDocument;
        }

        public CustomDocumentToUimaCasConverter(AnnotatedDocument startDocument) {
            try {
                this.tsd = TypeSystemDescriptionFactory.createTypeSystemDescription();
            } catch (ResourceInitializationException e) {
                LOG.error("Error when creating default type system", e);
            }
            this.startDocument = startDocument;
        }

        public TypeSystemDescription getTypeSystemDescription() {
            return this.tsd;
        }

        @Override
        public void convertAnnotations(CAS cas) {
            Map<String, List<Annotation>> entities = this.startDocument.entities;
            int featureStructureArrayIndex = 0;

            inferCasTypeSystem(entities.keySet());
            try {
                /*
                 * This is a hack allowing the CAS object to have an updated type system.
                 * We are creating a new CAS by passing the new TypeSystemDescription which actually
                 * should have been updated by an internal call of typeSystemInit(cas.getTypeSystem())
                 * originally part of the CasInitializer interface that is now deprecated and the CollectionReader
                 * is calling it internally in its implementation. The problem consists in the fact that now
                 * the typeSystemInit method of the CasInitializer_ImplBase has an empty implementation and
                 * nothing changes!
                 */
                LOG.info("Creating new CAS with updated typesystem...");
                cas = CasCreationUtils.createCas(tsd, null, null);
            } catch (ResourceInitializationException e) {
                LOG.info("Error creating new CAS!", e);
            }

            TypeSystem typeSystem = cas.getTypeSystem();
            this.featureStructureArrayCapacity = entities.size();
            this.annotationFeatureStructures = cas.createArrayFS(featureStructureArrayCapacity);

            for (Map.Entry<String, List<Annotation>> entityEntry : entities.entrySet()) {
                String annotationName = entityEntry.getKey();
                annotationName = UIMA_ANNOTATION_TYPES_PACKAGE + removeDashes(annotationName);
                Type type = typeSystem.getType(annotationName);

                List<Annotation> annotations = entityEntry.getValue();
                LOG.info("Get Type -> " + type);
                for (Annotation ann : annotations) {
                    AnnotationFS afs = cas.createAnnotation(type, (int) ann.startOffset, (int) ann.endOffset);
                    cas.addFsToIndexes(afs);
                    if (featureStructureArrayIndex + 1 == featureStructureArrayCapacity) {
                        resizeArrayFS(featureStructureArrayCapacity * 2, annotationFeatureStructures, cas);
                    }
                    annotationFeatureStructures.set(featureStructureArrayIndex++, afs);
                }
            }
            cas.removeFsFromIndexes(annotationFeatureStructures);
            cas.addFsToIndexes(annotationFeatureStructures);
        }

        @Override
        public void inferCasTypeSystem(Iterable<String> originalTypes) {
            for (String typeName : originalTypes) {
                //UIMA Annotations are not allowed to contain dashes
                typeName = removeDashes(typeName);
                tsd.addType(UIMA_ANNOTATION_TYPES_PACKAGE + typeName,
                        "Automatically generated type for " + typeName, "uima.tcas.Annotation");
                LOG.info("Inserted new type -> " + typeName);
            }
        }

        /**
         * Removes dashes from UIMA Annotations because they are not allowed to contain dashes.
         *
         * @param typeName the annotation name of the current annotation of the source document
         * @return the transformed annotation name suited for the UIMA typesystem
         */
        private String removeDashes(String typeName) {
            if (typeName.contains("-")) {
                typeName = typeName.replaceAll("-", "_");
            }
            return typeName;
        }

        @Override
        public void setSourceDocumentText(CAS cas) {
            cas.setSofaDataString(startDocument.text, "text/plain");
        }

        private void resizeArrayFS(int newCapacity, ArrayFS originalArray, CAS cas) {
            ArrayFS biggerArrayFS = cas.createArrayFS(newCapacity);
            biggerArrayFS.copyFromArray(originalArray.toArray(), 0, 0, originalArray.size());
            this.annotationFeatureStructures = biggerArrayFS;
            this.featureStructureArrayCapacity = annotationFeatureStructures.size();
        }
    }
I would appreciate help from anyone who has dealt with converting annotations to UIMA types.
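I think your understanding of CASes and annotations may be off.
From

    * This is a hack allowing the CAS object to have an updated type system.

and

    LOG.info("Creating new CAS with updated typesystem...");
    cas = CasCreationUtils.createCas(tsd, null, null);

I take it that you are trying to create a new CAS inside the process() method of your Annotator (I assume the code you posted is executed there). Unless you are implementing a CAS multiplier, this is not the way to do it. Typically, the collection reader ingests the raw data and sets up the CAS for you in its getNext() method. That same CAS is then passed through the whole UIMA pipeline, and all you need to do is add UIMA annotations to it. Annotations added to a CAS you create yourself never reach the CasConsumer.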
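The effect of reassigning the cas parameter can be reproduced without UIMA. The sketch below is only an illustration: FakeCas is a hypothetical stand-in that records annotation labels, not a UIMA class, and the method names are made up for the example. It shows that rebinding a method parameter to a new object leaves the caller's object untouched, which would explain an XMI file without annotations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a CAS: it only records the annotations added to it.
class FakeCas {
    final List<String> annotations = new ArrayList<>();
}

class ReassignDemo {
    // Mirrors the pattern in the question: the parameter is rebound to a new
    // object, so everything added afterwards lands in that new object only.
    static void convertIntoNewCas(FakeCas cas) {
        cas = new FakeCas();
        cas.annotations.add("Person[0,5]");
    }

    // Adds to the very object the caller (the pipeline) passed in.
    static void convertInPlace(FakeCas cas) {
        cas.annotations.add("Person[0,5]");
    }

    public static void main(String[] args) {
        FakeCas pipelineCas = new FakeCas();

        convertIntoNewCas(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 0

        convertInPlace(pipelineCas);
        System.out.println(pipelineCas.annotations.size()); // prints 1
    }
}
```

In the real pipeline the same reasoning would apply to the CAS handed to process(): only changes made to that object are visible to the components that run afterwards.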
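For every annotation you want to add, UIMA needs to know its type in the type system. If you use JCasGen and the code it generates, this should not be a problem. Make sure your types can be auto-detected as described here: http://uima.apache.org/d/uimafit-current/tools.uimafit.book.html#d5e531
This lets you instantiate annotations as Java objects instead of using low-level FS calls. The following snippet adds one annotation spanning the whole document text. Adding the logic that iterates over the tokens in the text and over the (non-UIMA) annotations ingested from your web service should be straightforward.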
    @Override
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        String text = aJCas.getDocumentText();
        SomeAnnotation a = new SomeAnnotation(aJCas);

        // set the annotation properties
        // for each property, JCasGen should have
        // generated a setter
        a.setSomePropertyValue(someValue);

        // add your annotation to the indexes
        a.setBegin(0);
        a.setEnd(text.length());
        a.addToIndexes(aJCas);
    }
To avoid messing up the begin and end String indices, I suggest using token annotations (for example from DKPro Core: https://dkpro.github.io/dkpro-core/) as anchors for your custom annotations.
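As a rough illustration of anchoring on tokens, the sketch below snaps an arbitrary character span to the boundaries of the tokens it overlaps. It is plain Java under stated assumptions: tokenOffsets uses whitespace splitting as a stand-in for a real tokenizer such as the DKPro Core segmenters, and both method names are hypothetical. In a real pipeline the token offsets would come from the token annotations already present in the CAS.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenAnchors {
    // Returns {begin, end} offsets (end exclusive) of whitespace-separated
    // tokens, in document order. A stand-in for a real tokenizer.
    static List<int[]> tokenOffsets(String text) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("\\S+").matcher(text);
        while (m.find()) {
            tokens.add(new int[] {m.start(), m.end()});
        }
        return tokens;
    }

    // Expands [begin, end) so that it covers every token it overlaps.
    // Returns null if the span overlaps no token at all.
    static int[] snapToTokens(List<int[]> tokens, int begin, int end) {
        int newBegin = -1;
        int newEnd = -1;
        for (int[] t : tokens) {
            boolean overlaps = t[0] < end && begin < t[1];
            if (overlaps) {
                if (newBegin < 0) {
                    newBegin = t[0]; // first overlapping token sets the start
                }
                newEnd = t[1];       // last overlapping token sets the end
            }
        }
        return newBegin < 0 ? null : new int[] {newBegin, newEnd};
    }

    public static void main(String[] args) {
        String text = "UIMA is great";
        List<int[]> tokens = tokenOffsets(text);
        // A slightly misaligned span from an external service: "IMA i"
        int[] snapped = snapToTokens(tokens, 1, 6);
        System.out.println(snapped[0] + "-" + snapped[1]); // prints 0-7
    }
}
```

The snapped offsets can then be used as the begin and end of the custom annotation, so it always starts and ends on a token boundary.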