UIMA Ruta out-of-memory problem in a Spark environment

Question

I'm running a UIMA application on Apache Spark. Millions of pages arrive in batches to be processed by UIMA Ruta. Sometimes I get an out-of-memory exception, and it does not happen at a consistent point: one run successfully processes 2000 pages, while another fails after 500 pages.

Application log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA Ruta script

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};

Answer 1

Normally, the reasons for high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by a disjunctive rule element, which requires creating new annotations to store the match information.

The version of UIMA Ruta seems to be rather old, since the line numbers do not match the source I am looking at.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I was looking for a rule that could cause something like this but found none. This could be an old flaw that has been fixed in newer versions, or some preprocessing added additional CW/SW/CAP annotations.
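To check how deep the nesting goes in a longer log, the repeated frames can be counted mechanically instead of by eye. The following is a minimal sketch in plain Java; the class name FrameCounter and the method countFrames are hypothetical helpers for illustration and are not part of UIMA or Ruta.

```java
final class FrameCounter {
    /** Counts "at ..." lines in a stack trace that call the given Class.method. */
    static int countFrames(String stackTrace, String classAndMethod) {
        int count = 0;
        for (String line : stackTrace.split("\\R")) {
            String frame = line.trim();
            // A frame looks like: at package.Class.method(File.java:123)
            if (frame.startsWith("at ") && frame.contains("." + classAndMethod + "(")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Three frames copied from the log in the question.
        String excerpt = String.join("\n",
            "at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)",
            "at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)",
            "at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)");
        System.out.println(countFrames(excerpt, "ComposedRuleElement.continueOwnMatch"));
    }
}
```

Run against the full trace from the executor log, an unusually high count for ComposedRuleElement.continueOwnMatch would point at the same nested composed rule elements discussed here.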

As a first piece of advice, I would suggest two things:

Update to UIMA Ruta 2.6.0
Get rid of all disjunctive rule elements

The disjunctive rule elements are not really needed in your script. In general, they should not be used at all unless really required. I do not use them at all in production rules.

Instead of (SW | CW | CAP ) you can simply write W.

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}.

Using ANY as a matching condition can reduce runtime performance. In this example, two rules instead of the rule element rewrite might be better, e.g., something like:

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(Optional rule elements at the start of a rule are not treated as optional when the rule contains no anchor.)
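A side note on the character class used in these rules: in java.util.regex, a hyphen between two characters inside [...] denotes a range, so "-= in the class reads as "every character from the double quote to the equals sign". The sketch below shows the effect in plain Java. It assumes that Ruta's REGEXP condition follows java.util.regex semantics and that the rule string unescapes to the same pattern as the Java literal; the class name CharClassDemo and the LITERAL_CLASS variant are hypothetical and only illustrate what a literal hyphen would look like, if that was the intent.

```java
import java.util.regex.Pattern;

final class CharClassDemo {
    // The character class from the rules, written as a Java string literal.
    static final Pattern RULE_CLASS = Pattern.compile("['\"-=()\\[\\]]");
    // Hypothetical variant: the hyphen is moved to the end, where it is literal.
    static final Pattern LITERAL_CLASS = Pattern.compile("['\"=()\\[\\]-]");

    /** True if the whole token matches the pattern. */
    static boolean matches(Pattern p, String token) {
        return p.matcher(token).matches();
    }

    public static void main(String[] args) {
        // Single-character tokens, including some the range picks up by accident.
        for (String token : new String[] {"\"", "-", "=", "5", "+", ";", "[", ">"}) {
            System.out.println(token
                    + "  rule class: " + matches(RULE_CLASS, token)
                    + "  literal class: " + matches(LITERAL_CLASS, token));
        }
    }
}
```

With SPECIAL as the matching type, the extra characters are probably limited to other special characters such as + or ;. With ANY, a single-digit token would presumably pass the REGEXP check as well, so it may be worth deciding which characters are actually wanted before switching the matching type.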

By the way, there is a lot of room for optimization in your rules. If I had to guess, I'd say you can get rid of at least half the rules and 90% of all created annotations, which would also considerably reduce memory usage.

DISCLAIMER: I am a developer of UIMA Ruta.

Answer 1 (score 2, accepted, 2017-06-08 19:54:41)