
UIMA Ruta Out of Memory issue in Spark context

I'm running a UIMA application on Apache Spark. Millions of pages arrive in batches to be processed by UIMA Ruta. Sometimes I hit an out-of-memory exception, and not at a consistent point: one run processes 2000 pages successfully, while another fails after 500 pages.

Application Log

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.uima.internal.util.IntArrayUtils.expand_size(IntArrayUtils.java:57)
        at org.apache.uima.internal.util.IntArrayUtils.ensure_size(IntArrayUtils.java:39)
        at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)
        at org.apache.uima.cas.impl.Heap.add(Heap.java:241)
        at org.apache.uima.cas.impl.CASImpl.ll_createFS(CASImpl.java:2844)
        at org.apache.uima.cas.impl.CASImpl.createFS(CASImpl.java:489)
        at org.apache.uima.cas.impl.CASImpl.createAnnotation(CASImpl.java:3837)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotations(RuleMatch.java:172)
        at org.apache.uima.ruta.rule.RuleMatch.getMatchedAnnotationsOf(RuleMatch.java:68)
        at org.apache.uima.ruta.rule.RuleMatch.getLastMatchedAnnotation(RuleMatch.java:73)
        at org.apache.uima.ruta.rule.ComposedRuleElement.mergeDisjunctiveRuleMatches(ComposedRuleElement.java:330)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:213)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueMatch(ComposedRuleElement.java:225)
        at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)
        at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)

UIMA RUTA SCRIPT

WORDLIST EnglishStopWordList = 'stopWords.txt';
WORDLIST FiltersList = 'AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};

Normally, the reasons for high memory usage in UIMA Ruta can be found in RutaBasic (many annotations, coverage information) or in RuleMatch (inefficient rules, many rule element matches).

In your example, the problem seems to originate somewhere else. The stack trace indicates that the memory is used up by some disjunctive rule element, which needs to create new annotations for storing the match information.

It seems that the version of UIMA Ruta is rather old, since the line numbers do not match at all with the source I am looking at.

There are seven (!!!) calls of continueOwnMatch in the stack trace. I was looking for a rule that could cause something like this but found none. This could be an old flaw which has been fixed in newer versions, or some preprocessing added additional CW/SW/CAP annotations.
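The frame count mentioned above can be reproduced mechanically when a trace is too long to read by eye. The following is a minimal sketch in plain Java; the class name `FrameTally` and its `tally` method are hypothetical helpers, not part of UIMA or Ruta. It counts how often each `Class.method` frame occurs in a pasted stack trace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of UIMA Ruta): tallies repeated frames in a stack trace dump.
final class FrameTally {

    // Returns "package.Class.method" -> number of occurrences, in order of first appearance.
    public static Map<String, Integer> tally(String trace) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : trace.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.startsWith("at ")) {
                continue; // skip "Caused by: ..." and any other non-frame lines
            }
            int paren = trimmed.indexOf('(');
            // Drop the leading "at " and the trailing "(File.java:123)" part.
            String frame = paren < 0 ? trimmed.substring(3) : trimmed.substring(3, paren);
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String trace = String.join("\n",
                "Caused by: java.lang.OutOfMemoryError: Java heap space",
                "    at org.apache.uima.cas.impl.Heap.grow(Heap.java:187)",
                "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)",
                "    at org.apache.uima.ruta.rule.ComposedRuleElement.fallbackContinue(ComposedRuleElement.java:459)",
                "    at org.apache.uima.ruta.rule.ComposedRuleElement.continueOwnMatch(ComposedRuleElement.java:362)");
        tally(trace).forEach((frame, count) -> System.out.println(count + "  " + frame));
    }
}
```

A frame that recurs many times in a row, as `continueOwnMatch` does here, points to recursion inside a single rule rather than to many independent rules.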

As a first piece of advice, I would suggest two things:

  1. Update to UIMA Ruta 2.6.0
  2. Get rid of all disjunctive rule elements

The disjunctive rule elements are not really needed in your script. In general, they should not be used at all unless they are really required. I do not use them at all in production rules.

Instead of (SW | CW | CAP ) you can simply write W.

Instead of (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) you can write ANY{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))}.
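One detail to check before moving the REGEXP condition from SPECIAL to ANY: inside the character class, `"-=` reads as a range from `"` to `=`, which also covers digits and characters such as `+` and `/`. On SPECIAL this is mostly harmless, but on ANY a single-digit token could match too. The sketch below uses plain `java.util.regex` and assumes that Ruta's REGEXP condition follows Java regex semantics and has to match the whole covered text. The class `CharClassCheck` and the tightened pattern are illustrative assumptions, not taken from the original script.

```java
import java.util.regex.Pattern;

// Illustrative check of the character class used in the script (not part of UIMA Ruta).
final class CharClassCheck {

    // The class as written in the script: '"-=' is parsed as a range from '"' to '='.
    static final Pattern ORIGINAL = Pattern.compile("['\"-=()\\[\\]]");

    // Assumed intent: a literal hyphen. Moving '-' to the end of the class makes it literal.
    static final Pattern TIGHTENED = Pattern.compile("['\"=()\\[\\]-]");

    public static boolean original(String token) {
        return ORIGINAL.matcher(token).matches();
    }

    public static boolean tightened(String token) {
        return TIGHTENED.matcher(token).matches();
    }

    public static void main(String[] args) {
        for (String token : new String[] {"-", "=", "5", "+", "a"}) {
            System.out.println(token + "  original=" + original(token)
                    + "  tightened=" + tightened(token));
        }
    }
}
```

If the literal hyphen is what was meant, the same reordering can be applied to the string inside REGEXP(...) in the rules.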

Using ANY as a matching condition can reduce the runtime performance. In this example, two rules instead of the rule element rewrite might be better, e.g., something like

SPECIAL{REGEXP("['\"-=()\\[\\]]")} W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};
PM W ANY?{OR(REGEXP("['\"-=()\\[\\]]"),IS(PM))} EnglishStopWords? { -> MARK(Anchors, 1, 4)};

(optional rule elements at the start of a rule without any anchors in the rule are not optional)

By the way, there is a lot of room for optimization in your rules. If I had to guess, I'd say you can get rid of at least half the rules and 90% of all created annotations, which would also considerably reduce the memory usage.

DISCLAIMER: I am a developer of UIMA Ruta

