
MarkLogic delete > insert > CPF action on new document

See UPDATE below!

I have the following issue: we are collecting millions of documents (tweets) into MarkLogic, and on insert we have a CPF job that creates metadata for each document. More precisely, it adds a geotag based on location (if a location or coordinates are present).

Now we have a database that has been collecting tweets without the geotagger active. We would like to process the stored tweets with this CPF job by deleting and re-inserting each document that does not yet have a proper metadata geotag element. Then CPF does its job and geotags the "new" document.

We have written the following code to delete and insert the documents, but we get an XDMP-CONFLICTINGUPDATES error. I have been reading about transactions and tried several things: the ";" trick, wrapping the delete in xdmp:eval, and splitting the delete and insert into two separate function calls from xdmp:spawn.

Still no luck.

spawn-rename.xqy

xquery version "1.0-ml";

declare namespace j = "http://marklogic.com/xdmp/json/basic";
declare variable $to_process external;

declare function local:document-rename(
   $old-uri as xs:string, $new-uri as xs:string)
  as empty-sequence()
{
    (:xdmp:set-transaction-mode("update"),:)
    xdmp:eval(xdmp:document-delete($old-uri)),
    (:xdmp:commit():)

    let $permissions := xdmp:document-get-permissions($old-uri)
    let $collections := xdmp:document-get-collections($old-uri)
    return xdmp:document-insert(
      $new-uri, doc($old-uri),
      if ($permissions) then $permissions
      else xdmp:default-permissions(),
      if ($collections) then $collections
      else xdmp:default-collections(),
      xdmp:document-get-quality($old-uri)
    )
};

for $d in map:keys($to_process)
let $rename := local:document-rename($d, map:get($to_process,$d))
return true()

And to run the job for a specific set of documents we use:

xquery version "1.0-ml";
declare namespace j = "http://marklogic.com/xdmp/json/basic";
declare namespace dikw = 'http://www.example.com/dikw_functions.xqy';
import module namespace json = "http://marklogic.com/xdmp/json" at "/MarkLogic/json/json.xqy";

let $foo := cts:uris((),(), cts:not-query(cts:element-query(xs:QName("j:dikwmetadata"), cts:element-query(xs:QName("j:data"), cts:and-query(())))))
let $items := cts:uri-match("/twitter/403580066367815680.json") (:any valid uri or set of uris:)

let $map := map:map()

    let $f := doc($items[1])
    let $id := $f/j:json/j:id/text()
    let $oldUri := xdmp:node-uri($f)
    let $newUri := fn:concat("/twitter/", $f/j:json/j:id/text(), ".json")
    let $put := map:put($map,$oldUri,$newUri)

    let $spawn := xdmp:spawn("/Modules/DIKW/spawn-rename-split.xqy", (xs:QName("to_process"), $map))

return ($oldUri, " - ", $newUri) 

Question:

How can I set up the code so that it deletes the documents in the map first, in a separate transaction, and inserts them back later, so that CPF can do its geotagging?


UPDATE

OK, so per grtjn's comments (thanks so far!), I tried to rewrite my code like this:

xquery version "1.0-ml";
declare namespace j = "http://marklogic.com/xdmp/json/basic";

let $entries := cts:uri-match("//twitter/*")
let $entry-count := fn:count($entries)

let $transaction-size := 100 (: batch size $max :)
let $total-transactions := ceiling($entry-count div $transaction-size)

(: set total documents and total transactions so UI displays collecting :)
(: skip 84 85
let $set-total := infodev:ticket-set-total-documents($ticket-id, $entry-count)
let $set-trans := infodev:ticket-set-total-transactions($ticket-id,$total-transactions)
:)
    (: create transactions by breaking document set into maps
each maps's documents are saved to the db in their own transaction :)
let $transactions :=
    for $i at $index in 1 to $total-transactions
    let $map := map:map()
    let $start := (($i -1) *$transaction-size) + 1
    let $finish := min((($start - 1 + $transaction-size),$entry-count))
    let $put :=
        for $entry in ($entries)[$start to $finish]
        (: 96
        let $id := fn:concat(fn:string($entry/atom:id),".xml")
        :)
        let $id := fn:doc($entry)/j:json/j:id/text()
        return map:put($map,$id,$entry)
    return $map

(: the callback function for ingest 
skip 101 let $function := xdmp:function(xs:QName("feed:process-file"))
:)
let $ingestion :=
    for $transaction at $index in $transactions
    return true()
    return $ingestion (: this second return statement seems odd? :)
    (: do spawn here? :)
    (: xdmp:spawn("/modules/spawn-move.xqy", (xs:QName("to_process"), $map)) :)

Now I am puzzled: to get this 'working' I needed to add that last return, which seems wrong. I am also trying to figure out what exactly happens; if I run the query as is, it returns with a timeout error. I would like to first understand what the transaction actually does. Sorry for my ignorance, but it seems that performing a (relatively simple) task like renaming some documents is not that simple?

For completeness, my spawn-move.xqy is here:

xquery version "1.0-ml";

declare namespace j = "http://marklogic.com/xdmp/json/basic";
declare variable $to_process external;


declare function local:document-move(
   $id as xs:string, $doc as xs:string)
  as empty-sequence()
{
    let $newUri := fn:concat("/twitter/", $id, ".json")
    let $ins := xdmp:document-insert($newUri,fn:doc($doc))
    let $del := xdmp:document-delete($doc) 
    return true()
};

for $d in map:keys($to_process)
let $move := local:document-move($d, map:get($to_process,$d))
return true()

I suspect you are not actually renaming the documents, but just re-inserting them. The rename function you quote does not anticipate that situation and does a superfluous document-delete if $old-uri is identical to $new-uri. Add an if around the delete to skip it in that case. Keep everything else to preserve permissions, collections, quality, and properties. The document-insert function already removes any pre-existing document before the actual insert. See also:

http://docs.marklogic.com/xdmp:document-insert
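Following that suggestion, here is a minimal sketch of the rename function with the delete guarded by an if (same signature as the question's local:document-rename; a sketch, not the asker's final code):

xquery version "1.0-ml";

(: Sketch only: guard the delete so that re-inserting to the same URI
   no longer produces conflicting updates in one statement. :)
declare function local:document-rename(
   $old-uri as xs:string, $new-uri as xs:string)
  as empty-sequence()
{
  let $doc         := fn:doc($old-uri)
  let $permissions := xdmp:document-get-permissions($old-uri)
  let $collections := xdmp:document-get-collections($old-uri)
  let $quality     := xdmp:document-get-quality($old-uri)
  return (
    (: skip the delete when the URI does not change;
       xdmp:document-insert replaces the existing document anyway :)
    if ($old-uri ne $new-uri)
    then xdmp:document-delete($old-uri)
    else (),
    xdmp:document-insert(
      $new-uri, $doc,
      if ($permissions) then $permissions else xdmp:default-permissions(),
      if ($collections) then $collections else xdmp:default-collections(),
      $quality)
  )
};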

You might also consider adding a bit of logic to do multiple spawns. Ideally you would re-insert docs in batches of 100 to 500, depending on hardware and forest configuration. There is a nice example of how to calculate 'transactions' in this infostudio collector on GitHub (starting from line 80):

https://github.com/marklogic/infostudio-plugins/blob/master/collectors/collector-feed.xqy
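A hedged sketch of that batching idea, reusing the question's cts:uris query and spawn module path (the batch size and the identity old-URI/new-URI mapping are illustrative assumptions):

xquery version "1.0-ml";

declare namespace j = "http://marklogic.com/xdmp/json/basic";

(: Sketch only: split the URIs still missing geotag metadata into maps of
   $batch-size entries and spawn one task per map, so each batch runs as
   its own transaction on the task server. :)
let $uris := cts:uris((), (),
  cts:not-query(
    cts:element-query(xs:QName("j:dikwmetadata"),
      cts:element-query(xs:QName("j:data"), cts:and-query(())))))
let $batch-size := 200   (: somewhere in the suggested 100-500 range :)
let $batch-count := xs:integer(fn:ceiling(fn:count($uris) div $batch-size))
for $i in 1 to $batch-count
let $map := map:map()
let $start := ($i - 1) * $batch-size + 1
let $_ :=
  for $uri in fn:subsequence($uris, $start, $batch-size)
  return map:put($map, $uri, $uri)   (: old URI -> new URI, identical here :)
return xdmp:spawn("/Modules/DIKW/spawn-rename-split.xqy",
  (xs:QName("to_process"), $map))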

You can also consider doing the geo-work inside those transactions, instead of delegating it to CPF. But if your geo-lookup involves external calls, which could for instance be slow, then stick with CPF.
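For illustration of that inline alternative only (local:add-geotag is a hypothetical placeholder for whatever lookup the CPF action performs, not an existing function):

xquery version "1.0-ml";

declare namespace j = "http://marklogic.com/xdmp/json/basic";
declare variable $to_process external;

(: Hypothetical stand-in for the geotagging the CPF action does;
   it would return an enriched copy of the document. :)
declare function local:add-geotag($doc as document-node()) as document-node()
{
  (: ... derive a geotag from j:location / j:coordinates here ... :)
  $doc
};

for $uri in map:keys($to_process)
return xdmp:document-insert($uri, local:add-geotag(fn:doc($uri)))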

HTH!

It looks like in your sample you are trying to delete the document and write it to the same URI in the same statement. You may get around this with xdmp:commit(). However, another solution would be to first rename the documents in one batch (move them ALL out of the way), and then, after that is done, move them back in batches.
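A rough sketch of that two-pass idea, assuming a temporary /tmp URI prefix (illustrative only), that the CPF domain does not cover /tmp/, and a run from Query Console, where each ';'-separated statement commits as its own transaction. For millions of documents you would still batch and spawn each pass as in the sketch above.

xquery version "1.0-ml";
(: pass 1: move everything out of the way, /twitter/... -> /tmp/twitter/... :)
for $uri in cts:uris((), (), cts:directory-query("/twitter/", "infinity"))
return (
  xdmp:document-insert(fn:concat("/tmp", $uri), fn:doc($uri)),
  xdmp:document-delete($uri)
)
;
xquery version "1.0-ml";
(: pass 2 (separate transaction): move them back, which re-triggers CPF :)
for $tmp in cts:uris((), (), cts:directory-query("/tmp/twitter/", "infinity"))
return (
  xdmp:document-insert(fn:substring-after($tmp, "/tmp"), fn:doc($tmp)),
  xdmp:document-delete($tmp)
)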

Actually, if you have your CPF pipelines configured to handle updates like creates (this is the default configuration), then just re-inserting the document is enough:

xdmp:document-insert($d, doc($d))
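In that case the whole fix could reduce to something like this sketch (the not-query is copied from the question's cts:uris call; for millions of documents, batching and spawning as above still applies):

xquery version "1.0-ml";

declare namespace j = "http://marklogic.com/xdmp/json/basic";

(: Sketch only: re-insert every tweet still missing geotag metadata,
   so CPF sees an update and runs the geotagger again. :)
for $d in cts:uris((), (),
  cts:not-query(
    cts:element-query(xs:QName("j:dikwmetadata"),
      cts:element-query(xs:QName("j:data"), cts:and-query(())))))
return xdmp:document-insert($d, doc($d))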
