Posting a pdf to Solr using Ajax

I am trying to push (POST) PDF files to Solr/Tika for text extraction and indexing using Ajax/JS. I've gotten the following curl command to work:

curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@/PathToFile/SomeDoc.pdf"

This command puts the desired PDF into the Solr index, and I can retrieve it just fine. However, I need to be able to do this from a web browser. After much googling and a little experimentation, I've got the following JS code ALMOST working. It returns a 0 status code and a status of Success, but nothing gets committed to the index:

   $("#solrPost").click(function(event) {
        event.stopPropagation();
        event.preventDefault();

        /* Read a local pdf file as a blob */
        let fileAsBlob = null;
        let file = $('#upload_file')[0].files[0];
        let myReader = new FileReader();

        myReader.onloadend = function() {
            fileAsBlob = myReader.result;
            sendToSolr(fileAsBlob); 
        };
        fileAsBlob = myReader.readAsArrayBuffer(file);

        function sendToSolr(fileAsBlob) {
            $.ajax({ 
                url:"http://localhost:8983/solr/techproducts/update/extract?literal.id=doc2&commit=true",
                type: 'POST',
                data: fileAsBlob,
                cache: false,
                crossOrigin: true,
                dataType: 'jsonp',
                jsonp: 'json.wrf',
                processData: false,
                contentType: false, 

                success: function(data, status) {
                    console.log("Ajax.post successful, status: " + data.responseHeader.status + "\t status text: " + status);
                    console.log("debug");
                },
                error: function(data, status) {
                    console.log("Ajax.post error, status: " + data.status + "\t status text:" + data.statusText);
                },
                done: function(data, status) {
                    console.log("Ajax.post Done");
                }
            });
        }
   });

This is SO close to working, but I just can't figure out what's going wrong. All indications (from the client side) are good, but nothing is added to the index. Note:

  1. The FileReader is working; I see an array of the same size as the source PDF.
  2. Even though I specify POST, when I examine the network tab in the browser/debugger, it says GET.
  3. I've hardcoded literal.id=doc2 for simplicity; it's not a long-term strategy...

I know there are similar posts, but none address the issue of extracting PDFs using Solr/Tika outside of the provided post script. Thanks in advance.

Well, it took some searching, but thanks to a post by "tonejac" I found the solution. If you look at "JQuery Ajax is sending GET instead of POST", the VERY last comment states that if you use dataType: jsonp, "POST" gets converted to "GET". (JSONP works by injecting a <script> tag, which can only issue GET requests, so the file in the request body is never sent.) I deleted the jsonp, installed a plugin to handle the CORS issue I was trying to avoid by using jsonp, and voilà, it worked.

For those interested, the working code is posted below. It's not fancy or robust, but it allows me to post or get documents (.pdf, .docx...) to Solr from a web app. I've only posted the JS code, but the HTML is simple and provides an input of type "file", as well as inputs to set the id for posting docs or searching by id. There are two buttons, solrPost and solrGet, which call the listeners in the JS. The connectSolr() function is called from the HTML onLoad.

function connectSolr() {
$("#solrPost").click(function(event) {
    event.stopPropagation();
    event.preventDefault();

    /* Read a local pdf file as a blob */
    let fileAsBlob = null;
    let file = $('#upload_file')[0].files[0];
    let myReader = new FileReader();

    myReader.onloadend = function() {
        fileAsBlob = myReader.result;

        sendToSolr(fileAsBlob); 
    };
    myReader.readAsArrayBuffer(file); // returns nothing; the result arrives in onloadend
    /* Get the unique Id for the doc and append it (URL-encoded) to the extract url */
    let docId = $("#userSetId").val();
    let extractUrl = "http://localhost:8983/solr/techproducts/update/extract/?commit=true&literal.id=" + encodeURIComponent(docId);


    /* Ajax call to Solr/Tika to extract text from pdf and index it */
    function sendToSolr(fileAsBlob) {
        $.ajax({
            url: extractUrl,
            type: 'POST',
            data: fileAsBlob,
            cache: false,
            processData: false, // send the ArrayBuffer as-is
            contentType: false, // don't let jQuery set its default form-encoded Content-Type

            success: function(data, status) {
                console.log("Ajax.post successful, status: " + data.responseHeader.status + "\t status text: " + status);
            },
            error: function(jqXHR, status) {
                console.log("Ajax.post error, status: " + jqXHR.status + "\t status text:" + jqXHR.statusText);
            },
            // $.ajax has no "done" setting; "complete" runs after success or error
            complete: function(jqXHR, status) {
                console.log("Ajax.post Done");
            }
        });
    }
});


$("#solrGet").click(function(event) {
    event.stopPropagation();
    event.preventDefault();
    let docId = "id:" + $("#docId").val();
    $.ajax({
        url:"http://localhost:8983/solr/techproducts/select/",
        type: "get",
        dataType: "jsonp",
        data: {
            q: docId
            //wt: "json",
            //indent: "true"
        },
        jsonp: "json.wrf",
        //"json.wrf": "?",
        success: function(data, status) {
            renderDoc(data, status);
        },
        error: function(data, status) {
            console.log("Ajax.get error, Error: " + status);
        },
        // $.ajax has no "done" setting; "complete" runs after success or error
        complete: function(jqXHR, status) {
            console.log("Ajax.get Done");
        }
    });
    console.log("Debug");
});


let renderDoc = function(theText, statusCode) {
    let docs = theText.response.docs;
    let $textArea = $("#textArea");
    $textArea.empty();
    /* Nothing to render if the query matched no document */
    if (docs.length === 0) {
        console.log("No document found for that id");
        return;
    }
    let extractedText = docs[0].content[0];
    /* One span per line; .text() escapes any markup in the extracted text */
    extractedText.split('\n').forEach(function(element) {
        $textArea.append($("<span />").text(element).append("<br/>"));
    });
};

}
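
For anyone who would rather mirror the curl command more closely: curl's -F sends the file as multipart/form-data, and the browser can do the same with FormData, which also removes the need for the FileReader. The following is only a rough sketch, not something I have run against Solr. The helper names buildExtractUrl and postToSolr are made up for illustration, the base URL assumes the techproducts core from the question, and it assumes Solr answers with JSON. It does not get around CORS either; the page still has to be allowed to talk to Solr.

```javascript
// Sketch only: buildExtractUrl and postToSolr are hypothetical helper names,
// and the base URL assumes the techproducts example core used in the question.
const SOLR_EXTRACT_URL = "http://localhost:8983/solr/techproducts/update/extract";

// Build the extract URL; URLSearchParams escapes the user-supplied id.
function buildExtractUrl(docId, baseUrl = SOLR_EXTRACT_URL) {
    const params = new URLSearchParams({ "literal.id": docId, commit: "true" });
    return baseUrl + "?" + params.toString();
}

// POST the file as multipart/form-data, like `curl -F "myfile=@..."`.
// For a FormData body, fetch sets the multipart Content-Type (with boundary) itself.
async function postToSolr(file, docId) {
    const form = new FormData();
    form.append("myfile", file);
    const response = await fetch(buildExtractUrl(docId), { method: "POST", body: form });
    if (!response.ok) {
        throw new Error("Solr returned HTTP " + response.status);
    }
    return response.json(); // assumption: Solr replies with JSON
}
```

Inside the solrPost click handler this would be called as postToSolr($('#upload_file')[0].files[0], $("#userSetId").val()), with .then() and .catch() taking the place of the success and error callbacks.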
