Integrate extracted PDF content with django-haystack
I have extracted PDF/DOCX content with Solr, and I have managed to run some search queries against it using the following Solr URL:
http://localhost:8983/solr/select?q=Lycee
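For reference, the same request can be assembled from Python with only the standard library. This is just a sketch: the helper name build_select_url is made up for illustration, and the base URL is the one from the query above.

```python
from urllib.parse import urlencode

# Base URL taken from the query shown above; adjust it for your own Solr setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(term, base_url=SOLR_SELECT_URL):
    """Return the Solr select URL for a single search term (hypothetical helper)."""
    # urlencode takes care of escaping spaces and non-ASCII characters.
    return base_url + "?" + urlencode({"q": term})


url = build_select_url("Lycee")
print(url)
```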
I would like to run the same kind of query through django-haystack. I found this document, which covers the issue:
https://github.com/toastdriven/django-haystack/blob/master/docs/rich_content_extraction.rst
But django-haystack (2.0.0-beta) has no "FileIndex" class. How can I integrate this kind of search with django-haystack?
The "FileIndex" referenced in the documentation is a hypothetical subclass of haystack.indexes.SearchIndex. Here is an example:
from django.template import Context, loader
from haystack import indexes

from myapp.models import MyFile


class FileIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    title = indexes.CharField(model_attr='title')
    owner = indexes.CharField(model_attr='owner__name')

    def get_model(self):
        return MyFile

    def index_queryset(self, using=None):
        return self.get_model().objects.all()

    def prepare(self, obj):
        data = super(FileIndex, self).prepare(obj)

        # This could also be a regular Python open() call, a StringIO instance
        # or the result of opening a URL. Note that due to a library limitation
        # file_obj must have a .name attribute even if you need to set one
        # manually before calling extract_file_contents:
        file_obj = obj.the_file.open()
        extracted_data = self.backend.extract_file_contents(file_obj)

        # Now we'll finally perform the template processing to render the
        # text field with *all* of our metadata visible for templating.
        # Newer Django versions expect a plain dict here instead of Context.
        t = loader.select_template(('search/indexes/myapp/myfile_text.txt', ))
        data['text'] = t.render(Context({'object': obj,
                                         'extracted': extracted_data}))

        return data
So extracted_data would be replaced with whatever process you came up with to extract the PDF/DOCX content. You would then update your template to include that data.
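To make the last step more concrete, here is a rough plain-Python sketch of what the rendered text field could end up containing. It is not Haystack or Django code: build_document_text is a made-up helper, the regex is only a crude stand-in for a template filter such as striptags, and the shape of the extracted dict (a "contents" string plus a "metadata" dict of lists) is an assumption based on how I understand the Haystack documentation describes the result of extract_file_contents.

```python
import re


def build_document_text(title, owner_name, extracted):
    """Combine model fields with extracted file data into one text blob.

    `extracted` is assumed to be a dict with a "contents" string and a
    "metadata" dict mapping names to lists of values.
    """
    lines = [title, owner_name]
    # Add every metadata value on its own "key: value" line.
    for key, values in sorted(extracted.get("metadata", {}).items()):
        for value in values:
            lines.append("%s: %s" % (key, value))
    # Crude tag removal, then collapse runs of whitespace.
    contents = re.sub(r"<[^>]+>", " ", extracted.get("contents", ""))
    lines.append(" ".join(contents.split()))
    return "\n".join(lines)


# Sample data invented for illustration only.
sample = {
    "contents": "<html><body><p>Lycee Jean Moulin</p></body></html>",
    "metadata": {"Content-Type": ["application/pdf"]},
}
text = build_document_text("Report", "Alice", sample)
print(text)
```

In the real index the template does this job, so anything you want to be searchable (title, owner, extracted contents) has to be emitted by search/indexes/myapp/myfile_text.txt.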