
Python Lucene: adding field contents to a document is not working

I am indexing URL pages with Python Lucene.

I got an error while trying to add fields to the Document, and I am not sure why. The error says:

JavaError: , >
Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
    at org.apache.lucene.document.Field.<init>(Field.java:249)

It is raised on this line: doc.add(Field("contents", text, t2))

The Python code I used is:

# Imports are assumed; the original post did not show them.
import os
import re
import robotparser
import urllib2

import lucene
from bs4 import BeautifulSoup
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory


def IndexerForUrl(start, number, domain):
    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index

    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # hashtable dictionary
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False
    n = start

    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)

        if pg not in visited:
            visited.add(pg)

            htmlwebpg = urllib2.urlopen(pg).read()
            # robot exclusion standard
            rp = robotparser.RobotFileParser()
            rp.set_url(pg)
            rp.read()  # fetch the URL set above and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')

            # drop script and style elements before extracting the text
            for script in soup(["script", "style"]):
                script.extract()
            text = soup.get_text()

            # strip whitespace and drop empty chunks
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()

            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s " % pg

            writer.addDocument(doc)

            numVisited = numVisited + 1

            linkset = set()

            # add to list
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                # links.append(link.get('href'))
                # can_fetch takes a user agent first; "*" means any agent
                if rp.can_fetch("*", link.get('href')):
                    linkset.add(link.get('href'))

            # record the outgoing links once, after the loop
            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer

If a field is neither indexed nor stored, it is not represented in the index in any way, so it makes no sense for it to be there. I'm guessing that you want to index FieldType t2. To do that, you need to set the IndexOptions, something like:

t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
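
For context, here is a rough sketch of how both field types from the question might be set up with that change applied. It assumes PyLucene built against Lucene 5 or later (where IndexOptions lives in org.apache.lucene.index) and that lucene.initVM() has already been called, as in the question. The helper name build_field_types is made up for this example, and I have not run it against the asker's setup.

```python
def build_field_types():
    """Return (t1, t2): a stored-only type and an indexed, tokenized type.

    build_field_types is a hypothetical helper name, not part of PyLucene.
    """
    # Imported lazily so this module still loads where PyLucene is absent.
    # Assumption: lucene.initVM() was already called by the caller.
    import lucene  # noqa: F401
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    # t1: stored as-is, not tokenized. Being stored is enough to pass
    # the "neither indexed nor stored" check.
    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    # t2: not stored, so it must be indexed to be valid.
    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)
    t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return t1, t2
```

With these types, doc.add(Field("contents", text, t2)) should no longer raise the IllegalArgumentException. If the URL is only meant to be read back from search results, using t1 for "urlpath" may be what was intended, but that is a guess about the asker's goal.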
