I am indexing url pages with python lucene.
I got an error while trying to add fields to the Document, and I am not sure why. The error says:

JavaError: Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.&lt;init&gt;(Field.java:249)

It is raised on the line where I call: doc.add(Field("contents", text, t2))
The python code I used is:
import os
import re
import urllib2
import robotparser
from urlparse import urljoin

import lucene
from java.nio.file import Paths
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.document import Document, Field, FieldType
from bs4 import BeautifulSoup

def IndexerForUrl(start, number, domain):
    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index
    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # hashtable dictionary
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False
    n = start
    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)
        if pg not in visited:
            visited.add(pg)
            htmlwebpg = urllib2.urlopen(pg).read()

            # robot exclusion standard: point the parser at the site's robots.txt
            rp = robotparser.RobotFileParser()
            rp.set_url(urljoin(pg, "/robots.txt"))
            rp.read()  # fetch robots.txt and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')
            for script in soup(["script", "style"]):
                script.extract()

            text = soup.get_text()
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s" % pg

            writer.addDocument(doc)
            numVisited = numVisited + 1

            linkset = set()
            # collect outgoing links allowed by robots.txt
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                # can_fetch takes a user agent plus the URL
                if rp.can_fetch("*", link.get('href')):
                    linkset.add(link.get('href'))

            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer
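As a side note, the whitespace-cleaning step in the loop (splitlines / split / join) can be exercised on its own, independent of Lucene and BeautifulSoup. This is a standalone sketch of that logic, with a made-up input string for illustration:

```python
def clean_text(text):
    # strip leading/trailing whitespace from each line
    lines = (line.strip() for line in text.splitlines())
    # split each line into space-separated phrases and strip those too
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # keep only non-empty chunks, one per output line
    return '\n'.join(chunk for chunk in chunks if chunk)

print(clean_text("  Hello   world \n\n  foo "))  # -> Hello / world / foo, one per line
```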
If a field is neither indexed nor stored, it would not be represented in the index in any way, so it doesn't make sense for it to exist. I'm guessing that you want the FieldType t2 to be indexed. To do that, you need to set its IndexOptions, something like:
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
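Putting it together, the full FieldType setup for an indexed-but-not-stored text field might look like the following. This is a sketch, not tested here, assuming a PyLucene build matching Lucene 5.3 or later, where IndexOptions lives in org.apache.lucene.index:

```python
from org.apache.lucene.document import FieldType
from org.apache.lucene.index import IndexOptions

t2 = FieldType()
t2.setStored(False)    # do not keep the raw value in the index
t2.setTokenized(True)  # run the value through the analyzer
# index the field with documents, term frequencies and positions
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
```

With that in place, Field("contents", text, t2) is indexed (searchable) even though its raw value is not stored.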