
Python lucene function add field contents to document not working

I am indexing web pages with PyLucene.

I get an error when trying to add fields to the Document, and I am not sure why. The error says:

JavaError: java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
    at org.apache.lucene.document.Field.<init>(Field.java:249)

It is raised on the line where I call doc.add(Field("contents", text, t2)).

The python code I used is:

import os
import re
import urllib2
import urlparse
import robotparser

import lucene
from bs4 import BeautifulSoup
from java.nio.file import Paths
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, FieldType
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import SimpleFSDirectory


def IndexerForUrl(start, number, domain):

    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index

    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # hashtable dictionary
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False

    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)

        if pg not in visited:
            visited.add(pg)

            htmlwebpg = urllib2.urlopen(pg).read()

            # robot exclusion standard: point the parser at the site's
            # robots.txt, not at the page itself
            rp = robotparser.RobotFileParser()
            rp.set_url(urlparse.urljoin(pg, "/robots.txt"))
            rp.read()  # fetch robots.txt and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')

            for script in soup(["script", "style"]):
                script.extract()
            text = soup.get_text()

            # collapse whitespace: strip each line, split on double spaces,
            # and drop empty chunks
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s" % pg

            writer.addDocument(doc)

            numVisited = numVisited + 1

            linkset = set()

            # collect outgoing links permitted by robots.txt
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                # can_fetch takes (useragent, url)
                if rp.can_fetch("*", link.get('href')):
                    linkset.add(link.get('href'))

            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer

If a field is neither indexed nor stored, it would not be represented in the index in any way, so it makes no sense for it to exist. I'm guessing that you want FieldType t2 to be indexed. To do that, you need to set its IndexOptions, something like:

t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
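Putting the answer together with the question's setup, a minimal sketch of the t2 field type might look like the following. This assumes a recent PyLucene where IndexOptions lives in org.apache.lucene.index; it is a configuration fragment, not a complete indexer.

    # Sketch only: assumes PyLucene is installed and the JVM is started.
    import lucene
    from org.apache.lucene.document import FieldType
    from org.apache.lucene.index import IndexOptions

    lucene.initVM()

    t2 = FieldType()
    t2.setStored(False)    # do not keep the raw text in the index
    t2.setTokenized(True)  # run the analyzer over the field value
    # index full postings so phrase/positional queries work on "contents"
    t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)

With this in place, Field("contents", text, t2) is searchable but its text cannot be read back from search results; if you also want to retrieve the content, use setStored(True) instead.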
