I am indexing url pages with python lucene.
I got an error while trying to add fields to the Document, and I am not sure why. The error says:

JavaError: Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
        at org.apache.lucene.document.Field.&lt;init&gt;(Field.java:249)

It is raised on the line where I call: doc.add(Field("contents", text, t2))
The python code I used is:
import os
import re
import urllib2
import robotparser
from urlparse import urljoin

import lucene
from java.nio.file import Paths
from org.apache.lucene.store import SimpleFSDirectory
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.document import Document, Field, FieldType
from bs4 import BeautifulSoup

def IndexerForUrl(start, number, domain):
    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path))  # the index
    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)

    print "reading lines from sys.std..."

    # hashtable dictionary
    D = {}
    D[start] = [start]

    numVisited = 0
    wordBool = False
    n = start
    queue = [start]
    visited = set()

    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)

    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)

    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)
        if pg not in visited:
            visited.add(pg)
            htmlwebpg = urllib2.urlopen(pg).read()

            # robot exclusion standard: point the parser at the site's robots.txt
            rp = robotparser.RobotFileParser()
            rp.set_url(urljoin(pg, "/robots.txt"))
            rp.read()  # fetch robots.txt and feed it to the parser

            soup = BeautifulSoup(htmlwebpg, 'html.parser')
            for script in soup(["script", "style"]):
                script.extract()

            text = soup.get_text()
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            text = '\n'.join(chunk for chunk in chunks if chunk)

            print text

            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s" % pg

            writer.addDocument(doc)
            numVisited = numVisited + 1

            linkset = set()
            # collect outgoing links allowed by robots.txt
            for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
                # can_fetch takes a user agent plus the URL
                if rp.can_fetch("*", link.get('href')):
                    linkset.add(link.get('href'))

            D[pg] = linkset
            queue.extend(D[pg] - visited)

    writer.commit()
    writer.close()
    directory.close()  # close the index
    return writer
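As a side note, the whitespace-cleaning step in the loop (splitlines / split / join) can be exercised on its own, independent of Lucene and BeautifulSoup. This is a standalone sketch of that logic, with a made-up input string for illustration:

```python
def clean_text(text):
    # strip leading/trailing whitespace from each line
    lines = (line.strip() for line in text.splitlines())
    # split each line into space-separated phrases and strip those too
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    # keep only non-empty chunks, one per output line
    return '\n'.join(chunk for chunk in chunks if chunk)

print(clean_text("  Hello   world \n\n  foo "))  # -> Hello / world / foo, one per line
```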
If a field is neither indexed nor stored, it would not be represented in the index in any way, so it doesn't make sense for it to exist. I'm guessing that you want the FieldType t2 to be indexed. To do that, you need to set its IndexOptions, something like:
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
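Putting it together, the full FieldType setup for an indexed-but-not-stored text field might look like the following. This is a sketch, not tested here, assuming a PyLucene build matching Lucene 5.3 or later, where IndexOptions lives in org.apache.lucene.index:

```python
from org.apache.lucene.document import FieldType
from org.apache.lucene.index import IndexOptions

t2 = FieldType()
t2.setStored(False)    # do not keep the raw value in the index
t2.setTokenized(True)  # run the value through the analyzer
# index the field with documents, term frequencies and positions
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
```

With that in place, Field("contents", text, t2) is indexed (searchable) even though its raw value is not stored.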