Python Lucene: adding field contents to a document not working
I am indexing URL pages with Python Lucene.
I get an error when I try to add fields to the Document, and I am not sure why. The error says:
JavaError: , > Java stacktrace:
java.lang.IllegalArgumentException: it doesn't make sense to have a field that is neither indexed nor stored
    at org.apache.lucene.document.Field.(Field.java:249)
It is raised on the line: doc.add(Field("contents", text, t2))
The Python code I used is:
def IndexerForUrl(start, number, domain):
    lucene.initVM()
    # join base dir and index dir
    path = os.path.abspath("paths")
    directory = SimpleFSDirectory(Paths.get(path)) # the index
    analyzer = StandardAnalyzer()
    writerConfig = IndexWriterConfig(analyzer)
    writerConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    writer = IndexWriter(directory, writerConfig)
    print "reading lines from sys.std..."
    # hashtable dictionary
    D = {}
    D[start] = [start]
    numVisited = 0
    wordBool = False
    n = start
    queue = [start]
    visited = set()
    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)
    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)
    while numVisited < number and queue and not wordBool:
        pg = queue.pop(0)
        if pg not in visited:
            visited.add(pg)
            htmlwebpg = urllib2.urlopen(pg).read()
            # robot exclusion standard
            rp = robotparser.RobotFileParser()
            rp.set_url(pg)
            rp.read() # read robots.txt url and feeds to parser
            soup = BeautifulSoup(htmlwebpg, 'html.parser')
            for script in soup(["script","style"]):
                script.extract()
            text = soup.get_text()
            lines = (line.strip() for line in text.splitlines())
            chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
            text = '\n'.join(chunk for chunk in chunks if chunk)
            print text
            doc = Document()
            doc.add(Field("urlpath", pg, t2))
            if len(text) > 0:
                doc.add(Field("contents", text, t2))
            else:
                print "warning: no content in %s " % pg
            writer.addDocument(doc)
            numVisited = numVisited+1
            linkset = set()
            # add to list
            for link in soup.findAll('a', attrs={'href':re.compile("^http://")}):
                #links.append(link.get('href'))
                if rp.can_fetch(link.get('href')):
                    linkset.add(link.get('href'))
            D[pg] = linkset
            queue.extend(D[pg] - visited)
    writer.commit()
    writer.close()
    directory.close() #close the index
    return writer
If a field is neither indexed nor stored, it is not represented in the index in any way, so it makes no sense for it to exist. That is the case for t2: a freshly created FieldType is not indexed by default (in Lucene 5 and later its index options start out as IndexOptions.NONE), and the code also calls t2.setStored(False). I'm guessing that you want fields of FieldType t2 to be indexed. To do that, you need to set its IndexOptions, something like:
t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
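For reference, here is a minimal sketch of how both field types from the question could be configured so that Lucene accepts them. It assumes PyLucene built against Lucene 5 or later (where IndexOptions lives in org.apache.lucene.index) and that the JVM has already been started with lucene.initVM(). The helper names make_field_types and make_document are made up for illustration, and using t1 for the urlpath field is my assumption about the intent, since the original code defines t1 but never uses it.

```python
# Sketch only: requires PyLucene, and lucene.initVM() must have been called
# before either helper below is used.
try:
    import lucene  # noqa: F401  (makes the org.apache.* wrappers importable)
    from org.apache.lucene.document import Document, Field, FieldType
    from org.apache.lucene.index import IndexOptions
except ImportError:
    # PyLucene is not installed; the helpers below cannot work without it.
    Document = Field = FieldType = IndexOptions = None


def make_field_types():
    """Return (t1, t2): a stored exact-match type and an indexed-only text type."""
    # t1: stored, so the value can be read back from search hits, and
    # indexed as one un-tokenized term (assumed intent for the URL field).
    t1 = FieldType()
    t1.setStored(True)
    t1.setTokenized(False)
    t1.setIndexOptions(IndexOptions.DOCS)

    # t2: tokenized and indexed with frequencies and positions, not stored.
    # Setting the index options is what avoids the IllegalArgumentException.
    t2 = FieldType()
    t2.setStored(False)
    t2.setTokenized(True)
    t2.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS)
    return t1, t2


def make_document(url, text, url_type, contents_type):
    """Build one Document for a crawled page (hypothetical helper)."""
    doc = Document()
    doc.add(Field("urlpath", url, url_type))
    if text:
        # Only add the contents field when the page produced some text.
        doc.add(Field("contents", text, contents_type))
    return doc
```

Inside the crawl loop this would be used roughly as t1, t2 = make_field_types() once before the loop, then writer.addDocument(make_document(pg, text, t1, t2)) per page. Lucene's built-in StringField and TextField classes come preconfigured with similar settings, so they may be a simpler option if no custom FieldType is needed.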