I am working on a project that involves finding the 'domains of knowledge' a given keyword is related to. I plan to do this using DMOZ. For example, 'Brad Pitt' gives:
Arts: People: P: Pitt, Brad: Fan Pages (10)
Arts: People: P: Pitt, Brad: Articles and Interviews (5)
Arts: People: P: Pitt, Brad (4)
Arts: People: P: Pitt, Brad: Image Galleries (2)
Arts: People: P: Pitt, Brad: Movies (2)
and so on...
I have the structure.rdf.u8 dump from the DMOZ website. Someone mentioned to me that if I do not need the URLs, this file alone is enough (I don't need the websites, only the categories pertaining to keywords). Or do I need the content file as well?
Moreover, I would like to know the best way to parse the structure file using Python (any library). I don't have any knowledge of XML, though I am good with Python.
I started with https://github.com/kremso/dmoz-parser and made a simple topic filter: https://github.com/lawrencecreates/dmoz-parser/blob/master/sample.py#L6
class LawrenceFilter:
    """Writes every page filed under the Lawrence, Kansas topic to seeds.txt."""

    def __init__(self):
        self._file = open("seeds.txt", "w")

    def page(self, page, content):
        # Skip missing or empty page URLs.
        if page:
            topic = content['topic']
            # 'in' also matches at index 0, which find(...) > 0 would miss.
            if 'United_States/Kansas/Localities/L/Lawrence' in topic:
                self._file.write(page + "\n")
                print("found page %s in topic %s" % (page, topic))

    def finish(self):
        # Close the output file.
        self._file.close()
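For the structure file itself, a streaming parse with the standard library's xml.etree.ElementTree might be enough to map a keyword to category paths. The following is only a rough sketch under several assumptions: that each category is a Topic element whose path is stored in a namespaced id attribute, that matching the keyword's words against the path text is a good enough notion of "related", and that the namespace URIs in the sample are right. The helper names (local_name, iter_topic_ids, categories_for) and the SAMPLE data are made up for illustration and do not come from the real dump.

```python
import io
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for structure.rdf.u8 (hypothetical data;
# the namespace URIs and catid values here are assumptions).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Arts/People/P/Pitt,_Brad">
    <catid>1</catid>
  </Topic>
  <Topic r:id="Top/Arts/People/P/Pitt,_Brad/Fan_Pages">
    <catid>2</catid>
  </Topic>
  <Topic r:id="Top/Regional/North_America/United_States/Kansas/Localities/L/Lawrence">
    <catid>3</catid>
  </Topic>
</RDF>
"""


def local_name(qualified):
    """Strip the '{namespace}' prefix ElementTree adds to tags and attributes."""
    return qualified.rsplit("}", 1)[-1]


def iter_topic_ids(source):
    """Yield the id attribute of every Topic element, one element at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "Topic":
            for name, value in elem.attrib.items():
                if local_name(name) == "id":
                    yield value
            # Drop the element's children and attributes to limit memory use.
            elem.clear()


def categories_for(keyword, source):
    """Return topic paths that contain every word of the keyword."""
    words = keyword.lower().split()
    hits = []
    for topic_id in iter_topic_ids(source):
        # Paths use underscores instead of spaces, e.g. 'Fan_Pages'.
        haystack = topic_id.lower().replace("_", " ")
        if all(word in haystack for word in words):
            hits.append(topic_id)
    return hits


print(categories_for("Brad Pitt", io.BytesIO(SAMPLE)))
```

For the real dump, the same function could be given the file path instead of the in-memory sample, e.g. categories_for("Brad Pitt", "structure.rdf.u8"). Matching by local name rather than by full namespace avoids hard-coding namespace URIs that I have not verified against the actual file.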