
Parsing DMOZ dumps for category queries in Python

I am currently working on a project that involves finding the 'domains of knowledge' a given keyword is related to. I plan to do this using DMOZ. For example, 'Brad Pitt' gives

Arts: People: P: Pitt, Brad: Fan Pages (10)

Arts: People: P: Pitt, Brad: Articles and Interviews (5)

Arts: People: P: Pitt, Brad (4)

Arts: People: P: Pitt, Brad: Image Galleries (2)

Arts: People: P: Pitt, Brad: Movies (2)

and so on...

I have the structure.rdf.u8 dump from the DMOZ website. Someone mentioned to me that if I do not need the URLs, this file alone is enough (I don't need the websites, only the categories pertaining to keywords). Or do I need the content file as well?

I would also like to know the best way to parse the structure file using Python (any library is fine). I don't have any knowledge of XML, though I am comfortable with Python.
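To make the question concrete, here is a minimal sketch of the kind of parsing I have in mind, using only the standard library's xml.etree.ElementTree.iterparse so the whole dump never has to sit in memory. It assumes that each category in structure.rdf.u8 is a Topic element carrying an id attribute such as Top/Arts/People/P/Pitt,_Brad; the namespace URIs, catid values and titles in the sample below are illustrative, and the names find_categories, format_topic and local_name are my own helpers, not part of DMOZ or any library. Matching is a naive "every word of the keyword appears in the id" test.

```python
import io
import xml.etree.ElementTree as ET


def local_name(qualified):
    """Strip an ElementTree '{namespace}' prefix from a tag or attribute name."""
    return qualified.rsplit("}", 1)[-1]


def find_categories(source, keyword):
    """Yield the id of each Topic whose id contains every word of keyword."""
    words = keyword.lower().split()
    # iterparse streams the file; "end" fires once an element is fully read.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "Topic":
            continue
        # Look up the id attribute without hard-coding its namespace.
        topic_id = next(
            (value for name, value in elem.attrib.items()
             if local_name(name) == "id"),
            None,
        )
        if topic_id and all(word in topic_id.lower() for word in words):
            yield topic_id
        elem.clear()  # drop the Topic's children to keep memory use low


def format_topic(topic_id):
    """Turn 'Top/Arts/People' into 'Arts: People' (assumed id layout)."""
    parts = topic_id.split("/")
    if parts and parts[0] == "Top":
        parts = parts[1:]
    return ": ".join(part.replace("_", " ") for part in parts)


# Tiny made-up sample in the assumed shape of the structure dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <Topic r:id="Top/Arts/People/P/Pitt,_Brad">
    <catid>1</catid>
    <d:Title>Pitt, Brad</d:Title>
  </Topic>
  <Topic r:id="Top/Arts/People/P/Pitt,_Brad/Fan_Pages">
    <catid>2</catid>
    <d:Title>Fan Pages</d:Title>
  </Topic>
  <Topic r:id="Top/Arts/Movies">
    <catid>3</catid>
    <d:Title>Movies</d:Title>
  </Topic>
</RDF>"""

matches = list(find_categories(io.BytesIO(SAMPLE), "Brad Pitt"))
for topic_id in matches:
    print(format_topic(topic_id))
```

For the real dump, I would pass the path to structure.rdf.u8 instead of the io.BytesIO sample. This sketch only lists category names; it does not compute the per-category counts shown in the example above.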

I started with https://github.com/kremso/dmoz-parser and made a simple topic filter: https://github.com/lawrencecreates/dmoz-parser/blob/master/sample.py#L6

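class LawrenceFilter:
    """Writes every page whose topic is under Lawrence, Kansas to seeds.txt."""

    def __init__(self):
        self._file = open("seeds.txt", "w")

    def page(self, page, content):
        # Skip missing or empty page URLs.
        if page:
            topic = content['topic']
            # Substring test; find(...) > 0 would miss a match at index 0.
            if 'United_States/Kansas/Localities/L/Lawrence' in topic:
                self._file.write(page + "\n")
                print("found page %s in topic %s" % (page, topic))

    def finish(self):
        # Close the output file once parsing is done.
        self._file.close()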

