
How to obtain URLs from DMOZ ODP

I want to use the URLs listed in DMOZ ODP as a database for my application (an array of URL strings, or a file containing them). Is there any way of obtaining them other than manual copy-paste?

EDIT:

Is there any script or code to parse the RDF file?

Take a look at http://rdf.dmoz.org/. You'll need to find a way to parse the RDF into your database.

I did this the other day using the odp2db scripts from Steve's Software . They're old, but the format hasn't changed significantly so they work fine.

I found I didn't need to do the iconv and xmlclean.pl steps suggested in the readme, just uncompressed the dumps and ran the structure2db.pl and content2db.pl scripts. You'll need to create the database tables manually (see the SQL at top of script for that) and modify the connection details in the scripts before you start.

With the mid-January 2009 dump I used, there were 756,962 categories and 4,436,796 websites. It took a while to run through them all, but not excessively long, though I did dispense with the site descriptions as I didn't need them. It may also be worth adding database indices after creating the tables to speed up access later. The raw structure and content files were 75MB and 300MB compressed, and 848MB and 2GB uncompressed, respectively.

I've actually done this in Java. I just used the SAX API to read through the RDF files. It was pretty straightforward. In my case I wanted to pull out every URL that was in a topic with "Weblogs" in the topic name.

Basically what I did was implement an org.xml.sax.helpers.DefaultHandler.

Then to set up the code you do:

       InputSource is = new InputSource(new FileInputStream("filename.rdf"));
       XMLReader r = XMLReaderFactory.createXMLReader();
       r.setContentHandler(new MyHandlerClass());
       r.parse(is);

and that's pretty much it. In my handler class I had to implement:

  • startElement(String uri, String localName, String qName, Attributes attributes), where an if statement checked whether it was an "ExternalPage" tag; if so, I switched to another state to look for "topic", "Title" and "Description".

  • characters(char[] ch, int start, int length), where I read in the topic, title, and description text depending on which one had most recently been sent to startElement.

  • endElement(String uri, String localName, String qName), where I checked which element was ending; if it was ExternalPage, that meant the end of the current entry.
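For illustration, here is a rough sketch of what such a handler could look like. It is not the original code. It assumes the page URL sits in an `about` attribute on `ExternalPage` and that the category path is the text of a child `topic` element (check this against your copy of the dump). The class name `OdpUrlExtractor` and the `extract` helper are made up for this example, and it goes through `SAXParserFactory` rather than `XMLReaderFactory`, which is deprecated in newer JDKs.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the URL of every ExternalPage whose topic
// contains a given substring (e.g. "Weblogs").
final class OdpUrlExtractor extends DefaultHandler {
    private final String topicFilter;
    private final List<String> urls = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentUrl;    // non-null only while inside an ExternalPage
    private String currentTopic;  // topic text seen for the current page, if any

    OdpUrlExtractor(String topicFilter) {
        this.topicFilter = topicFilter;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start collecting text for the new element
        if ("ExternalPage".equals(qName)) {
            currentUrl = attrs.getValue("about"); // assumption: URL is in "about"
            currentTopic = null;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append
        if (currentUrl != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (currentUrl == null) {
            return; // not inside an ExternalPage
        }
        if ("topic".equals(qName)) {
            currentTopic = text.toString().trim();
        } else if ("ExternalPage".equals(qName)) {
            if (currentTopic != null && currentTopic.contains(topicFilter)) {
                urls.add(currentUrl);
            }
            currentUrl = null; // end of the current entry
        }
    }

    // Parses the stream and returns the matching URLs in document order.
    static List<String> extract(InputStream in, String topicFilter) throws Exception {
        OdpUrlExtractor handler = new OdpUrlExtractor(topicFilter);
        // The default factory is not namespace-aware, so qName is the raw tag name
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.urls;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream(args[0])) {
            extract(in, "Weblogs").forEach(System.out::println);
        }
    }
}
```

Run it with the path to the uncompressed content file as the only argument. Title and Description could be captured in endElement the same way as topic if you need them.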

The whole thing was 80-90 lines of code for the basic parsing, so pretty easy to write. It was able to chew through the multi-gigabyte files in... I don't remember exactly, maybe a minute or two? If you just want to query out some specific data, it might be easier to write the code to do that in your handler, rather than trying to load it into a DB.

If you find a tool that works well, that's obviously better than writing your own code. But writing your own code isn't hard! RDF is just an XML format, and it's not nested or anything. A simple SAX parser is easily doable in a day or so.
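If all you need is the flat file of URLs the question asks for, the handler can write them out directly and skip the database entirely. The following is only a sketch under the same assumption that each `ExternalPage` carries its URL in an `about` attribute. `OdpUrlDump` and `writeUrls` are names invented here, and reading the dump through `GZIPInputStream` assumes you kept the gzip-compressed download rather than uncompressing it first.

```java
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: streams every ExternalPage URL to a Writer, one per line.
final class OdpUrlDump {

    // Returns the number of URLs written.
    static long writeUrls(InputStream rdf, Writer out) throws Exception {
        long[] count = {0}; // array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attrs) throws SAXException {
                if (!"ExternalPage".equals(qName)) {
                    return;
                }
                String url = attrs.getValue("about"); // assumption: URL is in "about"
                if (url == null) {
                    return;
                }
                try {
                    out.write(url);
                    out.write('\n');
                    count[0]++;
                } catch (IOException e) {
                    // SAX callbacks cannot throw IOException, so wrap it
                    throw new SAXException(e);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(rdf, handler);
        out.flush();
        return count[0];
    }

    // args[0]: path to the gzip-compressed content dump, args[1]: output text file
    public static void main(String[] args) throws Exception {
        try (InputStream in = new GZIPInputStream(new FileInputStream(args[0]));
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]),
                                                          StandardCharsets.UTF_8)) {
            System.out.println(writeUrls(in, out) + " URLs written");
        }
    }
}
```

Because SAX never holds the whole document in memory, this should cope with the full dump, though I have not timed it against a real one.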

You could always pay one of the notorious editors there and they'd help you out :)
