
Getting a topology on StormCrawler to properly write WARC files

The StormCrawler Maven archetype does not seem to play nice with the WARC module in my project. Currently it only creates empty 0-byte files with names like "crawl-20180802121925-00000.warc.gz". Am I missing something here?

I try to enable WARC writing by creating a default project like so:

mvn archetype:generate -DarchetypeGroupId=com.digitalpebble.stormcrawler -DarchetypeArtifactId=storm-crawler-archetype -DarchetypeVersion=1.10

And then adding the dependency on the WARC module in the pom.xml like so:

    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-warc</artifactId>
        <version>1.10</version>
    </dependency>

And then I add the WARCHdfsBolt to the fetch grouping, trying to write to a local filesystem directory:

import java.util.HashMap;
import java.util.Map;

import org.apache.storm.hdfs.bolt.format.FileNameFormat;
import org.apache.storm.hdfs.bolt.rotation.FileSizeRotationPolicy;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FeedParserBolt;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.SiteMapParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;
import com.digitalpebble.stormcrawler.warc.WARCFileNameFormat;
import com.digitalpebble.stormcrawler.warc.WARCHdfsBolt;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        String[] testURLs = new String[] { "http://www.lequipe.fr/",
                "http://www.lemonde.fr/", "http://www.bbc.co.uk/",
                "http://storm.apache.org/", "http://digitalpebble.com/" };

        builder.setSpout("spout", new MemorySpout(testURLs));

        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("warc", getWarcBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("sitemap", new SiteMapParserBolt())
                .localOrShuffleGrouping("fetch");

        builder.setBolt("feeds", new FeedParserBolt())
                .localOrShuffleGrouping("sitemap");

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("feeds");

        builder.setBolt("index", new StdOutIndexer())
                .localOrShuffleGrouping("parse");

        Fields furl = new Fields("url");

        // can also use MemoryStatusUpdater for simple recursive crawls
        builder.setBolt("status", new StdOutStatusUpdater())
                .fieldsGrouping("fetch", Constants.StatusStreamName, furl)
                .fieldsGrouping("sitemap", Constants.StatusStreamName, furl)
                .fieldsGrouping("feeds", Constants.StatusStreamName, furl)
                .fieldsGrouping("parse", Constants.StatusStreamName, furl)
                .fieldsGrouping("index", Constants.StatusStreamName, furl);

        return submit("crawl", conf, builder);
    }

    private WARCHdfsBolt getWarcBolt() {
        String warcFilePath = "/Users/user/Documents/workspace/test/warc";

        FileNameFormat fileNameFormat = new WARCFileNameFormat()
                .withPath(warcFilePath);

        Map<String,String> fields = new HashMap<>();
        fields.put("software:", "StormCrawler 1.0 http://stormcrawler.net/");
        fields.put("conformsTo:", "http://www.archive.org/documents/WarcFileFormat-1.0.html");

        WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt()
                .withFileNameFormat(fileNameFormat);
        warcbolt.withHeader(fields);

        // can specify the filesystem - will use the local FS by default
//        String fsURL = "hdfs://localhost:9000";
//        warcbolt.withFsUrl(fsURL);

        // a custom max length can be specified - 1 GB will be used as a default
        FileSizeRotationPolicy rotpol = new FileSizeRotationPolicy(50.0f,
                FileSizeRotationPolicy.Units.MB);
        warcbolt.withRotationPolicy(rotpol);
        return warcbolt;
    }
}

Whether I run it locally with or without Flux doesn't seem to make a difference. You can have a look at the demo repo here: https://github.com/keyboardsamurai/storm-test-warc

Thanks for asking this. In theory, content gets written to the WARC files when:

  1. there is an explicit sync, as set by the sync policy, which defaults to every 10 tuples
  2. there is an automatic sync, which happens via tick tuples every 15 seconds by default
  3. the file is rotated - in your case this should happen when the content reaches 50 MB

Since the topology you are using as a starting point is not recursive and does not process more than 5 URLs, conditions 1 and 3 are never met.

You can change that by using

builder.setBolt("status", new MemoryStatusUpdater())

instead. This way new URLs will be processed continuously. Alternatively, you can add

warcbolt.withSyncPolicy(new CountSyncPolicy(1));

to your code so that the synchronization is triggered after every tuple. In practice, you wouldn't need to do that on a real crawl where URLs are coming in constantly.
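To make the placement concrete, here is a sketch of the getWarcBolt() method from the question with that sync policy added; the only additions compared to the code above are the withSyncPolicy call and the import it needs (org.apache.storm.hdfs.bolt.sync.CountSyncPolicy from storm-hdfs):

// needs: import org.apache.storm.hdfs.bolt.sync.CountSyncPolicy;
private WARCHdfsBolt getWarcBolt() {
    String warcFilePath = "/Users/user/Documents/workspace/test/warc";

    FileNameFormat fileNameFormat = new WARCFileNameFormat()
            .withPath(warcFilePath);

    Map<String, String> fields = new HashMap<>();
    fields.put("software:", "StormCrawler 1.0 http://stormcrawler.net/");
    fields.put("conformsTo:", "http://www.archive.org/documents/WarcFileFormat-1.0.html");

    WARCHdfsBolt warcbolt = (WARCHdfsBolt) new WARCHdfsBolt()
            .withFileNameFormat(fileNameFormat);
    warcbolt.withHeader(fields);

    // sync to the WARC file after every single tuple - only sensible for a
    // tiny test topology like this one; on a real crawl the default count
    // of 10 (or the tick tuples) is enough
    warcbolt.withSyncPolicy(new CountSyncPolicy(1));

    // rotate once the file reaches 50 MB
    FileSizeRotationPolicy rotpol = new FileSizeRotationPolicy(50.0f,
            FileSizeRotationPolicy.Units.MB);
    warcbolt.withRotationPolicy(rotpol);
    return warcbolt;
}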

Now the weird thing is that, regardless of whether the sync is triggered by condition 1 or 2, I can't see any change to the file at all and it remains at 0 bytes. This is not the case with version 1.8

    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-warc</artifactId>
        <version>1.8</version>
    </dependency>

so it could be due to a change in the code after that.

I know that some users have been relying on FileTimeSizeRotationPolicy, which can trigger condition 3 above based on time.
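For illustration, usage could look roughly like the snippet below. Note that the constructor and the setTimeRotationInterval call are assumptions modelled on the FileSizeRotationPolicy usage above, so check the FileTimeSizeRotationPolicy class in the storm-crawler-warc module for the exact signatures before relying on it:

// Assumed API - verify against FileTimeSizeRotationPolicy in storm-crawler-warc.
// Idea: keep the 50 MB size threshold, but also rotate (and therefore close and
// flush) the file after a fixed amount of time, even if the crawl is slow.
FileTimeSizeRotationPolicy rotpol = new FileTimeSizeRotationPolicy(50.0f,
        FileSizeRotationPolicy.Units.MB);
rotpol.setTimeRotationInterval(10, FileTimeSizeRotationPolicy.TimeUnit.MINUTES);
warcbolt.withRotationPolicy(rotpol);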

Feel free to open an issue on GitHub and I'll have a closer look at it (when I am back next month).

EDIT: there was a bug with the compression of the entries; it has now been fixed and will be part of the next StormCrawler release.

See comments on the issue kindly posted by the OP.
