
Importing data from WARC files (Web ARChive)

I'm dealing with a not-so-normal use case where the data is stored in WARC files (https://en.wikipedia.org/wiki/Web_ARChive), and I want to import that data into Neo4j.

One solution I can think of is to parse the WARC file (with some Java code to read it), then write the structured data to CSV so that it can be loaded with an import tool.

Is extracting into CSV the only option to load data into Neo4j?

Could you give me some advice on how to go about implementing this use case?


Thanks,
Phaneendra

It depends.

It depends on what data you want to load from the Web Archive. If you're talking about loading the metadata, then you do not need the intermediate step: process the file and insert the data straight into the database. You could use a stored procedure for that (the APOC library is full of similar things) or a small server application written in your favorite language with the matching driver.
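As a rough illustration of the direct route, here is a minimal Python sketch that reads the record headers of an uncompressed WARC file and sends them to Neo4j in one batched query. Several things in it are assumptions and not part of the original answer: the hand-written parser ignores folded header lines, the labels `WarcRecord` and `Page` and the relationship `CAPTURE_OF` are hypothetical names, and the connection details are placeholders. For a `.warc.gz` file, `gzip.open(path, "rb")` should be usable in place of `open`.

```python
import io

# Hypothetical graph model: one node per WARC record, linked to the page it captured.
CYPHER = """
UNWIND $rows AS row
MERGE (r:WarcRecord {id: row.id})
SET r.type = row.type, r.date = row.date
WITH r, row WHERE row.uri IS NOT NULL
MERGE (p:Page {uri: row.uri})
MERGE (r)-[:CAPTURE_OF]->(p)
"""

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload

def record_to_row(headers):
    """Keep only the metadata fields used as node properties."""
    return {
        "id": headers.get("WARC-Record-ID"),
        "type": headers.get("WARC-Type"),
        "uri": headers.get("WARC-Target-URI"),
        "date": headers.get("WARC-Date"),
    }

def load_into_neo4j(rows, uri, user, password):
    """Send all rows in a single parameterized query (needs the neo4j driver package)."""
    from neo4j import GraphDatabase  # imported here so the parsing code has no dependency

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            session.run(CYPHER, {"rows": rows}).consume()

if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as warc:  # path to an uncompressed .warc file
        rows = [record_to_row(headers) for headers, _ in iter_warc_records(warc)]
    # Placeholder connection details; replace with your own.
    load_into_neo4j(rows, "bolt://localhost:7687", "neo4j", "password")
```

For a large archive it would probably be better to send the rows in chunks of a few thousand rather than in one list.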

If you're talking about the content inside the Web Archive, it's a different story. Neo4j is not a blob/document store, so you would have to extract and interpret the archived files first. That would probably be more efficient as an indirect process, with a separate extraction step that produces a file to import.
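To make the indirect route concrete, here is a minimal Python sketch that pulls one kind of structure out of the archived content and writes it to CSV. It assumes, purely for illustration, that the interesting data is the hyperlinks in archived HTML responses; what you actually extract depends on your use case. The parser handles only uncompressed WARC input and ignores folded header lines, the payload is naively decoded as UTF-8, and the column names `source` and `target` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        payload = stream.read(int(headers.get("Content-Length", 0)))
        yield headers, payload

def extract_links(headers, payload):
    """Return (source, target) pairs for the links in one archived HTTP response."""
    if headers.get("WARC-Type") != "response":
        return []
    base = headers.get("WARC-Target-URI", "")
    # A response payload is the raw HTTP message: skip the HTTP headers.
    _, _, body = payload.partition(b"\r\n\r\n")
    collector = LinkCollector()
    collector.feed(body.decode("utf-8", "replace"))
    return [(base, urljoin(base, href)) for href in collector.links]

def write_edges_csv(warc_stream, csv_file):
    """Write one CSV row per link and return the number of rows written."""
    writer = csv.writer(csv_file)
    writer.writerow(["source", "target"])
    count = 0
    for headers, payload in iter_warc_records(warc_stream):
        for source, target in extract_links(headers, payload):
            writer.writerow([source, target])
            count += 1
    return count

if __name__ == "__main__":
    import sys

    # Usage: python extract_links.py input.warc edges.csv
    with open(sys.argv[1], "rb") as warc, open(sys.argv[2], "w", newline="") as out:
        print(write_edges_csv(warc, out), "links written")
```

The resulting file could then be loaded with Cypher's `LOAD CSV` or with the `neo4j-admin` import tool, for example by merging a node per URL and a relationship per row.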

Hope this helps, Tom

BTW, CSV is not the only format that can be loaded. There are also procedures for loading XML, JSON, ...
