I've been given a university assignment: store PDF documents efficiently in a PDF store, with each document stored only once (uploading the same file several times must not duplicate its content).
The method to implement is store(String title, File pdfFile).
Example 1:
"Fast Cars", fastcars.pdf
"Even Faster Cars", fastcars.pdf
"Not So Fast Cars", cars.pdf
"Slow Cars", slowcars.pdf
Expected result: the store should have size 3 and contain fastcars.pdf, cars.pdf and slowcars.pdf.
Example 2:
"Fast Cars", fastcars.pdf
"Even Faster Cars", fastcars.pdf
"Fast Cars", sportscars.pdf
"Even Faster Cars", sportscars.pdf
Expected result: the store should have size 1 and contain only sportscars.pdf.
My idea is to hash the content of each PDF and possibly use a HashMap that maps the content digest to a random integer, and then map that integer to the PDF title?
The tricky part is trying to satisfy Example 2.
What data structure would you recommend for this problem for efficiency and what approach would you take?
Thanks in advance
I read the input from the console.
Test case 1, input:
FastCars fastcars.pdf
EvenFasterCars fastcars.pdf
NotSoFastCars cars.pdf
SlowCars slowcars.pdf
Output:
slowcars.pdf
fastcars.pdf
cars.pdf
Test case 2, input:
FastCars fastcars.pdf
EvenFasterCars fastcars.pdf
FastCars sportscars.pdf
EvenFasterCars sportscars.pdf
Output:
sportscars.pdf
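import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

public class Main {
    public static void main(String[] args) throws Exception {
        // map1: title -> file name; storing a title again overwrites its earlier file
        Map<String, String> map1 = new HashMap<>();
        // map2: file name -> title; used below to print each file name only once
        Map<String, String> map2 = new HashMap<>();
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        // Each test case consists of exactly four "title fileName" lines
        for (int i = 0; i < 4; i++) {
            String[] inpt = br.readLine().split(" ");
            String tag = inpt[0];
            String fileName = inpt[1];
            map1.put(tag, fileName);
            map2.put(fileName, tag);
        }
        // Files are identified by name here, not by content
        for (String key : map1.keySet()) {
            String fileName = map1.get(key);
            if (map2.containsKey(fileName)) {
                System.out.println(fileName);
                // Remove the name so a second title pointing at it prints nothing
                map2.remove(fileName);
            }
        }
    }
}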
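The console version above tells files apart by name. For the assignment's store(String title, File pdfFile), the content-hash idea from the question could be sketched roughly as follows. This is only one possible reading of the two examples: it assumes that storing a title again replaces that title's earlier file, and that content no longer referenced by any title is dropped. The class name PdfStore and the helpers storeBytes, size, containsContent and sha256Hex are hypothetical, and holding the bytes in memory is a simplification.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative and not taken from the assignment.
final class PdfStore {
    // title -> hex digest of the content currently stored under that title
    private final Map<String, String> digestByTitle = new HashMap<>();
    // hex digest -> content, kept once no matter how many titles share it
    private final Map<String, byte[]> contentByDigest = new HashMap<>();
    // hex digest -> number of titles currently pointing at that content
    private final Map<String, Integer> titleCount = new HashMap<>();

    void store(String title, File pdfFile) throws IOException {
        storeBytes(title, Files.readAllBytes(pdfFile.toPath()));
    }

    // Assumed helper so the logic can be exercised without real files.
    void storeBytes(String title, byte[] content) {
        String digest = sha256Hex(content);
        String previous = digestByTitle.put(title, digest);
        if (digest.equals(previous)) {
            return; // same title, same content: nothing changes
        }
        contentByDigest.putIfAbsent(digest, content);
        titleCount.merge(digest, 1, Integer::sum);
        if (previous != null
                && titleCount.merge(previous, -1, Integer::sum) == 0) {
            // The last title using the old content moved away: drop it.
            titleCount.remove(previous);
            contentByDigest.remove(previous);
        }
    }

    // Number of distinct documents currently stored.
    int size() {
        return contentByDigest.size();
    }

    boolean containsContent(byte[] content) {
        return contentByDigest.containsKey(sha256Hex(content));
    }

    private static String sha256Hex(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Two maps keyed by title and by digest keep every store call at expected constant time apart from the hashing itself, and the per-digest counter is what lets Example 2 end with a single document. A stricter version might also compare the bytes when two digests match, to rule out a hash collision.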
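Many PDF files carry a file identifier (the ID entry in the file trailer) as part of their metadata, although older versions of the PDF specification make it optional, so check that it is present. You might want to just use that string as the file name. Most PDF library tools allow easy access to this metadata.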