How to index HDFS PDF files in Solr?

Question

hadoop jar jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -DcsvFieldMapping=0=id,1=text -cls com.lucidworks.hadoop.ingest.CSVIngestMapper -c hdp1 -i /user/solr/data/csv/mydata.csv -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8983/solr

I tried running the command above against PDF files, but the output is not what I expected.

com.lucidworks.hadoop.ingest.CSVIngestMapper is meant exclusively for CSV files. Is there an equivalent mapper for PDF files? Thanks in advance for your help.

Answer 1

You should use the DirectoryIngestMapper:

hadoop jar jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar com.lucidworks.hadoop.ingest.IngestJob -Dlww.commit.on.close=true -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper -c hdp1 -i /user/solr/data/pdf/*.pdf -of com.lucidworks.hadoop.io.LWMapRedOutputFormat -s http://localhost:8983/solr

This assumes that /user/solr/data/pdf/*.pdf matches the location of your PDF files in HDFS.
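
If it helps, here is a minimal sketch of a wrapper that only assembles and prints the command so you can review it before running it. The variable names and the build_ingest_cmd function are hypothetical; the jar path, collection, input path and Solr URL are the ones from the question. The input glob is wrapped in single quotes so that the local shell does not try to expand it against the local filesystem and the pattern is resolved against HDFS instead.

```shell
#!/bin/sh
# Hypothetical wrapper: names below are illustrative, values are taken from the question.
JOB_JAR="jobjar/hadoop/hadoop-lws-job-1.2.0-0-0.jar"
COLLECTION="hdp1"
INPUT_GLOB="/user/solr/data/pdf/*.pdf"
SOLR_URL="http://localhost:8983/solr"

build_ingest_cmd() {
  # Print the command rather than executing it (dry run).
  # The glob is emitted inside single quotes so a local shell will not expand it.
  printf '%s ' hadoop jar "$JOB_JAR" com.lucidworks.hadoop.ingest.IngestJob \
    -Dlww.commit.on.close=true \
    -cls com.lucidworks.hadoop.ingest.DirectoryIngestMapper \
    -c "$COLLECTION" \
    -i "'$INPUT_GLOB'" \
    -of com.lucidworks.hadoop.io.LWMapRedOutputFormat \
    -s "$SOLR_URL"
  printf '\n'
}

build_ingest_cmd
```

Once the printed command looks right, you can paste it into a terminal on a machine where the hadoop client is configured for your cluster.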

solution1
0 2015-05-07 22:36:49