
How do I extract sections (multiple sections per page, multiple pages) of a Word document/PDF/image as separate images/Word documents/PDFs?

Here's the basic problem: I have about 10,000 Word documents that contain blocks of data. Each block is numbered and also has an accompanying image. I need to somehow store these individual blocks in a database as images (text would be great, but see the note below), without the numbering.

I can go through and have typists mark the beginning and end of each block using ###QUESTIONSTART###, ###QUESTIONEND### or whatever. The plan is to take each document, convert it to one big image, look for those tags, extract the part between the tags as an image and then move on to the next block.
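If the files are .docx (an assumption; the older binary .doc format would need converting first), the typists' markers can be found in the document text itself before anything is rendered to an image. A minimal sketch using only the standard library; the function name `split_blocks` is made up for illustration, and it only collects paragraph text, so equations and pictures inside a block would still have to be rendered separately:

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
START, END = "###QUESTIONSTART###", "###QUESTIONEND###"

def split_blocks(docx_path):
    """Return a list of blocks; each block is the list of paragraph texts
    found between a START marker paragraph and the next END marker."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    blocks, current = [], None
    for para in root.iter(W + "p"):
        # Word may split one paragraph into several runs; join them.
        text = "".join(t.text or "" for t in para.iter(W + "t")).strip()
        if text == START:
            current = []
        elif text == END:
            if current is not None:
                blocks.append(current)
            current = None
        elif current is not None:
            current.append(text)
    return blocks
```

Anything outside a start/end pair, such as the block numbering, is simply dropped.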

I've been looking at some APIs and I think I can crop the images once I figure out how to get the coordinates of each start/end marker. Any suggestions? I'd hate to write a pixel-by-pixel matcher that has to run in O(number of blocks * n^2).
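Once the marker positions are known, the cropping step is mostly bookkeeping. A rough sketch, assuming each marker has already been located as a (kind, top, bottom) row span in pixel coordinates on the rendered page; `crop_boxes` and that tuple layout are assumptions for illustration, and the (left, upper, right, lower) box order is the one Pillow's `Image.crop` takes:

```python
def crop_boxes(markers, page_width):
    """markers: iterable of (kind, top, bottom), where kind is "start" or
    "end" and top/bottom are the pixel rows the marker occupies.
    Returns one (left, upper, right, lower) box per start/end pair,
    covering the rows between the two markers."""
    boxes, upper = [], None
    for kind, top, bottom in sorted(markers, key=lambda m: m[1]):
        if kind == "start":
            upper = bottom              # block begins below the start marker
        elif kind == "end" and upper is not None:
            if top > upper:             # skip empty or overlapping spans
                boxes.append((0, upper, page_width, top))
            upper = None                # an end without a start is ignored
    return boxes
```

Because the markers themselves fall outside each box, the tags do not end up in the stored images.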

NOTE: These blocks contain complex equations/math type stuff, hence the images. I don't have the $$ to get 1000 typists trained in TeX and retype the whole deal. OCR doesn't cut it yet.

I can't follow all of your question, but it seems to me that Tika could help you.

If you can have typists add block marks to 10,000 documents, why can't the typists

  • Open the Word document
  • Copy the image from the Word document
  • Paste the image into Paint
  • Save the image to their disk?

You can come up with an image naming scheme that makes sense to you and your typists.
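If the files are .docx, the copy-and-paste steps may not need a person at all: a .docx is a zip archive, and embedded pictures are stored under word/media/. The sketch below is an alternative to the manual route, not what this answer proposes; the `<document>_<n><ext>` naming scheme and the function name `extract_images` are assumptions. It only recovers pictures, so equations typed with Word's built-in equation editor, which are stored as markup, would not come out this way.

```python
import zipfile
from pathlib import Path

def extract_images(docx_path, out_dir):
    """Copy every file under word/media/ out of a .docx, naming each one
    <document stem>_<n><ext>. Returns the list of written paths."""
    docx_path, out_dir = Path(docx_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as zf:
        media = sorted(n for n in zf.namelist()
                       if n.startswith("word/media/") and not n.endswith("/"))
        for i, name in enumerate(media, start=1):
            target = out_dir / f"{docx_path.stem}_{i:03d}{Path(name).suffix}"
            target.write_bytes(zf.read(name))
            written.append(target)
    return written
```

The numbering follows the sorted archive names, which is not necessarily the order in which the pictures appear on the page.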

Then you can collect the images from the disk drives with a program and load them into your database.
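The collection step could look something like this. A minimal sketch, assuming SQLite as the database and a single `blocks` table; the table layout, the extension list and the function name `load_images` are all made up for illustration:

```python
import sqlite3
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".bmp"}

def load_images(image_dir, db_path):
    """Store every image file under image_dir as a BLOB keyed by file name.
    Returns the number of image files found."""
    files = sorted(p for p in Path(image_dir).rglob("*")
                   if p.is_file() and p.suffix.lower() in IMAGE_EXTS)
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS blocks ("
                         "name TEXT PRIMARY KEY, image BLOB NOT NULL)")
            conn.executemany(
                "INSERT OR REPLACE INTO blocks (name, image) VALUES (?, ?)",
                ((p.name, p.read_bytes()) for p in files))
    finally:
        conn.close()
    return len(files)
```

Keying on the file name means the naming scheme has to be unique across typists; a later file with the same name replaces the earlier row.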
