
How do I extract sections (multiple sections per page, multiple pages) of a Word document/PDF/image as separate images/Word documents/PDFs?

Original post: 2010-06-30 10:15:51 5 2 c#/ java/ c++/ pdf/ image

Here's the basic problem: I have about 10,000 Word documents that contain blocks of data. Each block is numbered and also has an accompanying image. I need to somehow store these individual blocks in a database as images (text would be great, but read the note below), without the numbering.

I can go through and have typists mark the beginning and end of each block using ###QUESTIONSTART###, ###QUESTIONEND### or whatever. I am trying to take that document, convert it to one big image, look for those tags, extract the part between the tags as an image and then move on to the next block.

I've been looking at some APIs and I think I can definitely crop the images once I figure out how to get the coordinates of each start/end marker. Any suggestions? I'd hate to write a pixel-by-pixel matcher that has to run in O(number of blocks * n^2).

NOTE: These blocks contain complex equations/math-type stuff, hence the images. I don't have the $$ to get 1000 typists trained in TeX to retype the whole deal. OCR doesn't cut it yet.
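One way to sidestep pixel matching entirely is to find the markers in the document structure instead of in a rendered image. The sketch below is not the approach described above but a swapped-in alternative, and it rests on two assumptions: the files are (or can be converted to) .docx, and the typists type each marker as ordinary text in a paragraph of its own. A .docx file is a ZIP archive whose main content lives in `word/document.xml`; the function names `paragraph_texts` and `block_ranges` are made up for illustration. It only reports which paragraphs belong to each block; rendering those paragraphs to images is left out.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
START, END = "###QUESTIONSTART###", "###QUESTIONEND###"


def paragraph_texts(docx_file):
    """Return the plain text of every paragraph in a .docx, in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Word may split one typed marker across several runs,
    # so join all <w:t> pieces found inside each paragraph.
    return [
        "".join(t.text or "" for t in p.iter(W_NS + "t"))
        for p in root.iter(W_NS + "p")
    ]


def block_ranges(texts):
    """Pair markers into (first, last) paragraph indices, markers excluded."""
    ranges, start = [], None
    for i, text in enumerate(texts):
        stripped = text.strip()
        if stripped == START:
            start = i + 1
        elif stripped == END and start is not None:
            ranges.append((start, i - 1))
            start = None
    return ranges
```

Calling `block_ranges(paragraph_texts("some.docx"))` would give one index pair per marked block, in a single pass over the paragraphs rather than a scan over pixels.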

2 answers

I can't follow all of your question, but it seems to me that Tika could help you.

If you can have typists add block marks to 10,000 documents, why can't the typists

Open the Word document
Copy the image from the Word document
Paste the image into Paint
Save the image to their disk?

You can come up with an image naming scheme that makes sense to you and your typists.

Then you can collect the images from the disk drives with a program and load them into your database.
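The collection step could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the naming scheme `<document id>_<block number>.png`, the `blocks` table layout, the function name `load_images`, and SQLite as the database. Files whose names do not follow the scheme are returned rather than loaded, so they can be checked by hand.

```python
import re
import sqlite3
from pathlib import Path

# Hypothetical naming scheme: <document id>_<block number>.png,
# for example "doc0042_07.png".
NAME_PATTERN = re.compile(r"^(?P<doc>[A-Za-z0-9]+)_(?P<block>\d+)\.png$")


def load_images(image_dir, conn):
    """Insert every correctly named PNG under image_dir; return skipped paths."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks ("
        "doc_id TEXT, block_no INTEGER, image BLOB, "
        "PRIMARY KEY (doc_id, block_no))"
    )
    skipped = []
    for path in sorted(Path(image_dir).rglob("*.png")):
        match = NAME_PATTERN.match(path.name)
        if match is None:
            # Name does not follow the scheme; leave it for manual review.
            skipped.append(path)
            continue
        conn.execute(
            "INSERT OR REPLACE INTO blocks VALUES (?, ?, ?)",
            (match["doc"], int(match["block"]), path.read_bytes()),
        )
    conn.commit()
    return skipped
```

A call such as `load_images("collected_images", sqlite3.connect("blocks.db"))` would walk the folder recursively; re-running it replaces rows for the same document and block number instead of duplicating them.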


