Need suggestions on how to extract data from .docx/.doc file then into SQL Server

I'm supposed to develop an application for my project. It will load past-year examination/exercise papers (Word files), detect the sections, extract the questions and images in each section, and then store the questions and images in the database. (A preview of the question paper is at the bottom of this post.)

So I need some suggestions on how to extract data from a Word file and then insert it into a database. I currently have a couple of methods in mind, but I have no idea how to implement them when the file contains text boxes with background images. Each question has to be linked to its image.

Method One (use MS Office interop)

  • Load the Word file -> Extract the images, save them into a folder -> Extract the text, save it as .txt -> Read the text from the .txt file, then store it in the db

Questions:

  • How do I detect the sections and questions?
  • How do I link the image to the question?
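
One possible way to approach both questions, once the text has been pulled out line by line, is to classify each line with regular expressions and attach every image marker to the most recently seen question. The sketch below is only an illustration under assumptions: the "Section A" heading style, the "1." / "1)" question numbering, and the "[image:...]" marker are all hypothetical (the real paper layout is only visible in the preview image), and the names Question and PaperParser are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical container for one extracted question.
public sealed class Question
{
    public string Section;                                 // e.g. "A"
    public int Number;                                     // question number inside the section
    public string Text;                                    // question text, continuation lines appended
    public List<string> ImageKeys = new List<string>();    // images linked to this question
}

public static class PaperParser
{
    // ASSUMPTION: sections are introduced by a line such as "Section A".
    static readonly Regex SectionRx =
        new Regex(@"^\s*Section\s+([A-Z])\b", RegexOptions.IgnoreCase);

    // ASSUMPTION: questions start with "1." or "1)" followed by text.
    static readonly Regex QuestionRx = new Regex(@"^\s*(\d+)[.)]\s+(.*)$");

    // ASSUMPTION: the extraction step leaves a marker like "[image:img1.png]"
    // on its own line where a picture was found.
    static readonly Regex ImageRx = new Regex(@"^\s*\[image:(.+)\]\s*$");

    public static List<Question> Parse(IEnumerable<string> lines)
    {
        var result = new List<Question>();
        string section = null;
        Question current = null;

        foreach (string line in lines)
        {
            Match m;
            if ((m = SectionRx.Match(line)).Success)
            {
                // A new section starts; the next question belongs to it.
                section = m.Groups[1].Value.ToUpperInvariant();
                current = null;
            }
            else if ((m = QuestionRx.Match(line)).Success)
            {
                current = new Question
                {
                    Section = section,
                    Number = int.Parse(m.Groups[1].Value),
                    Text = m.Groups[2].Value
                };
                result.Add(current);
            }
            else if ((m = ImageRx.Match(line)).Success)
            {
                // Link the image to the question it follows.
                if (current != null)
                    current.ImageKeys.Add(m.Groups[1].Value.Trim());
            }
            else if (current != null && line.Trim().Length > 0)
            {
                // Any other non-empty line continues the current question.
                current.Text += " " + line.Trim();
            }
        }
        return result;
    }
}
```

Each Question then carries both its section and its image keys, so the link survives when the rows are written to the database (for example as a foreign key from an image table to a question table).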

Extract text from Word file (working):

// Requires a reference to Microsoft.Office.Interop.Word and:
// using Word = Microsoft.Office.Interop.Word;
private object missing = Type.Missing;
private object sFilename = @"C:\temp\questionpaper.docx";
private object sFilename2 = @"C:\temp\temp.txt";
private object readOnly = true;
private object fileFormat = Word.WdSaveFormat.wdFormatText;

private void button1_Click(object sender, EventArgs e)
{
   Word.Application wWordApp = new Word.Application();
   wWordApp.DisplayAlerts = Word.WdAlertLevel.wdAlertsNone;
   try
   {
      // Open the source document read-only.
      Word.Document dFile = wWordApp.Documents.Open(ref sFilename,
                               ref missing, ref readOnly, ref missing, ref missing,
                               ref missing, ref missing, ref missing, ref missing,
                               ref missing, ref missing, ref missing, ref missing,
                               ref missing, ref missing, ref missing);

      // Save a plain-text copy, then close the document.
      dFile.SaveAs(ref sFilename2, ref fileFormat, ref missing, ref missing,
               ref missing, ref missing, ref missing, ref missing, ref missing,
               ref missing, ref missing, ref missing, ref missing, ref missing,
               ref missing, ref missing);
      dFile.Close(ref missing, ref missing, ref missing);
   }
   finally
   {
      // Quit Word so that no WINWORD.EXE process is left running.
      wWordApp.Quit(ref missing, ref missing, ref missing);
   }
}

Extract images from Word file (doesn't work on images inside a text box):

// Requires references to Microsoft.Office.Interop.Word and System.Windows.Forms.
// Computer is presumably Microsoft.VisualBasic.Devices.Computer.
private Word.Application wWordApp;
private int m_i;   // 1-based index of the inline shape currently being copied
private object missing = Type.Missing;
private object filename = @"C:\temp\questionpaper.docx";
private object readOnly = true;

// Copies inline shape number m_i to the clipboard and saves it as a PNG.
private void CopyFromClipbordInlineShape(String imageIndex)
{
   Word.InlineShape inlineShape = wWordApp.ActiveDocument.InlineShapes[m_i];
   inlineShape.Select();
   wWordApp.Selection.Copy();
   Computer computer = new Computer();
   if (computer.Clipboard.GetDataObject() != null)
   {
      System.Windows.Forms.IDataObject data = computer.Clipboard.GetDataObject();
      // Only save when the clipboard content can be read as a bitmap.
      if (data.GetDataPresent(System.Windows.Forms.DataFormats.Bitmap))
      {
         Image image = (Image)data.GetData(System.Windows.Forms.DataFormats.Bitmap, true);
         image.Save("C:\\temp\\DoCremoveImage" + imageIndex + ".png", System.Drawing.Imaging.ImageFormat.Png);
      }
   }
}

private void button1_Click(object sender, EventArgs e)
{
   wWordApp = new Word.Application();
   wWordApp.Documents.Open(ref filename,
                               ref missing, ref readOnly, ref missing, ref missing,
                               ref missing, ref missing, ref missing, ref missing,
                               ref missing, ref missing, ref missing, ref missing,
                               ref missing, ref missing, ref missing);
   try
   {
      // Word collections are 1-based.
      for (int i = 1; i <= wWordApp.ActiveDocument.InlineShapes.Count; i++)
      {
         m_i = i;
         CopyFromClipbordInlineShape(Convert.ToString(i));
      }
   }
   finally
   {
      // Quit Word without saving changes.
      object save = false;
      wWordApp.Quit(ref save, ref missing, ref missing);
      wWordApp = null;
   }
}

Method Two

  • Unzip the Word file (.docx) -> Copy the media (image) folder and store it somewhere -> Parse the XML file -> Store the text in the db
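
A rough sketch of how Method Two could look using only the .NET base class library (System.IO.Compression and System.Xml.Linq). It relies on the standard .docx package layout, where the main text is in word/document.xml and embedded pictures are under word/media/. The class name DocxUnzip and its method names are made up for this example. Note one limitation that matters for this paper: paragraphs inside a text box are nested in the paragraph that anchors the text box, so their text will show up both on its own and as part of the anchoring paragraph; this sketch does not deduplicate that.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

public static class DocxUnzip
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the text of every w:p element in word/document.xml, in document order.
    public static List<string> ReadParagraphs(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream s = entry.Open())
            {
                XDocument xml = XDocument.Load(s, LoadOptions.PreserveWhitespace);
                // A paragraph's text is the concatenation of its w:t runs.
                return xml.Descendants(W + "p")
                          .Select(p => string.Concat(
                              p.Descendants(W + "t").Select(t => t.Value)))
                          .ToList();
            }
        }
    }

    // Copies every file under word/media/ into targetDir; returns the file names.
    public static List<string> ExtractMedia(Stream docx, string targetDir)
    {
        Directory.CreateDirectory(targetDir);
        var names = new List<string>();
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            foreach (ZipArchiveEntry e in zip.Entries)
            {
                bool isMedia = e.FullName.StartsWith(
                    "word/media/", StringComparison.OrdinalIgnoreCase);
                if (!isMedia || e.Name.Length == 0) continue;  // skip folders

                using (Stream src = e.Open())
                using (FileStream dst = File.Create(Path.Combine(targetDir, e.Name)))
                {
                    src.CopyTo(dst);
                }
                names.Add(e.Name);
            }
        }
        return names;
    }
}
```

This gets the text and the image files out, but it does not yet say which image belongs to which paragraph. For that, the relationship ids in word/_rels/document.xml.rels would have to be matched against the r:embed attributes inside each paragraph.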

Any suggestion/help would be greatly appreciated :D

Preview of the Word file (backup link: http://i.stack.imgur.com/YF1Ap.png)

The answer is choice #3 - the OpenXML SDK. First let me explain why you don't want the choices listed above.

  1. Running Office on the server is a bad idea. Microsoft specifically says not to do it. It's slow, and you will hit "issues" where it throws exceptions or just fails to find things.

  2. Parsing the XML file will work, but the XPath queries needed to cover every possible place an image (or anything else) can be located add up. You would probably have to iterate over the section properties (the sectPr elements, which come at the end of each section), then handle every case: in a table cell, in a text box, positioned, inline, etc.

If you go with the OpenXML SDK, you get a LINQ-friendly interface where you can call Descendants and get everything that is an image (or whatever you need). It also exposes sections through the SectPr node, so you can easily iterate over sections.
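
As a minimal sketch of that idea (not a complete solution), the code below opens the document with the OpenXML SDK (NuGet package DocumentFormat.OpenXml), counts the section property elements, and saves every embedded DrawingML picture together with the text of the paragraph it sits in. The class name OpenXmlSketch, the method name ExtractImages and the output file naming are made up for this example. One caveat, stated as an assumption rather than a tested fact for this particular paper: only DrawingML pictures (a:blip elements) are covered here; older VML content such as v:imagedata or v:fill, which text boxes and their background fills may use, is a different element type and would need separate handling.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;
using A = DocumentFormat.OpenXml.Drawing;

public static class OpenXmlSketch
{
    // Saves each embedded picture to outDir and returns a map of
    // saved file path -> text of the nearest enclosing paragraph.
    public static Dictionary<string, string> ExtractImages(string docxPath, string outDir)
    {
        var saved = new Dictionary<string, string>();
        Directory.CreateDirectory(outDir);

        // Open read-only (second argument: isEditable = false).
        using (WordprocessingDocument doc = WordprocessingDocument.Open(docxPath, false))
        {
            MainDocumentPart main = doc.MainDocumentPart;
            Body body = main.Document.Body;

            // Every w:sectPr element marks the end of one section.
            int sections = body.Descendants<SectionProperties>().Count();
            Console.WriteLine("Section property elements: " + sections);

            int n = 0;
            foreach (A.Blip blip in body.Descendants<A.Blip>())
            {
                if (blip.Embed == null) continue;   // linked, not embedded

                // The r:embed id points at the image part inside the package.
                var part = main.GetPartById(blip.Embed.Value) as ImagePart;
                if (part == null) continue;

                n++;
                string ext = Path.GetExtension(part.Uri.OriginalString);
                string target = Path.Combine(outDir, "image" + n + ext);
                using (Stream src = part.GetStream())
                using (FileStream dst = File.Create(target))
                {
                    src.CopyTo(dst);
                }

                // Nearest enclosing paragraph, used as context for linking
                // the picture to a question.
                Paragraph para = blip.Ancestors<Paragraph>().FirstOrDefault();
                saved[target] = para == null ? "" : para.InnerText;
            }
        }
        return saved;
    }
}
```

The paragraph text returned for each picture is only a starting point for matching the image to a question; how reliable it is depends on how the paper anchors its pictures and text boxes.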
