
How to remove characters from an Excel sheet?

My overall problem is that I have a large Excel file (column AS, 85,000 rows) that I want to convert to XML. The data in the cells is all text.

The process I'm using now is to manually save the Excel file as CSV, then parse that in my own C# program to turn it into XML. If you have a better approach, please recommend it. I've searched SO, and the only fast methods I found for converting straight to XML require the data to be all numeric. (I tried reading cell by cell; it would have taken 3 days to process.)

So, unless you can recommend a different way for me to approach the problem, I want to be able to programmatically remove all commas, <, >, ', and " from the Excel sheet.

There are many options to read/edit/create Excel files:

MS provides the free OpenXML SDK V 2.0 - see http://msdn.microsoft.com/en-us/library/bb448854%28office.14%29.aspx (XLSX only)

This can read and write MS Office files (including Excel).

Another free option: see http://www.codeproject.com/KB/office/OpenXML.aspx (XLSX only)

If you need more, such as handling older Excel versions (XLS, not only XLSX), rendering, creating PDFs, formulas etc., then there are different free and commercial libraries like ClosedXML (free, XLSX only), EPPlus (free, XLSX only), Aspose.Cells, SpreadsheetGear, LibXL and Flexcel.

Another option is Interop, which requires Excel to be installed locally, BUT Interop is not supported in server scenarios by MS.

In my experience, any library-based approach that deals with the Excel file directly is way faster than Interop.

I would use a combination of Microsoft.Office.Interop.Excel and XmlSerializer to get the job done.

This is in light of the fact that a) you're using a console application, and b) the interop assemblies are easy to integrate into the solution (just References -> Add).

I'm assuming that you have a copy of Excel installed on the machine running the process (you mentioned that you currently open the workbook manually, hence the assumption).

The code would look something like this:

The serializable layer:

public class TestClass
{
    public List<TestLineItem> LineItems { get; set; }

    public TestClass()
    {
        LineItems = new List<TestLineItem>();
    }
}

public class TestLineItem
{
    private string SanitizeText(string input)
    {
        return input.Replace(",", "")
            .Replace(".", "")
            .Replace("<", "")
            .Replace(">", "")
            .Replace("'", "")
            .Replace("\"", "");
    }

    private string m_field1;
    private string m_field2;

    public string Field1 
    {
        get { return m_field1; }
        set { m_field1 = SanitizeText(value); }
    }

    public string Field2 
    {
        get { return m_field2; }
        set { m_field2 = SanitizeText(value); }
    }

    public decimal Field3 { get; set; }

    public TestLineItem() { }

    public TestLineItem(object field1, object field2, object field3)
    {
        // assign through the properties so SanitizeText is applied
        Field1 = (field1 ?? "").ToString();
        Field2 = (field2 ?? "").ToString();

        if (field3 == null || field3.ToString() == "")
            Field3 = 0m;
        else
            Field3 = Convert.ToDecimal(field3.ToString());
    }
}
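
For reference, the stripping step can also be done in a single pass instead of a chain of Replace calls. This is only a sketch: the TextSanitizer class and Strip method are hypothetical names, and the character list covers just the characters named in the question (it leaves periods alone, unlike SanitizeText above). It also treats null as an empty string, which is an assumption about what you want for empty cells.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the answer's original code.
public static class TextSanitizer
{
    // Characters the question asks to remove; adjust to taste.
    private static readonly char[] Unwanted = { ',', '<', '>', '\'', '"' };

    public static string Strip(string input)
    {
        // Assumption: a null or empty cell becomes an empty string.
        if (string.IsNullOrEmpty(input))
            return string.Empty;

        var sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // Keep every character that is not in the unwanted list.
            if (Array.IndexOf(Unwanted, c) < 0)
                sb.Append(c);
        }
        return sb.ToString();
    }
}
```

Calling TextSanitizer.Strip on each cell value before storing it would replace the per-property SanitizeText calls; with 85,000 rows the single pass avoids allocating one intermediate string per Replace.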

Then open the worksheet and load it into a 2D array:

// using System; using System.IO; using System.Xml.Serialization;
// using OEXcel = Microsoft.Office.Interop.Excel;
var app = new OEXcel.Application();
var wbPath = Path.Combine(
    Environment.GetFolderPath(
        Environment.SpecialFolder.MyDocuments), "Book1.xls");

var wb = app.Workbooks.Open(wbPath);
var ws = (OEXcel.Worksheet)wb.ActiveSheet;

// there are better ways to do this... 
// this one's just off the top of my head
var rngTopLine = ws.get_Range("A1", "C1");
var rngEndLine = rngTopLine.get_End(OEXcel.XlDirection.xlDown);
var rngData = ws.get_Range(rngTopLine, rngEndLine);
var arrayData = (object[,])rngData.Value2;

var tc = new TestClass();

// since you're enumerating an array, the operation will run much faster
// than reading the worksheet line by line.
for (int i = arrayData.GetLowerBound(0); i <= arrayData.GetUpperBound(0); i++)
{
    tc.LineItems.Add(
        new TestLineItem(arrayData[i, 1], arrayData[i, 2], arrayData[i, 3]));
}

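var xs = new XmlSerializer(typeof(TestClass));
var xmlPath = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.MyDocuments),
    "Book1.xml");

// dispose the stream so the XML is flushed and the file handle released
using (var fs = File.Create(xmlPath))
{
    xs.Serialize(fs, tc);
}

wb.Close();
app.Quit();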

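The generated XML output will look something like this (the sample values here were not run through SanitizeText, so the commas, periods and quotes are still present and the reserved characters appear escaped):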

<TestClass>
  <LineItems>
    <TestLineItem>
      <Field1>test1</Field1>
      <Field2>some&lt;encoded&gt; stuff here</Field2>
      <Field3>123456.789</Field3>
    </TestLineItem>
    <TestLineItem>
      <Field1>test2</Field1>
      <Field2>testing some commas, and periods.</Field2>
      <Field3>23456789.12</Field3>
    </TestLineItem>
    <TestLineItem>
      <Field1>test3</Field1>
      <Field2>text in &quot;quotes&quot; and &#39;single quotes&#39;</Field2>
      <Field3>0</Field3>
    </TestLineItem>
  </LineItems>
</TestClass>
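
As the escaped angle brackets in the sample suggest, the XML writer takes care of reserved characters on its own, so stripping < and > may not be necessary if the only goal is well-formed XML. Here is a minimal sketch of that behaviour using XmlWriter directly; the RowXml class, the Write method and the Row/FieldN element names are made up for illustration and are not part of the code above.

```csharp
using System;
using System.Text;
using System.Xml;

// Hypothetical helper showing that XmlWriter escapes text content itself.
public static class RowXml
{
    public static string Write(string[] fields)
    {
        var sb = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (var writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("Row");
            for (int i = 0; i < fields.Length; i++)
            {
                // WriteElementString escapes <, > and & in the value.
                writer.WriteElementString("Field" + (i + 1), fields[i]);
            }
            writer.WriteEndElement();
        }

        return sb.ToString();
    }
}
```

For example, RowXml.Write(new[] { "a<b", "x, y" }) yields a Row element whose first field contains a&lt;b, while the comma in the second field passes through unchanged, since commas have no special meaning in XML.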
