Parsing a plain-text table

Question

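I am trying to parse a table that is in plain-text format. The program is written in C# in Visual Studio. I need to parse the table and insert the data into a database.

Below is a sample of the table I will be reading: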

ID    Name          Value1        Value2         Value3       Value4  //header
1     nameA         3.0           0.2            2            6.2
2     nameB
3     nameC         2.9                          3.0          7.3
4     nameD         1.5           3.0            1.8          1.1
5     nameE
6     nameF      1.2        2.4          3.3           2.5
7     nameG      3.0        3.2          2.1           4.5
8     nameH                 88           12.4          28.9

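In the sample, I need to capture the data for IDs 1, 3, 4, 6, 7 and 8.

I have thought of two ways to solve this, but neither works 100% of the time.

Method 1:

By reading the header, I can get the start index of each column. I would then use Substring to collect the data for each row.

Problem: once I get past a certain row (and I won't know when that happens), the columns shift and Substring no longer collects the correct data.

This method only collects correct data for IDs 1, 3 and 4.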
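To make Method 1 concrete, here is a minimal sketch of deriving the column starts from the header and slicing rows with them. The class name HeaderColumns, the Pad helper and the two width arrays are made up for illustration (they only rebuild rows shaped like the sample); they are not part of the original program.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderColumns
{
    // Start index of every non-whitespace run in the header line.
    public static int[] ColumnStarts(string header) =>
        Regex.Matches(header, @"\S+").Cast<Match>().Select(m => m.Index).ToArray();

    // Cut one row at the header's start indexes; a blank or missing cell becomes null.
    public static string[] Slice(string line, int[] starts)
    {
        var cells = new string[starts.Length];
        for (int i = 0; i < starts.Length; i++)
        {
            if (starts[i] >= line.Length) continue; // short row: leave the cell null
            int end = i + 1 < starts.Length ? Math.Min(starts[i + 1], line.Length) : line.Length;
            string cell = line.Substring(starts[i], end - starts[i]).Trim();
            cells[i] = cell.Length == 0 ? null : cell;
        }
        return cells;
    }

    // Hypothetical helper: builds a fixed-width line from cells and column widths.
    public static string Pad(int[] widths, params string[] cells) =>
        string.Concat(cells.Select((c, i) => c.PadRight(widths[i])));

    public static void Main()
    {
        int[] firstLayout = { 6, 14, 14, 15, 13, 0 };   // assumed widths before the shift
        int[] shiftedLayout = { 6, 11, 11, 13, 14, 0 }; // assumed widths after the shift

        string header = Pad(firstLayout, "ID", "Name", "Value1", "Value2", "Value3", "Value4");
        int[] starts = ColumnStarts(header);

        string row3 = Pad(firstLayout, "3", "nameC", "2.9", "", "3.0", "7.3");
        string row6 = Pad(shiftedLayout, "6", "nameF", "1.2", "2.4", "3.3", "2.5");

        foreach (string row in new[] { row3, row6 })
            Console.WriteLine(string.Join(" | ", Slice(row, starts).Select(c => c ?? "(blank)")));
    }
}
```

With these assumed widths, row 3 slices cleanly and keeps its blank Value2, while row 6 comes out with 1.2 glued onto the name and the remaining values one column too far left, which is the failure described above.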

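Method 2:

Use a Regex to collect all the matches. I was hoping it would collect ID, Name, Value1, Value2, Value3 and Value4 in that order.

My pattern is (\\d*?)\\s\\s\\s+(.*?)\\s\\s\\s+(\\d*\\.*\\d*)\\s\\s\\s+(\\d*\\.*\\d*)\\s\\s\\s+(\\d*\\.*\\d*)\\s\\s\\s+(\\d*\\.*\\d*)

Problem: for some rows the collected data is shifted to the left. For example, for ID 3, Value2 should be blank, but the regex reads Value2 = 3.0, Value3 = 7.3 and Value4 = blank. The same happens for ID 8.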
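The left shift can be reproduced with something much simpler than the pattern above. This sketch uses a plain split on runs of two or more whitespace characters as a stand-in (an assumption, not the pattern from the question) to show why any approach that treats a whitespace run as a single separator loses empty cells. The class name SplitDemo is made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
    // Treat every run of 2+ whitespace characters as one separator.
    public static string[] SplitOnGaps(string line) =>
        Regex.Split(line.Trim(), @"\s{2,}");

    public static void Main()
    {
        // Row with ID 3 from the sample: Value2 is blank.
        string row3 = "3     nameC         2.9                          3.0          7.3";
        string[] cells = SplitOnGaps(row3);

        // Only 5 cells come back for a 6-column table, so 3.0 lands in the
        // fourth slot (Value2) and 7.3 in the fifth (Value3).
        Console.WriteLine(cells.Length);
        Console.WriteLine(string.Join(" | ", cells));
    }
}
```

The empty cell and the gaps on either side of it merge into one long run of spaces, so the width of the run has to be taken into account to tell one separator from a separator plus a blank cell.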

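Question:

How can I read the whole table and parse it correctly, given that

(1) I don't know at which row the values will start to shift, and

(2) I don't know how many cells they will shift by, or whether the shift is consistent?

Additional information

The table is in a PDF file, and I convert the PDF to a text file so that I can read the data. The shifted data occurs when a table spans multiple pages, but not consistently.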

Edit

Here is some of the actual data:

68                        BENZYL ALCOHOL                               6.0                            0.4           1                  7.4

91                        EVERNIA PRUNASTRI (OAK MOSS)                 34                             3             3                  10

22                        test                                                                        2323          23                 12

Answer 1

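How about treating this as a fixed-length file, in which you define each column by index and length? Once the fixed-length columns are defined, just use Substring to get each column's value and Trim to clean it up.

You can wrap all of this in a Linq statement that projects to an anonymous type and filters on the IDs you need.

Like this: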

using System;
using System.IO;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // IDs of the rows to keep
        int[] select = new int[] { 1, 3, 4, 6, 7, 8 };
        string[] lines = File.ReadAllLines("TextFile1.txt");

        // Skip the header and any blank lines (Int32.Parse would throw on them),
        // then cut each line into columns given as (start index, length)
        var q = lines.Skip(1).Where(l => !String.IsNullOrWhiteSpace(l)).Select(l => new {
            Id = Int32.Parse(GetValue(l, 0, 6)),
            Name = GetValue(l, 6, 11),
            Value1 = GetValue(l, 17, 11),
            Value2 = GetValue(l, 28, 13),
            Value3 = GetValue(l, 41, 14),
            Value4 = GetValue(l, 55, 13),
        }).Where(o => select.Contains(o.Id));

        var r = q.ToArray();
    }

    static string GetValue(string line, int index, int length)
    {
        string value = null;
        int lineLength = line.Length;

        // Take as much of the line as we can up to column length
        if (lineLength > index)
            value = line.Substring(index, Math.Min(length, lineLength - index)).Trim();

        // Return null if we just have whitespace
        return String.IsNullOrWhiteSpace(value) ? null : value;
    }
}

Answer 2

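OK, here you go! Use this regex pattern:

Note: you must match it against one line at a time, not against the whole document! If you want to run it over the whole document, you have to add the "multiline" modifier (m). You can do that by adding (?m) to the start of the regex pattern.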
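A small sketch of what the (?m) modifier changes, using a deliberately tiny pattern (^\d+, which only grabs the leading ID) rather than the full row pattern. The class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MultilineDemo
{
    // Counts leading IDs in a whole document, with or without the inline (?m) modifier.
    public static int CountIds(string document, bool multiline)
    {
        string pattern = (multiline ? "(?m)" : "") + @"^\d+";
        return Regex.Matches(document, pattern).Count;
    }

    public static void Main()
    {
        string document = "1     nameA\n2     nameB\n3     nameC";

        // Without (?m), ^ only matches at the very start of the input.
        Console.WriteLine(CountIds(document, false)); // 1

        // With (?m), ^ matches at the start of every line.
        Console.WriteLine(CountIds(document, true));  // 3
    }
}
```

Passing RegexOptions.Multiline to the Regex constructor has the same effect as the inline (?m). One thing to watch in .NET: in multiline mode $ matches just before \n, so if the text file has Windows line endings the trailing \r stays on the line, and a pattern ending in $ may need \r?$ instead.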

Edit:

You have provided some lines of actual data. Here is my updated regex pattern:

^(?<id>\d+)(?:\s{2,25})(?<name>.+?)(?:\s{2,45})(?<val1>\d+(?:\.\d+)?)?(?:\s{2,33})(?<val2>\d+(?:\.\d+)?)?(?:\s{2,14})(?<val3>\d+(?:\.\d+)?)?(?:\s{2,19})(?<val4>\d+(?:\.\d+)?)?$
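For reference, this is roughly how the named groups in the pattern above could be read from C#. It is a sketch under assumptions: the RowRegex class, the Build helper and its column widths are invented for the example (they only rebuild lines shaped like the real data), and the pattern is applied to one line at a time, as the note says.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RowRegex
{
    // Pattern copied verbatim from the answer.
    private static readonly Regex Row = new Regex(
        @"^(?<id>\d+)(?:\s{2,25})(?<name>.+?)(?:\s{2,45})(?<val1>\d+(?:\.\d+)?)?(?:\s{2,33})(?<val2>\d+(?:\.\d+)?)?(?:\s{2,14})(?<val3>\d+(?:\.\d+)?)?(?:\s{2,19})(?<val4>\d+(?:\.\d+)?)?$");

    private static readonly string[] Names = { "id", "name", "val1", "val2", "val3", "val4" };

    // Returns id, name, val1..val4 for one line; an empty cell is null.
    // Returns null when the line does not match at all.
    public static string[] Parse(string line)
    {
        Match m = Row.Match(line);
        if (!m.Success) return null;
        return Names.Select(n => m.Groups[n].Success ? m.Groups[n].Value : null).ToArray();
    }

    // Hypothetical helper: fixed-width line with assumed column widths.
    public static string Build(params string[] cells)
    {
        int[] widths = { 26, 45, 31, 14, 19, 0 };
        return string.Concat(cells.Select((c, i) => c.PadRight(widths[i])));
    }

    public static void Main()
    {
        string[] lines =
        {
            Build("68", "BENZYL ALCOHOL", "6.0", "0.4", "1", "7.4"),
            Build("91", "EVERNIA PRUNASTRI (OAK MOSS)", "34", "3", "3", "10"),
            Build("22", "test", "", "2323", "23", "12"),
        };

        foreach (string line in lines)
        {
            string[] cells = Parse(line);
            Console.WriteLine(string.Join(" | ", cells.Select(c => c ?? "(blank)")));
        }
    }
}
```

With these widths, the third line keeps val1 blank and puts 2323 in val2. That works because the upper bounds in \s{2,45} and \s{2,33} are smaller than the run of spaces left by an empty cell, so one separator cannot swallow the whole gap. The bounds are tied to this particular layout and would need adjusting if the spacing changes.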


Solution 1
1 2014-07-07 17:25:11

Solution 2
1 accepted 2014-07-07 17:26:01
