Flat file normalization with a dynamic number of columns

I have a flat file with an unfortunately dynamic column structure. One of the values belongs to a hierarchy, and each tier in the hierarchy gets its own column. For example, my flat file might resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Tier3ObjectId|Status
1234|7890|abcd|efgh|ijkl|mnop|Pending
...

The same feed the next day may resemble this:

StatisticID|FileId|Tier0ObjectId|Tier1ObjectId|Tier2ObjectId|Status
1234|7890|abcd|efgh|ijkl|Complete
...

The thing is, I don't care much about all the tiers; I only care about the ID of the last (bottom) tier and the other row data that is not part of the tier columns. I need to normalize the feed to something resembling this to load into a relational database:

StatisticID|FileId|ObjectId|Status
1234|7890|ijkl|Complete
...

What would be an efficient, easy-to-read mechanism for determining the last tier's object ID and organizing the data as described? Every attempt I've made feels kludgy to me.

Some things I've done:

  • I have tried matching the column names against a regular expression, identifying the tiered columns, ordering them by name descending, and selecting the first one... but I lose the ordinal column number this way, so that didn't look good.
  • I have placed the columns I want into an IDictionary<string, int> object to reference, but reliably collecting the ordinals of the dynamic columns is again an issue, and it seems this would perform poorly.

I ran into a similar problem a few years ago. I used a Dictionary to map the columns; it was not pretty, but it worked. The key is the column's position in the normalized output, and the value is its ordinal in the incoming header.

First, build the Dictionary:

private Dictionary<int, int> GetColumnDictionary(string headerLine)
    {
        Dictionary<int, int> columnDictionary = new Dictionary<int, int>();
        List<string> columnNames = headerLine.Split('|').ToList();

        string maxTierObjectColumnName = GetMaxTierObjectColumnName(columnNames);
        for (int index = 0; index < columnNames.Count; index++)
        {
            if (columnNames[index] == "StatisticID")
            {
                columnDictionary.Add(0, index);
            }

            if (columnNames[index] == "FileId")
            {
                columnDictionary.Add(1, index);
            }

            if (columnNames[index] == maxTierObjectColumnName)
            {
                columnDictionary.Add(2, index);
            }

            if (columnNames[index] == "Status")
            {
                columnDictionary.Add(3, index);
            }
        }

        return columnDictionary;
    }

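    private string GetMaxTierObjectColumnName(List<string> columnNames)
    {
        // Order by the numeric tier so that Tier10ObjectId sorts after Tier9ObjectId
        // (a plain string sort would not). Assumes names of the form Tier<N>ObjectId.
        var tierPattern = new System.Text.RegularExpressions.Regex(@"^Tier(\d+)ObjectId$");

        return columnNames
            .Select(c => tierPattern.Match(c))
            .Where(m => m.Success)
            .OrderBy(m => int.Parse(m.Groups[1].Value))
            .Last()
            .Value;
    }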

After that, it is simply a matter of reading through the file:

private List<DataObject> ParseFile(string fileName)
{
    var dataObjects = new List<DataObject>();

    // Dispose the reader (and the underlying file handle) when done.
    using (var streamReader = new StreamReader(fileName))
    {
        // The first line is the header; it tells us where each wanted column sits in this feed.
        string headerLine = streamReader.ReadLine();
        Dictionary<int, int> columnDictionary = this.GetColumnDictionary(headerLine);

        string line;
        while ((line = streamReader.ReadLine()) != null)
        {
            var lineValues = line.Split('|');

            dataObjects.Add(
                new DataObject()
                {
                    StatisticId = lineValues[columnDictionary[0]],
                    FileId = lineValues[columnDictionary[1]],
                    ObjectId = lineValues[columnDictionary[2]],
                    Status = lineValues[columnDictionary[3]]
                }
            );
        }
    }

    return dataObjects;
}

I hope this helps (even a little bit).
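One way to keep the ordinal while searching the header is sketched below. It is only an illustration under assumptions: tier columns are assumed to be named exactly Tier<N>ObjectId, and HeaderMap and LastTierIndex are hypothetical names that do not appear in the answer above. The indexed overload of Select carries each column's position along with its name, so the highest tier can be chosen numerically without losing where it sits.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderMap
{
    // Assumption: tier columns are named exactly "Tier<N>ObjectId".
    private static readonly Regex TierPattern = new Regex(@"^Tier(\d+)ObjectId$");

    // Returns the zero-based ordinal of the highest-numbered tier column.
    public static int LastTierIndex(string headerLine)
    {
        return headerLine.Split('|')
            // Pair every column name with its position in the header.
            .Select((name, index) => new { Match = TierPattern.Match(name), Index = index })
            .Where(c => c.Match.Success)
            // Sort by the numeric tier, not by the column name.
            .OrderBy(c => int.Parse(c.Match.Groups[1].Value))
            .Last()   // throws InvalidOperationException if there are no tier columns
            .Index;
    }
}
```

For the two sample headers in the question this gives 5 on the first day and 4 on the second, which can then be used to index each split row.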

Personally, I would not try to reformat your file. I think the easiest approach would be to parse each row from the front and the back. For example:

itemArray = getMyItems();
statisticId = itemArray[0];
fileId = itemArray[1];
//and so on for the rest of your pre-tier columns

//Then get the second-to-last column, which will be the last tier
lastTierId = itemArray[itemArray.length - 2];

Since you know the last tier will always be second from the end, you can start at the end and work your way toward the front. This seems like it would be much easier than trying to reformat the data file.

If you really want to create a new file, you could use this approach to get the data you want to write out.

I don't know C# syntax, but something along these lines:

  1. split the line into parts with | as the separator
  2. get parts [0], [1], [length - 2] and [length - 1]
  3. pass the parts to the database handling code
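
The steps above might look roughly like this in C#. This is a sketch, not a definitive implementation: it assumes the tier columns always sit immediately before a single trailing Status column and that values never contain the | delimiter. FeedNormalizer and NormalizeLine are hypothetical names, and the result is returned as a normalized line rather than passed to any particular database code.

```csharp
using System;

public static class FeedNormalizer
{
    // Turns one raw data row into "StatisticID|FileId|ObjectId|Status".
    public static string NormalizeLine(string line)
    {
        string[] parts = line.Split('|');

        // Two leading columns, at least one tier column, and the trailing Status.
        if (parts.Length < 4)
        {
            throw new FormatException(
                "Expected at least 4 columns but found " + parts.Length + ".");
        }

        string statisticId = parts[0];
        string fileId = parts[1];
        string objectId = parts[parts.Length - 2]; // last tier: second from the end
        string status = parts[parts.Length - 1];

        return string.Join("|", statisticId, fileId, objectId, status);
    }
}
```

Applied to the sample rows from the question, this yields 1234|7890|mnop|Pending for the first day and 1234|7890|ijkl|Complete for the second. The header line should be skipped and written separately, since the same logic applied to it would emit the tier column's own name instead of ObjectId.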
