
Best approach for Cartesian product of 2 large text files

I have a problem where I want to merge 2 large text files and generate a new file with the Cartesian product of the 2 input files. I know what the code would look like, but I'm not sure which language to build such a utility in. I have a Windows server and I'm familiar with C# and shell script.

Note: File1 can be around 20 MB and File2 can contain around 6000 records. So what I want to achieve is to copy the 20 MB of data 6000 times into the new file (roughly 20 MB × 6000 ≈ 120 GB of output).

Below are smaller examples of what my files would look like.

File1

Head-A-AA-AAA
Child-A1-AA1-AAA1
Child-A2-AA2-AAA2
Child-A3-AA3-AAA3
Head-B-BB-BBB
Child-B1-BB1-BBB1
Child-B2-BB2-BBB2
Child-B3-BB3-BBB3

File2

Store1
Store2
Store3

Expected output file

Store1
Head-A-AA-AAA
Child-A1-AA1-AAA1
Child-A2-AA2-AAA2
Child-A3-AA3-AAA3
Head-B-BB-BBB
Child-B1-BB1-BBB1
Child-B2-BB2-BBB2
Child-B3-BB3-BBB3
Store2
Head-A-AA-AAA
Child-A1-AA1-AAA1
Child-A2-AA2-AAA2
Child-A3-AA3-AAA3
Head-B-BB-BBB
Child-B1-BB1-BBB1
Child-B2-BB2-BBB2
Child-B3-BB3-BBB3
Store3
Head-A-AA-AAA
Child-A1-AA1-AAA1
Child-A2-AA2-AAA2
Child-A3-AA3-AAA3
Head-B-BB-BBB
Child-B1-BB1-BBB1
Child-B2-BB2-BBB2
Child-B3-BB3-BBB3

Looking for suggestions: will C# code with a Windows service serve the purpose, or do I need to use some other tool/utility/script?

EDIT: I created the C# code below, but it's taking hours to generate the 150 GB output file. I'm looking for a faster way. I'm taking the content from file 1 and copying it for each record in the second file.

FileInfo[] fi;
List<FileInfo> TodaysFiles = new List<FileInfo>();
string PublishId;
DirectoryInfo di = new DirectoryInfo(@"\\InputPath");

fi = di.GetFiles().Where(file => file.FullName.Contains("TRANSMIT_MASS")).ToArray();

foreach (FileInfo f in fi)
{
    string[] tokens = f.Name.Split('_');
    if (tokens[2] == DateTime.Now.AddDays(1).ToString("MMddyyyy"))
    {
        PublishId = tokens[0];
        string MACSFile = @"\\OutputPath\" + PublishId + ".txt";
        string path = f.FullName;

        string StoreFile = di.GetFiles().Where(file => file.Name.StartsWith(PublishId) && file.Name.Contains("SUBS")).Single().FullName;

        using (FileStream fs = File.Open(StoreFile, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        using (BufferedStream bs = new BufferedStream(fs))
        using (StreamReader sr = new StreamReader(bs))
        using (StreamWriter outfile = new StreamWriter(MACSFile))
        {
            string StoreNumber;
            while ((StoreNumber = sr.ReadLine()) != null)
            {
                Console.WriteLine(StoreNumber);
                if (StoreNumber.Length > 5)
                {
                    // The 20 MB profile file is reopened and re-read once per
                    // store number -- this inner read is the bottleneck.
                    using (FileStream fsProfile = File.Open(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
                    using (BufferedStream bsProfile = new BufferedStream(fsProfile))
                    using (StreamReader srProfile = new StreamReader(bsProfile))
                    {
                        outfile.WriteLine(srProfile.ReadToEnd().TrimEnd());
                    }
                }
            }
        }
    }
}
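A quick observation on the code above: the inner using block reopens and re-reads the whole 20 MB profile file once per store number. A minimal change, sketched here on the inner loop only, is to read the profile once before the loop:

// Read the profile file once, outside the store loop.
string profile = File.ReadAllText(path).TrimEnd();

string StoreNumber;
while ((StoreNumber = sr.ReadLine()) != null)
{
    if (StoreNumber.Length > 5)
    {
        // Write the cached content instead of reopening the file each time.
        outfile.WriteLine(profile);
    }
}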

You mention shell script. Here's a working shell example:

while read -r line; do
  echo "$line" >> Output
  cat File1 >> Output
done < File2

Here the lines of File2 are looped over and each is written, along with the entirety of File1, into an arbitrary output file Output.

Easily run by saving it in a local file something.sh and running sh something.sh.

We could further optimise the code for performance, at the cost of memory, or refactor it to make it cleaner.

File1 : 6000 lines

File2 : 20 MB

(Note: this answer swaps the names relative to the question; here File1 is the small store list and File2 is the 20 MB content file.)

As File1, the smaller file, contains just a few thousand lines, we can read the entire file into memory and loop over it:

foreach (string line in File.ReadAllLines(File1))

If you still have memory capacity, you can read the entire second file into memory as well:

var file2 = File.ReadAllText(File2);

Now all you have to do is append everything to a third file, which we will not hold in memory because of its size.

So the entire code will be:

var file2 = File.ReadAllText(File2);
var destinationFile = "destination/file/path";

foreach (string line in File.ReadAllLines(File1))
{
    // ReadAllLines strips the newline, so add it back after the store line.
    File.AppendAllText(destinationFile, line + Environment.NewLine);
    File.AppendAllText(destinationFile, file2);
}

Further optimisation (skipped above to keep the code simple):

File.AppendAllText is called twice because we don't want to do line + file2 in code; concatenating them would allocate more memory.

To optimise this further you can use a StringBuilder and load file2 into it:

var file2 = new StringBuilder(File.ReadAllText(File2));

And mutate it. This should avoid the two separate calls to File.AppendAllText and give more performance.
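One possible reading of that suggestion, as a minimal sketch (the file paths are placeholders, and whether this actually beats the two-call version should be measured):

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // Hold the big file in a mutable buffer.
        var file2 = new StringBuilder(File.ReadAllText("File2"));
        var destinationFile = "destination/file/path";

        foreach (string line in File.ReadAllLines("File1"))
        {
            string prefix = line + Environment.NewLine;
            file2.Insert(0, prefix);                        // prepend the store line
            File.AppendAllText(destinationFile, file2.ToString());
            file2.Remove(0, prefix.Length);                 // undo for the next iteration
        }
    }
}

Note that ToString() still materialises a new string on every iteration, so the gain may be modest; the real win in all of these variants is reading the 20 MB file only once.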

It is difficult to reduce I/O time. You can try reading and writing in large portions (this is more efficient because each I/O operation requires the OS to allocate and release resources). So if you read everything, aggregate the result in memory, and write to the file in large chunks, you will spend less time on I/O. The higher speed here comes from in-memory operations, because RAM and processor operations are very fast compared with I/O operations.

  1. File 1 is small: read it once and keep the results in memory.
  2. File 2 is large: read it in chunks. For example, you can call streamReader.ReadLine() N times.
  3. Combine the in-memory data of the first file with each chunk of the second one, in parallel if possible.
  4. Output: open/close the stream only once, and write after each chunk is processed (a sketch follows this list).
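A rough sketch of this recipe (step 3's parallelism is omitted; the paths and the 1 MB buffer size are placeholder choices, and since the 20 MB file fits comfortably in memory it is treated as a single chunk):

using System;
using System.IO;
using System.Text;

class ChunkedCartesian
{
    static void Main()
    {
        // Step 1: the small store list is read once and kept in memory.
        string[] stores = File.ReadAllLines("stores.txt");

        // Step 2, collapsed: the 20 MB file fits in memory, so it is read once
        // as raw bytes and never decoded again.
        byte[] content = File.ReadAllBytes("content.txt");
        byte[] newline = Encoding.UTF8.GetBytes(Environment.NewLine);

        // Step 4: open the output stream once, write large blocks, close it once.
        using (var output = new FileStream("output.txt", FileMode.Create, FileAccess.Write,
                                           FileShare.None, bufferSize: 1 << 20))
        {
            foreach (string store in stores)
            {
                byte[] storeBytes = Encoding.UTF8.GetBytes(store);
                output.Write(storeBytes, 0, storeBytes.Length);
                output.Write(newline, 0, newline.Length);
                output.Write(content, 0, content.Length);   // one large block per store
            }
        }
    }
}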

PS: there is no need for buffered streams here, because file streams are already buffered internally. Buffered streams are useful for network I/O operations.
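For example, the BufferedStream wrappers in the question's code can be dropped and the FileStream's own buffer sized directly (a sketch; the path and buffer size are placeholders):

using (var fs = new FileStream(@"\\InputPath\profile.txt", FileMode.Open,
                               FileAccess.Read, FileShare.ReadWrite,
                               bufferSize: 1 << 16))
using (var sr = new StreamReader(fs))
{
    // FileStream buffers internally, so no BufferedStream is needed.
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        // process the line
    }
}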
