Parsing and uploading >1GB of data in C#
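I've written a program to parse a large amount of data and upload it to a database. The problem is that the parsing is far too slow. My program has a Parser class that parses each file (using parallelization) and raises an event for every entry parsed in each file: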
Parallel.ForEach<FileInfo>(
files,
new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
(inputFile, args) =>
{
// Using underlying FileStream to allow concurrent Read/Write access.
using (var input = new StreamReader(inputFile.FullName))
{
while (!input.EndOfStream)
{
RaiseEntryParsed(ParseCity(input.ReadLine()));
}
ParsedFiles++;
RaiseFileParsed(inputFile);
}
});
RaiseDirectoryParsed(Directory);
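The "main" program subscribes to this event and adds the entries to a DataTable in order to run a SqlBulkCopy; the SqlBulkCopy only commits when the parser class raises the FileParsed event (that is, each time a file has been parsed):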
using (SqlBulkCopy bulkCopy = new SqlBulkCopy(_connectionString))
{
DataTable cityTable = DataContext.CreateCityDataTable();
parser.EntryParsed +=
(s, e) =>
{
DataRow cityRow = cityTable.NewRow();
City parsedCity = (City)e.DatabaseEntry;
cityRow["id"] = parsedCity.Id;
...
...
cityTable.Rows.Add(cityRow);
};
parser.FileParsed +=
(s, e) =>
{
bulkCopy.WriteToServer(cityTable);
Dispatcher.BeginInvoke((Action)UpdateProgress);
cityTable.Rows.Clear();
};
parser.DirectoryParsed +=
(s, e) =>
{
bulkCopy.WriteToServer(cityTable);
Dispatcher.BeginInvoke((Action)UpdateProgress);
};
parser.BeginParsing();
}
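The reason for clearing the table rows after each commit is to save memory and prevent an OutOfMemoryException caused by having so many entities in memory...
How can I make this faster? It is currently unacceptably slow. I profiled the application, and it showed that most of the time is spent in the EntryParsed event handler. Thanks.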
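I made a short test project and tried a few different approaches. My goal was to build, as quickly as possible and using only sequential code, a DataTable with 27 columns (id, A, B, C, ..., Z) and about 300,000 rows (NumOfRows).
(Each row is filled with an id, and the remaining columns are filled with random 5-letter words.)
On my fourth attempt, I stumbled across a different syntax for adding a row to a table, based on an array of values of type Object. (See here.)
In your case, it would look something like this: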
cityTable.Rows.Add( new Object[] {
((City)e.DatabaseEntry).Id ,
ObjectThatGoesInColumn2 ,
ObjectThatGoesInColumn3 ,
ObjectThatGoesInLastColumn
} );
instead of:
DataRow row = cityTable.NewRow();
row[0] = 100;
row["City Name"] = "Anaheim";
row["Column 7"] = ...
...
row["Column 26"] = checksum;
cityTable.Rows.Add( row );
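This should give you a speed-up, because you are no longer setting each column individually, and judging by your profiler screenshot you have at least 12 columns that are set one at a time.
It also avoids hashing each column-name string to find the array position it refers to and then double-checking that the data type is correct.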
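As a minimal sketch of the values-array form, the snippet below builds a small table and adds a row in a single call. The class name, method names and the three columns are hypothetical and chosen only for illustration; the one real requirement it shows is that the values must appear in the same order as the table's columns. Whether this is faster in a given application is something to confirm by measurement.

```csharp
using System;
using System.Data;

public static class RowAddSketch
{
    // Builds a small table. The column names are hypothetical.
    public static DataTable BuildCityTable()
    {
        var table = new DataTable("City");
        table.Columns.Add("id", typeof(int));
        table.Columns.Add("name", typeof(string));
        table.Columns.Add("population", typeof(int));
        return table;
    }

    // Adds one row from a values array. The values must follow the
    // column order: id, name, population.
    public static DataRow AddCity(DataTable table, int id, string name, int population)
    {
        return table.Rows.Add(new object[] { id, name, population });
    }
}
```

Calling `RowAddSketch.AddCity(table, 1, "Anaheim", 336000)` returns the newly added DataRow, so individual values can still be read back by column name afterwards.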
In case you are interested, here is my test project:
using System;
using System.Data;
using System.Diagnostics;
class Program
{
public static System.Data.DataSet dataSet;
public static System.Data.DataSet dataSet2;
public static System.Data.DataSet dataSet3;
public static System.Data.DataSet dataSet4;
public static Random rand = new Random();
public static int NumOfRows = 300000;
static void Main(string[] args)
{
#region test1
Console.WriteLine("Starting");
Console.WriteLine("");
Stopwatch watch = new Stopwatch();
watch.Start();
MakeTable();
watch.Stop();
Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
dataSet = null;
Console.WriteLine("");
Console.WriteLine("Completed.");
Console.WriteLine("");
#endregion
/*
#region test2
Console.WriteLine("Starting Test 2");
Console.WriteLine("");
watch.Reset();
watch.Start();
MakeTable2();
watch.Stop();
Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
dataSet2 = null;
Console.WriteLine("");
Console.WriteLine("Completed Test 2.");
#endregion
#region test3
Console.WriteLine("");
Console.WriteLine("Starting Test 3");
Console.WriteLine("");
watch.Reset();
watch.Start();
MakeTable3();
watch.Stop();
Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
dataSet3 = null;
Console.WriteLine("");
Console.WriteLine("Completed Test 3.");
#endregion
*/
#region test4
Console.WriteLine("Starting Test 4");
Console.WriteLine("");
watch.Reset();
watch.Start();
MakeTable4();
watch.Stop();
Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
dataSet4 = null;
Console.WriteLine("");
Console.WriteLine("Completed Test 4.");
#endregion
//printTable();
Console.WriteLine("");
Console.WriteLine("Press Enter to Exit...");
Console.ReadLine();
}
private static void MakeTable()
{
DataTable table = new DataTable("Table 1");
DataColumn column;
DataRow row;
column = new DataColumn();
column.DataType = System.Type.GetType("System.Int32");
column.ColumnName = "id";
column.ReadOnly = true;
column.Unique = true;
table.Columns.Add(column);
for (int i = 65; i <= 90; i++)
{
column = new DataColumn();
column.DataType = System.Type.GetType("System.String");
column.ColumnName = "5-Letter Word " + (char)i;
column.AutoIncrement = false;
column.Caption = "Random Word " + (char)i;
column.ReadOnly = false;
column.Unique = false;
// Add the column to the table.
table.Columns.Add(column);
}
DataColumn[] PrimaryKeyColumns = new DataColumn[1];
PrimaryKeyColumns[0] = table.Columns["id"];
table.PrimaryKey = PrimaryKeyColumns;
// Instantiate the DataSet variable.
dataSet = new DataSet();
// Add the new DataTable to the DataSet.
dataSet.Tables.Add(table);
// Create NumOfRows new DataRow objects, setting each column
// by name, and add them to the DataTable
for (int i = 0; i < NumOfRows; i++)
{
row = table.NewRow();
row["id"] = i;
for (int j = 65; j <= 90; j++)
{
row["5-Letter Word " + (char)j] = getRandomWord();
}
table.Rows.Add(row);
}
}
private static void MakeTable2()
{
DataTable table = new DataTable("Table 2");
DataColumn column;
DataRow row;
column = new DataColumn();
column.DataType = System.Type.GetType("System.Int32");
column.ColumnName = "id";
column.ReadOnly = true;
column.Unique = true;
table.Columns.Add(column);
for (int i = 65; i <= 90; i++)
{
column = new DataColumn();
column.DataType = System.Type.GetType("System.String");
column.ColumnName = "5-Letter Word " + (char)i;
column.AutoIncrement = false;
column.Caption = "Random Word " + (char)i;
column.ReadOnly = false;
column.Unique = false;
// Add the column to the table.
table.Columns.Add(column);
}
DataColumn[] PrimaryKeyColumns = new DataColumn[1];
PrimaryKeyColumns[0] = table.Columns["id"];
table.PrimaryKey = PrimaryKeyColumns;
// Instantiate the DataSet variable.
dataSet2 = new DataSet();
// Add the new DataTable to the DataSet.
dataSet2.Tables.Add(table);
// Create NumOfRows new DataRow objects, wrapping the column
// assignments in BeginEdit/EndEdit, and add them to the DataTable
for (int i = 0; i < NumOfRows; i++)
{
row = table.NewRow();
row.BeginEdit();
row["id"] = i;
for (int j = 65; j <= 90; j++)
{
row["5-Letter Word " + (char)j] = getRandomWord();
}
row.EndEdit();
table.Rows.Add(row);
}
}
private static void MakeTable3()
{
DataTable table = new DataTable("Table 3");
DataColumn column;
column = new DataColumn();
column.DataType = System.Type.GetType("System.Int32");
column.ColumnName = "id";
column.ReadOnly = true;
column.Unique = true;
table.Columns.Add(column);
for (int i = 65; i <= 90; i++)
{
column = new DataColumn();
column.DataType = System.Type.GetType("System.String");
column.ColumnName = "5-Letter Word " + (char)i;
column.AutoIncrement = false;
column.Caption = "Random Word " + (char)i;
column.ReadOnly = false;
column.Unique = false;
// Add the column to the table.
table.Columns.Add(column);
}
DataColumn[] PrimaryKeyColumns = new DataColumn[1];
PrimaryKeyColumns[0] = table.Columns["id"];
table.PrimaryKey = PrimaryKeyColumns;
// Instantiate the DataSet variable.
dataSet3 = new DataSet();
// Add the new DataTable to the DataSet.
dataSet3.Tables.Add(table);
DataRow[] newRows = new DataRow[NumOfRows];
for (int i = 0; i < NumOfRows; i++)
{
newRows[i] = table.NewRow();
}
// Fill the pre-allocated DataRow objects and add
// them to the DataTable
for (int i = 0; i < NumOfRows; i++)
{
newRows[i]["id"] = i;
for (int j = 65; j <= 90; j++)
{
newRows[i]["5-Letter Word " + (char)j] = getRandomWord();
}
table.Rows.Add(newRows[i]);
}
}
private static void MakeTable4()
{
DataTable table = new DataTable("Table 4");
DataColumn column;
column = new DataColumn();
column.DataType = System.Type.GetType("System.Int32");
column.ColumnName = "id";
column.ReadOnly = true;
column.Unique = true;
table.Columns.Add(column);
for (int i = 65; i <= 90; i++)
{
column = new DataColumn();
column.DataType = System.Type.GetType("System.String");
column.ColumnName = "5-Letter Word " + (char)i;
column.AutoIncrement = false;
column.Caption = "Random Word " + (char)i;
column.ReadOnly = false;
column.Unique = false;
// Add the column to the table.
table.Columns.Add(column);
}
DataColumn[] PrimaryKeyColumns = new DataColumn[1];
PrimaryKeyColumns[0] = table.Columns["id"];
table.PrimaryKey = PrimaryKeyColumns;
// Instantiate the DataSet variable.
dataSet4 = new DataSet();
// Add the new DataTable to the DataSet.
dataSet4.Tables.Add(table);
// Add NumOfRows rows to the DataTable, passing each row
// as a single array of values in column order
for (int i = 0; i < NumOfRows; i++)
{
table.Rows.Add(
new Object[] {
i,
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord(),
getRandomWord()
}
);
}
}
private static string getRandomWord()
{
// Random.Next's upper bound is exclusive, so 91 is needed to include 'Z' (90).
char c0 = (char)rand.Next(65, 91);
char c1 = (char)rand.Next(65, 91);
char c2 = (char)rand.Next(65, 91);
char c3 = (char)rand.Next(65, 91);
char c4 = (char)rand.Next(65, 91);
return "" + c0 + c1 + c2 + c3 + c4;
}
private static void printTable()
{
foreach (DataRow row in dataSet.Tables[0].Rows)
{
Console.WriteLine( row["id"] + "--" + row["5-Letter Word A"] + " - " + row["5-Letter Word Z"] );
}
}
}
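I haven't really looked at your parallelism, but there are a couple of things.
First, change "ParsedFiles++;" to "Interlocked.Increment(ref ParsedFiles);", or protect it with a lock.
Second, instead of the complicated event-driven parallelism, I would suggest the pipeline pattern, which is very well suited to this situation.
Use a concurrent queue (or a blocking collection) from the concurrent collections to hold the stages.
The first stage holds the list of files to process.
Worker tasks take a file from that work list, parse it, and then add the result to the second stage.
In the second stage, a worker task takes items from the second-stage queue (the DataTable blocks that have just been completed) and uploads them to the database as soon as they are ready.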
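For the first point, the sketch below shows why the atomic increment matters: many parallel iterations bump a shared counter, and `Interlocked.Increment` ensures no update is lost. The class and method names are hypothetical, and the local `parsedFiles` variable merely stands in for the question's `ParsedFiles` counter.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class CounterSketch
{
    // Increments a shared counter from many parallel iterations and
    // returns the final value.
    public static int CountInParallel(int iterations)
    {
        int parsedFiles = 0; // stands in for the ParsedFiles counter

        Parallel.For(0, iterations, _ =>
        {
            // Atomic read-modify-write; a plain parsedFiles++ here
            // could lose updates when iterations run concurrently.
            Interlocked.Increment(ref parsedFiles);
        });

        return parsedFiles;
    }
}
```

With the atomic increment, `CounterSketch.CountInParallel(n)` returns exactly `n`; with a plain `++` the result could come out lower on a multi-core machine.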
Edit:
I wrote a pipelined version of the code that should help you get started:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;
using System.Data;
namespace dataTableTesting2
{
class Program
{
private const int BufferSize = 20; //Each buffer can only contain this many elements at a time
//This limits the total amount of memory
private const int MaxBlockSize = 100; //const members are implicitly static in C#
private static BlockingCollection<string> buffer1 = new BlockingCollection<string>(BufferSize);
private static BlockingCollection<string[]> buffer2 = new BlockingCollection<string[]>(BufferSize);
private static BlockingCollection<Object[][]> buffer3 = new BlockingCollection<Object[][]>(BufferSize);
/// <summary>
/// Start Pipelines and wait for them to finish.
/// </summary>
static void Main(string[] args)
{
TaskFactory f = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);
Task stage0 = f.StartNew(() => PopulateFilesList(buffer1));
Task stage1 = f.StartNew(() => ReadFiles(buffer1, buffer2));
Task stage2 = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
Task stage3 = f.StartNew(() => UploadBlocks(buffer3) );
Task.WaitAll(stage0, stage1, stage2, stage3);
/*
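// Note for more workers on particular stages you can make more tasks for each stage, like the following
// which populates the file list in 1 task, reads the files into string[] blocks in 1 task,
// then parses the string[] blocks in 4 concurrent tasks
// and lastly uploads the info in 2 tasks.
// Caveat: each stage method calls CompleteAdding() in its finally block, so with several
// workers on one stage the first worker to finish would close the buffer for the others.
// That call would have to move out of the workers and run once, after all of them finish.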
TaskFactory f = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);
Task stage0 = f.StartNew(() => PopulateFilesList(buffer1));
Task stage1 = f.StartNew(() => ReadFiles(buffer1, buffer2));
Task stage2a = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
Task stage2b = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
Task stage2c = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
Task stage2d = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
Task stage3a = f.StartNew(() => UploadBlocks(buffer3) );
Task stage3b = f.StartNew(() => UploadBlocks(buffer3) );
Task.WaitAll(stage0, stage1, stage2a, stage2b, stage2c, stage2d, stage3a, stage3b);
*/
}
/// <summary>
/// Adds the filenames to process into the first pipeline
/// </summary>
/// <param name="output"></param>
private static void PopulateFilesList( BlockingCollection<string> output )
{
try
{
// Add to the 'output' parameter rather than the static field directly.
output.Add("file1.txt");
output.Add("file2.txt");
//...
output.Add("lastFile.txt");
}
finally
{
output.CompleteAdding();
}
}
/// <summary>
/// Takes filenames out of the first pipeline, reads them into string[] blocks, and puts them in the second pipeline
/// </summary>
private static void ReadFiles( BlockingCollection<string> input, BlockingCollection<string[]> output)
{
try
{
foreach (string file in input.GetConsumingEnumerable())
{
List<string> list = new List<string>(MaxBlockSize);
using (StreamReader sr = new StreamReader(file))
{
int countLines = 0;
while (!sr.EndOfStream)
{
list.Add( sr.ReadLine() );
countLines++;
if (countLines >= MaxBlockSize) // emit a block once it holds MaxBlockSize lines
{
output.Add(list.ToArray());
countLines = 0;
list = new List<string>(MaxBlockSize);
}
}
if (list.Count > 0)
{
output.Add(list.ToArray());
}
}
}
}
finally
{
output.CompleteAdding();
}
}
/// <summary>
/// Takes string[] blocks from the second pipeline, for each line, splits them by tabs, and parses
/// the data, storing each line as an object array into the third pipeline.
/// </summary>
private static void ParseStringBlocks( BlockingCollection<string[]> input, BlockingCollection< Object[][] > output)
{
try
{
foreach (string[] block in input.GetConsumingEnumerable())
{
// Start a fresh list for every block so rows from earlier blocks are not sent again.
List<Object[]> result = new List<object[]>(block.Length);
foreach (string line in block)
{
string[] splitLine = line.Split('\t'); //split line on tab
string cityName = splitLine[0];
int cityPop = Int32.Parse( splitLine[1] );
int cityElevation = Int32.Parse(splitLine[2]);
//...
result.Add(new Object[] { cityName, cityPop, cityElevation });
}
output.Add( result.ToArray() );
}
}
finally
{
output.CompleteAdding();
}
}
/// <summary>
/// Takes the data blocks from the third pipeline, and uploads each row to SQL Database
/// </summary>
private static void UploadBlocks(BlockingCollection<Object[][]> input)
{
/*
* At this point 'block' is an array of object arrays.
*
* The block contains MaxBlockSize number of cities.
*
* There is one object array for each city.
*
* The object array for the city is in the pre-defined order from pipeline stage2
*
* You could do a couple of things at this point:
*
* 1. declare and initialize a DataTable with the correct column types
* then, do the dataTable.Rows.Add( rowValues )
* then, use a Bulk Copy Operation to upload the dataTable to SQL
* http://msdn.microsoft.com/en-us/library/7ek5da1a
*
* 2. Manually perform the sql commands/transactions similar to what
* Kevin recommends in this suggestion:
* http://stackoverflow.com/questions/1024123/sql-insert-one-row-or-multiple-rows-data/1024195#1024195
*
* I've demonstrated the first approach with this code.
*
* */
DataTable dataTable = new DataTable();
//set up columns of dataTable here.
foreach (Object[][] block in input.GetConsumingEnumerable())
{
foreach (Object[] rowValues in block)
{
dataTable.Rows.Add(rowValues);
}
//do bulkCopy to upload table containing MaxBlockSize number of cities right here.
dataTable.Rows.Clear(); //Remove the rows when you are done uploading, but not the dataTable.
}
}
}
}
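It splits the work into 4 parts, each of which can be done by a different task:
Make the list of files to process.
Take files from that list and read them into string[] blocks.
Take the string[] blocks from the previous part and parse them, creating object[] arrays that hold the values for each row of the table.
Upload the rows to the database.
It is also easy to assign more than one task to each stage, so that several workers can perform the same pipeline stage if needed.
(I doubt that having more than one task read from files would help unless you are using a solid-state drive, since jumping around on a spinning disk is quite slow.)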
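One detail to watch when several workers feed the same buffer is that `CompleteAdding()` should run only once, after every producer has finished. The sketch below shows one way to arrange that with `Task.WhenAll`; the class name, method name and the integer payload are hypothetical, and the single consumer loop stands in for the next pipeline stage.

```csharp
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public static class MultiWorkerStageSketch
{
    // Runs several producers into one bounded buffer and returns how
    // many items the consumer received.
    public static int Run(int producerCount, int itemsPerProducer)
    {
        using (var buffer = new BlockingCollection<int>(boundedCapacity: 20))
        {
            Task[] producers = Enumerable.Range(0, producerCount)
                .Select(p => Task.Run(() =>
                {
                    for (int i = 0; i < itemsPerProducer; i++)
                        buffer.Add(i); // blocks while the buffer is full
                }))
                .ToArray();

            // Close the buffer exactly once, after all producers are done.
            Task closer = Task.WhenAll(producers)
                .ContinueWith(_ => buffer.CompleteAdding());

            int consumed = 0;
            foreach (int item in buffer.GetConsumingEnumerable())
                consumed++; // the loop ends once the buffer is closed and empty

            closer.Wait();
            return consumed;
        }
    }
}
```

If each producer called `CompleteAdding()` itself, the first one to finish would close the buffer and the remaining producers' `Add` calls would throw.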
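In addition, this lets you put a limit on the amount of data held in memory while the program runs.
Each buffer is a BlockingCollection initialized with a maximum size, which means that if the buffer is full and another task tries to add an element, that task is blocked.
Fortunately, the Task Parallel Library is smart about this: if a task is blocked, it will schedule a different task that is not blocked, and check back later to see whether the first task has become unblocked.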
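The bounded behaviour can be observed without actually blocking a thread by using `TryAdd`, which returns immediately instead of waiting. The sketch below is only an illustration: the class name, method name and the capacity of two are hypothetical.

```csharp
using System.Collections.Concurrent;

public static class BoundedBufferSketch
{
    // Probes a buffer bounded at two items: once while it is full,
    // and once after an item has been taken out.
    public static void Probe(out bool addedWhileFull, out bool addedAfterTake)
    {
        using (var buffer = new BlockingCollection<string>(boundedCapacity: 2))
        {
            buffer.Add("file1.txt");
            buffer.Add("file2.txt");

            // Add() would block here until space frees up;
            // TryAdd() reports the full buffer instead of waiting.
            addedWhileFull = buffer.TryAdd("file3.txt");

            buffer.Take(); // a consumer removes one item

            addedAfterTake = buffer.TryAdd("file3.txt");
        }
    }
}
```

In the pipeline, the stages call `Add`, so a full downstream buffer simply pauses the upstream stage until a consumer catches up.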
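Currently each buffer can hold only 20 items, and each item contains at most 100 entries, which means:
buffer1 will hold at most 20 filenames at any time.
buffer2 will hold at most 20 blocks of strings (each made up of 100 lines) from those files at any time.
buffer3 will hold at most 20 blocks of data (the object values for 100 cities) at any time.
So this needs enough memory to hold 20 filenames, 2,000 lines from files, and 2,000 city records (plus a little extra for local variables and so on).
You will probably want to increase BufferSize and MaxBlockSize for efficiency, although this should work as is.
Note that I haven't tested this, because I don't have any input files, so there may be some bugs.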
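While I agree with some of the other comments and answers, have you tried calling:
cityTable.BeginLoadData()
before the first item is added to the city table?
And then calling:
cityTable.EndLoadData()
in the FileParsed event handler.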
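A table-level way to suspend notifications, index maintenance and constraint checks while rows are being loaded is the `DataTable.BeginLoadData()` / `EndLoadData()` pair, typically used together with `LoadDataRow`. The sketch below assumes a hypothetical two-column city table and placeholder values; it shows only the call pattern and makes no claim about how much time it saves in the original program.

```csharp
using System.Data;

public static class LoadDataSketch
{
    // Loads rowCount rows inside a BeginLoadData/EndLoadData pair and
    // returns the number of rows in the table afterwards.
    public static int LoadRows(int rowCount)
    {
        var cityTable = new DataTable("City");   // hypothetical schema
        cityTable.Columns.Add("id", typeof(int));
        cityTable.Columns.Add("name", typeof(string));

        cityTable.BeginLoadData();
        try
        {
            for (int i = 0; i < rowCount; i++)
            {
                // Values follow column order; 'false' leaves the rows in the
                // Added state, as Rows.Add would.
                cityTable.LoadDataRow(new object[] { i, "city" + i }, false);
            }
        }
        finally
        {
            // Re-enables notifications, index maintenance and constraints.
            cityTable.EndLoadData();
        }

        return cityTable.Rows.Count;
    }
}
```

In the question's code, the `BeginLoadData()` call would go before parsing starts and `EndLoadData()` just before each `WriteToServer` call.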
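If you are looking for raw performance, wouldn't something like this be the best option? It bypasses the DataTable code completely, which seems like an unnecessary step.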
void BulkInsertFile(string fileName, string tableName)
{
FileInfo info = new FileInfo(fileName);
string name = info.Name;
string shareDirectory = ""; //the path of the share: \\servername\shareName\
string serverDirectory = ""; //the local path of the share on the server: C:\shareName\
File.Copy(fileName, shareDirectory + name);
// or you could call your method to parse the file and write it to the share directory.
using (SqlConnection cnn = new SqlConnection("connectionString"))
{
cnn.Open();
using (SqlCommand cmd = cnn.CreateCommand())
{
cmd.CommandText = string.Format("bulk insert {0} from '{1}' with (fieldterminator = ',', rowterminator = '\n')", tableName, serverDirectory + name);
try
{
cmd.ExecuteNonQuery(); // BULK INSERT returns no result set
}
catch (SqlException ex)
{
MessageBox.Show(ex.Message);
}
}
}
}
Here is some information about the bulk insert command.