
Join multiple CSV files into a single table using SSIS or SQL

SSIS beginner here, working on a project that tracks how programming languages have performed on GitHub over the last decade. I have 3 datasets:

  1. Repos
  2. Pull Requests
  3. Issues

The Issues dataset looks like this:

[Image: sample of the Issues dataset]

and here is an example of the Repos dataset:

[Image: sample of the Repos dataset]

My first issue: I don't need the quarter. Second, and what brought me here: how do I merge these flat files into a single table and sum the data so that it looks like this:

Language | Nbr of repos | Nbr of Pull requests | Year

Help would be much appreciated. I've tried many ways to figure it out but haven't been able to achieve what I'm looking for.

1 - If you don't need data, don't use it. Preferably, don't bring it into the data flow pipeline at all, but even if it's there, that doesn't mean you need to use it.

I would advocate landing each file/tab/dataset into a matching table in your database. That lets you verify you've landed exactly the data that was provided. Once that's done, write a custom query to join the assorted tables together. Something like:

SELECT
    R.language
,   R.num_repos AS [Nbr of repos]
,   SUM(PR.[count]) AS [Nbr of Pull requests]
,   PR.year
FROM
    dbo.repos AS R
    -- Assuming inner is accurate
    INNER JOIN
        dbo.PullRequests AS PR
        ON PR.name = R.language
GROUP BY
    R.language
,   R.num_repos
,   PR.year;

The SUM and GROUP BY reflect the fact that you likely want total counts per year; since the data appears to be reported quarterly, that rolls the four quarters up into one row. Or it could be that this is semi-additive data, in which case you need the last available quarter for a given name and count (say language monkeyBrainz got renamed to LaconicApe in Q3 2021, so the last entry for MB is in Q3 2021 and LA starts in Q4 2021).
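To make the difference between the two readings concrete, here is a small Python sketch. The rows, the `quarter` field, and the function names are made up for illustration; only the idea (sum the quarters versus keep the latest quarter) comes from the paragraph above.

```python
from collections import defaultdict

# Hypothetical quarterly pull-request rows: (name, year, quarter, count).
ROWS = [
    ("C++", 2020, 1, 10),
    ("C++", 2020, 2, 12),
    ("C++", 2020, 3, 9),
    ("C++", 2020, 4, 15),
    ("Go", 2020, 3, 4),
    ("Go", 2020, 4, 6),
]


def yearly_sum(rows):
    """Additive reading: total the quarterly counts per (name, year)."""
    totals = defaultdict(int)
    for name, year, _quarter, count in rows:
        totals[(name, year)] += count
    return dict(totals)


def yearly_last_quarter(rows):
    """Semi-additive reading: keep the count of the latest quarter per (name, year)."""
    latest = {}
    for name, year, quarter, count in rows:
        key = (name, year)
        if key not in latest or quarter > latest[key][0]:
            latest[key] = (quarter, count)
    return {key: count for key, (_quarter, count) in latest.items()}
```

If the counts are per-quarter activity, `yearly_sum` is the one that matches the query above. If each quarter already reports a running total, `yearly_last_quarter` is closer to what you want.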

You can then use that custom query as a source in SSIS to generate your extract (to CSV, to Excel, to Apooche Flerken - it'll be huge in 2033).
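If you want to try the land-then-query approach without a SQL Server instance, the following Python sketch does the same thing against an in-memory SQLite database. SQLite is only a stand-in here, the CSV contents and the `quarter` column are invented, and the table name `pull_requests` is my own; the other column names are taken from the query above.

```python
import csv
import io
import sqlite3

# Hypothetical file contents; real files would be opened from disk instead.
REPOS_CSV = "language,num_repos\nC++,100\nGo,40\n"
PULLS_CSV = "name,year,quarter,count\nC++,2020,1,10\nC++,2020,2,12\nGo,2020,4,6\n"


def land_and_query(repos_csv, pulls_csv):
    """Land each CSV in its own table, then join and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE repos (language TEXT, num_repos INTEGER)")
    conn.execute(
        "CREATE TABLE pull_requests "
        '(name TEXT, year INTEGER, quarter INTEGER, "count" INTEGER)'
    )

    repos = [
        (r["language"], int(r["num_repos"]))
        for r in csv.DictReader(io.StringIO(repos_csv))
    ]
    pulls = [
        (r["name"], int(r["year"]), int(r["quarter"]), int(r["count"]))
        for r in csv.DictReader(io.StringIO(pulls_csv))
    ]
    conn.executemany("INSERT INTO repos VALUES (?, ?)", repos)
    conn.executemany("INSERT INTO pull_requests VALUES (?, ?, ?, ?)", pulls)

    # Same shape as the T-SQL query above, with an ORDER BY for stable output.
    rows = conn.execute(
        """
        SELECT R.language, R.num_repos, SUM(PR."count"), PR.year
        FROM repos AS R
        INNER JOIN pull_requests AS PR ON PR.name = R.language
        GROUP BY R.language, R.num_repos, PR.year
        ORDER BY R.language, PR.year
        """
    ).fetchall()
    conn.close()
    return rows
```

Calling `land_and_query(REPOS_CSV, PULLS_CSV)` returns one row per language and year, with the quarterly counts summed.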

SSIS route

Bringing data together has a few options in SSIS: Merge Join, Lookup, Script Task.

Merge Join requires both inputs to be sorted on the join key(s). In this case, Name/Language needs a Sort transformation after each source component (flat file/Excel?). The sorted outputs then get fed into the Merge Join.
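The reason sorting matters is that a merge join walks both inputs in step and never looks back. A rough Python sketch of that idea follows; it is not how SSIS implements the component, and it assumes each language appears only once on the repos side, as in your Repos dataset.

```python
def merge_join(left, right):
    """Inner join two lists of (key, value) pairs on key.

    Assumes keys on the right side are unique (one row per language).
    """
    # These two sorts play the role of the Sort transformations.
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])

    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1  # left row has no partner, skip it
        elif lkey > rkey:
            j += 1  # right row has no more partners, move on
        else:
            out.append((lkey, left[i][1], right[j][1]))
            i += 1  # stay on the right row: more left rows may share its key
    return out
```

With pull requests on the left and repos on the right, every pull-request row for a language picks up that language's repo count, and languages missing from either side drop out, as with an inner join.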

If you have all 4 quarters in the pull requests and you want total stats for the year, you'll likely want an Aggregate transformation between the source and the Sort to SUM the count by name and year. If the data is semi-additive, use it instead to find the last quarter for each name and year.

A non-sorting method would involve two data flows and the Lookup component. A lookup can make, at most, one match per row. Going by the supplied screenshots, I would assume that Pull Requests is the "driver" table, as C++ will be in there for every period from its "year of inception" to the present, whereas your repos dataset would only have C++ in there once. So we will augment each row from pull requests with a few fields from the repos dataset.

The only hiccup here is that the Lookup only supports data from an OLE DB provider or the SSIS Cache Connection Manager. Again, assuming your data is Flat File/Excel, we'll need to use a Cache Connection Manager.

Data Flow 1 will "prime the pump" with our Repo data. Your Data Flow will be Source -> Cache Connection Transformation.

Define the Key in your cache connection manager as the Language.

[Image: Cache Connection Manager]

In Data Flow 2, you'll use a Lookup Transformation with the Cache Connection Manager as its source. Map Name to Language, then tick the check boxes next to Language and num_repos to bring them into your pipeline.
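Conceptually, the two data flows amount to building a keyed cache and then probing it once per row. The Python sketch below is only an analogy: the dictionary stands in for the Cache Connection Manager, the function names are hypothetical, and collecting unmatched rows in a separate list is a simplification, since how the real Lookup treats rows without a match depends on how you configure it.

```python
def prime_cache(repo_rows):
    """Data Flow 1: index the repo rows by language, the cache key."""
    return {row["language"]: row for row in repo_rows}


def lookup(pull_rows, cache):
    """Data Flow 2: add language and num_repos to each matching pull-request row."""
    matched, unmatched = [], []
    for row in pull_rows:
        hit = cache.get(row["name"])  # at most one match per input row
        if hit is None:
            unmatched.append(row)
        else:
            matched.append(
                {**row, "language": hit["language"], "num_repos": hit["num_repos"]}
            )
    return matched, unmatched
```

Because the cache is keyed on language, each pull-request row can pick up one repo row at most, which is why the repos dataset (one row per language) is the side that goes into the cache.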

Finally, write the results to wherever you need them.
