
FTP using Talend, get only most recent file?

I have a Talend job that needs to pull down an XML file from an sFTP server so it can be processed into an Oracle database. The date of the XML extraction is in the file name, for example "FileNameHere_Outbound_201407092215.xml", which I believe is yyyyMMddhhmm formatting. The beginning portion, "FileNameHere", is the same for all the files. I need to be able to read the date from the end of the file name and pull only the most recent file down from the server to be processed.

I am not sure how to do this with FTP. I've previously used tFileList to order the items by date descending, but that is not an option with FTP. I know it probably involves some Java to pull the date portion out of the file name, but I'm not very Java-literate. I can manage with a bit of assistance, though.

Does anyone have any insight on how to download only the most recent file from an FTP server?

There's a tFTPFileList component on the palette. That should give you a list of all the files on the FTP location. From here you then want to parse out the time stamp, which could be done with a regular expression or alternatively by substringing it, depending on which you feel more comfortable with.

Then it's just a case of sorting by the extracted time stamp, which gives you the newest file name so you can go and fetch that specific file.
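As a rough illustration of the substring route, here is a minimal plain-Java sketch that runs outside Talend. The class and method names are made up for this example, and it assumes every file name ends in a 12-digit time stamp followed by ".xml", as in the question. Because the stamp is fixed-width and zero-padded, comparing stamps as plain strings gives the same order as comparing them as dates.

```java
import java.util.Arrays;
import java.util.List;

class NewestBySubstring {
    // Assumes the name ends in a 12-digit stamp followed by ".xml".
    static String stamp(String fileName) {
        int end = fileName.length() - ".xml".length();
        return fileName.substring(end - 12, end);
    }

    // Fixed-width, zero-padded stamps sort chronologically as strings.
    static String newest(List<String> fileNames) {
        String best = null;
        for (String name : fileNames) {
            if (best == null || stamp(name).compareTo(stamp(best)) > 0) {
                best = name;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "FileNameHere_Outbound_201407082215.xml",
            "FileNameHere_Outbound_201407092215.xml",
            "FileNameHere_Outbound_201407090930.xml");
        System.out.println(newest(names));
    }
}
```

The substring version breaks if a file with a different naming scheme turns up in the directory, which is one reason to prefer the regular expression used further down.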

Here's an outline of an overly laborious way to get this done, but it works. You should be able to tweak this easily yourself too:

[Image: example job layout]

In the above job design I've gone for a tFileList rather than a tFTPFileList because I don't have an example FTP location to play with for testing here. The premise stays the same, although this would be pointless with a real tFileList because it can already sort by modified date (among other options).

We start off by running the tFileList/tFTPFileList component to iterate through all the files in the location (you can apply a file mask here too, to limit what is returned). We then read this iteratively into a tFixedFlowInput component, which allows us to retrieve the values from the globalMap as the tFileList/tFTPFileList iterates through each file:

[Image: using the tFixedFlowInput component to retrieve values from the globalMap]

I've listed everything that the tFileList provides (you can see the options by pressing ctrl+space), but you only really need the file name and potentially the file path or file directory. From here we then throw everything into a buffer with a tBufferOutput component so that we can gather the row from every iteration.

Once the tFileList/tFTPFileList has iterated through every file in the directory, it triggers the next subjob with an OnSubjobOk link, where we start by reading the completed buffer back in with a tBufferInput component. At this point I've started scattering tLogRow components throughout the flow so I can better visualise the data at each step.

After this we use a tExtractRegexFields component to extract the date time stamp from the file name:

[Image: configuration of the tExtractRegexFields component]

Here, I am using the following regex "^.+?_Outbound_([0-9]{12})\\.xml$" to capture the date time stamp. It relies on the file name being a combination of any characters, followed by the string literal _Outbound_, then the date time stamp that we want to capture (represented by 12 numeric characters), and finally .xml.
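To see what the capture group picks up, the pattern can be tried in plain Java outside Talend. This is only a sketch: the class and method names are invented for the example, and the pattern is written as a Java string literal in which the literal dot before xml is escaped as `\\.`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StampRegex {
    // Anything, then "_Outbound_", then 12 digits (captured), then ".xml".
    static final Pattern STAMP =
        Pattern.compile("^.+?_Outbound_([0-9]{12})\\.xml$");

    // Returns the captured 12-digit stamp, or null when the name does not match.
    static String extract(String fileName) {
        Matcher m = STAMP.matcher(fileName);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Matches: prints the 12 digits captured by group 1.
        System.out.println(extract("FileNameHere_Outbound_201407092215.xml"));
        // Does not match ("_Inbound_"): prints null.
        System.out.println(extract("FileNameHere_Inbound_201407092215.xml"));
    }
}
```

Names that do not follow the expected scheme simply fail to match, so they can be filtered out before sorting.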

We also add a column to our schema to accommodate the captured date time stamp, like so:

[Image: schema of the tExtractRegexFields component]

As the extra column is a date time stamp of the form yyyyMMddHHmm (HH rather than hh, since the hour of 22 in the example is on a 24-hour clock), we can specify this pattern directly here and use the column as a date object from then on.
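For reference, this is roughly what parsing such a stamp looks like in plain Java. It is a sketch using `java.time` rather than whatever Talend does internally with the schema's date pattern, and it assumes a 24-hour clock (pattern letter HH), which the hour of 22 in the example file name suggests. The class name is made up for the example.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class StampParse {
    // HH is hour-of-day (0-23); hh would be the 12-hour clock hour.
    static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    static LocalDateTime parse(String stamp) {
        return LocalDateTime.parse(stamp, FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime t = parse("201407092215");
        System.out.println(t); // prints 2014-07-09T22:15
    }
}
```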

From here we simply sort by the extracted date time stamp column in descending order and then use a tSampleRow to take only the first row of the flow of data, as per the guidelines on the component configuration.
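The sort-then-take-first step corresponds to something like the following in plain Java. This is a sketch, not what Talend generates: `Row` is a hypothetical stand-in for one row of the flow, and the stamp is kept as a string here because fixed-width yyyyMMddHHmm values order the same way as the dates they represent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortAndSample {
    // Hypothetical row: the file name plus its extracted time stamp.
    static final class Row {
        final String fileName;
        final String stamp;

        Row(String fileName, String stamp) {
            this.fileName = fileName;
            this.stamp = stamp;
        }
    }

    // Sort a copy by stamp descending, then keep only the first row.
    static Row newest(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((Row r) -> r.stamp).reversed());
        return sorted.isEmpty() ? null : sorted.get(0);
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("FileNameHere_Outbound_201407082215.xml", "201407082215"),
            new Row("FileNameHere_Outbound_201407092215.xml", "201407092215"));
        System.out.println(newest(rows).fileName);
    }
}
```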

To finish this job you would then output the target file path to the globalMap (either in a tJavaRow or using a tFlowToIterate, which will do this for you automatically) and then use the file path stored in the globalMap in the tFTPGet's file mask setting:
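The hand-off through the globalMap can be pictured with the sketch below. It is only an approximation under stated assumptions: a plain `HashMap` stands in for Talend's globalMap, and the key "newestFile" is a hypothetical name chosen for this example (a tFlowToIterate would generate its own key from the row and column names).

```java
import java.util.HashMap;
import java.util.Map;

class GlobalMapSketch {
    // Plain HashMap standing in for Talend's globalMap (assumption).
    static final Map<String, Object> globalMap = new HashMap<>();

    // Roughly what a tJavaRow would do with the single remaining row.
    static void storeNewest(String fileName) {
        globalMap.put("newestFile", fileName); // hypothetical key
    }

    // The kind of expression the file mask field would then hold.
    static String fileMask() {
        return (String) globalMap.get("newestFile");
    }

    public static void main(String[] args) {
        storeNewest("FileNameHere_Outbound_201407092215.xml");
        System.out.println(fileMask());
    }
}
```

The cast to `String` is needed because the map stores its values as `Object`.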

[Image: putting the data into the globalMap with a tFlowToIterate, and the tFTPGet configuration]
