
talend: newline character in middle of csv column

I am fetching data with the tSoap component, and the result comes back as XML containing comma-separated values: columns are separated by commas and rows by '\n'.

After that, I use the tExtractXMLField component to extract the data from the response.

But the data contains '\n' inside quoted strings, and each one is treated as the start of a new row. I tried using the tReplace component with a regex to remove the \n inside the quotes, but the data is too large and the job fails with a StackOverflowError.

I also tried the tNormalize component with the CSV option to separate the rows, but the problem persists.

Can you please help me with this? Thanks in advance.

The response I get from the SOAP request is:

  <env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">
  <env:Header/>
  <env:Body>
  <ns2:getReportResultCsvResponse xmlns:ns2="http://service.admin.ws.five9.com/">
  <return>TIMESTAMP,CALL ID,NOTES
  "Mon, 17 Apr 2017 10:05:38",4223519,
  "Mon, 17 Apr 2017 10:05:40",4223520,
  "Mon, 17 Apr 2017 10:05:41",4223521,"Alexandria.. Monday -- 55 partial Bal -- 224 May 1 Visa"
  "Mon, 17 Apr 2017 10:05:42",4223522,
  "Mon, 17 Apr 2017 10:05:43",4223523,
  "Mon, 17 Apr 2017 10:11:04",4223524,
  "Mon, 17 Apr 2017 10:05:43",4223524,
  "Mon, 17 Apr 2017 10:05:45",4223525,</return>
  </ns2:getReportResultCsvResponse>
  </env:Body>
  </env:Envelope>

As you can see, the "notes" column contains '\n' between the quotes, which breaks the data extraction. Can you please tell me how I can resolve this issue?

In fact, your file is a CSV file embedded in an XML file.
Because the "notes" field is enclosed in double quotes, one solution is to transform the file into pure CSV; then, with the appropriate "CSV options" setting, the "\n" problem disappears automagically.

The job looks like this: tFileInputFullRow -> tMap -> tFileOutputDelimited, followed by a second subjob that starts with tFileInputDelimited.

tFileInputFullRow reads the input file as is, into a single field named "line" by default. Just set Header to 4 and Footer to 3 to skip most of the XML part (assuming the file structure is always the same).
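To make the effect of the Header and Footer settings concrete, here is a minimal plain-Java sketch of the same idea outside Talend. The class and method names (EnvelopeTrimmer, dropHeaderAndFooter) are hypothetical and only illustrate what skipping 4 leading and 3 trailing lines leaves behind; this is not how the component is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Talend API): imitates Header=4 / Footer=3.
final class EnvelopeTrimmer {

    // Returns the lines that remain after skipping `header` leading
    // and `footer` trailing lines; returns an empty list if too short.
    static List<String> dropHeaderAndFooter(List<String> lines, int header, int footer) {
        int from = Math.min(header, lines.size());
        int to = Math.max(from, lines.size() - footer);
        return new ArrayList<>(lines.subList(from, to));
    }

    public static void main(String[] args) {
        // Shortened version of the SOAP response from the question.
        List<String> lines = Arrays.asList(
            "<env:Envelope xmlns:env=\"http://schemas.xmlsoap.org/soap/envelope/\">",
            "<env:Header/>",
            "<env:Body>",
            "<ns2:getReportResultCsvResponse xmlns:ns2=\"http://service.admin.ws.five9.com/\">",
            "<return>TIMESTAMP,CALL ID,NOTES",
            "\"Mon, 17 Apr 2017 10:05:38\",4223519,",
            "\"Mon, 17 Apr 2017 10:05:45\",4223525,</return>",
            "</ns2:getReportResultCsvResponse>",
            "</env:Body>",
            "</env:Envelope>");

        // Only the CSV lines survive; the <return> tags are still attached.
        for (String line : dropHeaderAndFooter(lines, 4, 3)) {
            System.out.println(line);
        }
    }
}
```

The surviving first and last lines still carry the opening and closing "return" tags, which is why a further cleanup step is needed.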

Pass the result to a tMap just to remove the remaining XML "return" tag, which the previous step leaves in place because it is not on a separate line.
The tMap uses replaceAll to remove this tag.
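The exact tMap expression is not reproduced in the text, so the following is only a plausible sketch of such a replaceAll call. The regex "</?return>" and the class name ReturnTagStripper are assumptions; inside a tMap the equivalent expression would be something like row1.line.replaceAll("</?return>", ""), where row1 is assumed to be the name of the incoming flow.

```java
// Hypothetical standalone version of the tMap expression.
final class ReturnTagStripper {

    // Removes both <return> and </return> wherever they occur in the line.
    static String stripReturnTag(String line) {
        return line.replaceAll("</?return>", "");
    }

    public static void main(String[] args) {
        String first = "<return>TIMESTAMP,CALL ID,NOTES";
        String last = "\"Mon, 17 Apr 2017 10:05:45\",4223525,</return>";
        System.out.println(stripReturnTag(first));
        System.out.println(stripReturnTag(last));
    }
}
```

Because the pattern is applied line by line and contains no nested repetition, it should not run into the StackOverflowError that a single regex over the whole payload can trigger.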

After the tMap, write the flow to a pure CSV file using tFileOutputDelimited. Leave all options at their proposed default values.

Now start a second subjob with tFileInputDelimited to read the CSV file. Define the schema with the 3 columns "Timestamp", "CallId" and "Notes". Set the field separator to "," and, for the magic, tick "CSV options"; nothing else is needed.
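Why the CSV option helps can be illustrated with a small quote-aware parser. This is a minimal sketch of the general technique (a line break only ends a record when it occurs outside double quotes), not Talend's actual implementation. The class name QuotedCsvSketch is hypothetical, and the positions of the line breaks inside the sample note are assumed, since the original layout of that field is not known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of quote-aware CSV parsing.
final class QuotedCsvSketch {

    // Splits text into records and fields; commas and line breaks
    // inside double quotes are kept as part of the field value.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a quoted field is a literal quote.
                    if (i + 1 < text.length() && text.charAt(i + 1) == '"') {
                        field.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    field.append(c); // includes '\n' and ','
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                record.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                record.add(field.toString());
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else if (c != '\r') {
                field.append(c);
            }
        }
        // Flush the last record when the text has no trailing newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Line breaks inside the note are assumed for illustration.
        String csv = "TIMESTAMP,CALL ID,NOTES\n"
            + "\"Mon, 17 Apr 2017 10:05:40\",4223520,\n"
            + "\"Mon, 17 Apr 2017 10:05:41\",4223521,"
            + "\"Alexandria..\nMonday -- 55 partial\nBal -- 224\nMay 1 Visa\"\n";

        for (List<String> rec : parse(csv)) {
            System.out.println(rec.size() + " fields: " + rec);
        }
    }
}
```

A reader that splits on every line break would see six lines here, while the quote-aware pass yields three records of three fields each, with the multi-line note kept intact in the last one.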

To display only the record with "\n" in the "notes" field, I set Header to 3 and Limit to 1 (which is why there is just 1 row after the tFileInputDelimited).

In the result, the "notes" field is spread over 4 lines, as expected, because of the "\n" characters.

Regards,
TRF
