
How to sort large amounts of inconsistent data?

For my programming class, I have been given 103MB of CSV files to work with. With a mid-sized file having 40 000 lines, I estimate the total to be around 300 000 lines. I am trying to sort the data by line and convert each line into an object; however, one field is incredibly inconsistent.


Generally, the entire line is:

[Station Number], [Acronym of the Parameter Description], [Parameter Description], [Date (MM/DD/YYYY)], [Time], etc...

The parameter description, however, is what is inconsistent. It contains different combinations of words, and the formatting isn't even the same. Sometimes things are abbreviated, and sometimes there is 1 space in between and sometimes there are 10.

Here are some examples of the parameter description field:

(chemical), (filtered/unfiltered) TOTAL

(chemical) TOTAL, (filtered/unfiltered)

CONDUCTIVITY, 25C

STREAM CONDITION

(chemical), DISSOLVED (inorganic/organic)

Also... sometimes after the chemical there is "UNFILTERED REACTIVE" and sometimes there is "UNFIL.REA"


Please help, as I have no idea how to go about organizing the parameter description field. These are just some of the variations I have found in 6 000 lines, and I can hardly look over 300 000 lines to see what each one contains.

Also, if it helps, this is Ontario Water Stream Quality Data and I am coding in Java (pseudocode is OK, though).

If Parameter Description is the only field that contains commas, it should be possible to split the line using the comma as delimiter and extract the other fields, working from the beginning and from the end respectively. (The first two and the last x resulting strings will each correspond to one field.)

The remaining strings make up the Parameter Description field, which can be put back together. Don't forget to restore the commas...
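The split-and-rejoin idea above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, the number of trailing fields (the "x" above) is passed in because the question does not say how many fields follow the description, and the sample line in `main` is made up apart from the "CONDUCTIVITY, 25C" description quoted in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DescriptionSplitter {

    /**
     * Splits one line on commas and treats everything between the first
     * `leading` fields and the last `trailing` fields as one description.
     * Assumes the description is the only field that may contain commas.
     */
    static List<String> splitLine(String line, int leading, int trailing) {
        // A limit of -1 keeps empty strings at the end, so empty trailing fields are not lost.
        String[] parts = line.split(",", -1);
        if (parts.length < leading + trailing + 1) {
            throw new IllegalArgumentException("Too few fields: " + line);
        }

        List<String> fields = new ArrayList<>();
        for (int i = 0; i < leading; i++) {
            fields.add(parts[i].trim());
        }

        // Rejoin the middle pieces, restoring the commas that split() removed.
        int descriptionEnd = parts.length - trailing;
        String description =
                String.join(",", Arrays.copyOfRange(parts, leading, descriptionEnd));
        fields.add(description.trim());

        for (int i = descriptionEnd; i < parts.length; i++) {
            fields.add(parts[i].trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up line: 2 leading fields, a description with a comma, 2 trailing fields.
        String line = "0001, COND25, CONDUCTIVITY, 25C, 01/15/2004, 10:30";
        for (String field : splitLine(line, 2, 2)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

If the real files quote fields or contain commas in other columns, this simple split would not be enough and a proper CSV parser would be the safer choice.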

I have a different approach for this, which I have also applied in one of my projects.

It may take less time to sort the whole content than the current approach.

Also, with KarlP's answer you would need some extra operations on some extra columns, which may take more computation time than usual.


My Way

Create a method that removes all spaces, commas, and any other special characters. You may strip the digits too, because they do not play a useful role when you are sorting on a description-like field.

The result will be something like this:

// fn_getOnlyText removes all characters that play no role in alphabetical sorting in your case

   fn_getOnlyText("Parameter Description");

Then sort on this new field only. This will be a plain dictionary-based (alphabetical) sort, so it should be much faster than sorting on the original content.
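One way this normalise-then-sort idea might look in Java is sketched below. The names `DescriptionSortKey`, `getOnlyText`, and `sortByKey` are hypothetical stand-ins for the `fn_getOnlyText` helper mentioned above, and the choice to keep only ASCII letters and upper-case them is an assumption about what "only text" should mean here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class DescriptionSortKey {

    /** Drops spaces, commas, dots, digits and other non-letters, then upper-cases the rest. */
    static String getOnlyText(String description) {
        return description.replaceAll("[^A-Za-z]", "").toUpperCase(Locale.ROOT);
    }

    /** Returns a sorted copy, ordered by the normalised key rather than the raw text. */
    static List<String> sortByKey(List<String> descriptions) {
        List<String> sorted = new ArrayList<>(descriptions);
        // List.sort is stable, so descriptions with equal keys keep their original order.
        sorted.sort(Comparator.comparing(DescriptionSortKey::getOnlyText));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> descriptions = new ArrayList<>();
        descriptions.add("STREAM          CONDITION");
        descriptions.add("CONDUCTIVITY, 25C");
        descriptions.add("STREAM CONDITION");
        for (String description : sortByKey(descriptions)) {
            System.out.println(getOnlyText(description) + " <- " + description);
        }
    }
}
```

With this key, differences in spacing and punctuation no longer affect the order, so "STREAM CONDITION" with 1 space and with 10 spaces end up next to each other. It does not reconcile abbreviations: "UNFIL.REA" and "UNFILTERED REACTIVE" still produce different keys and would need a separate mapping. For 300 000 lines it may also be worth computing each key once and storing it, rather than recomputing it on every comparison as this sketch does.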
