I am a total beginner in Perl. I have a large file (around 100 GB) which looks like this:
domain, ip
"www.google.ac.",173.194.33.111
"www.google.ac.",173.194.33.119
"www.google.ac.",173.194.33.120
"www.google.ac.",173.194.33.127
"www.google.ac.",173.194.33.143
"apple.com., 173.194.33.143
"studio.com.", 173.194.33.143
"www.google.ac.",101.78.156.201
"www.google.ac.",101.78.156.201
So basically I have (1) duplicate lines, (2) one IP with different domains, and (3) one domain with different IPs. I would like to remove the duplicate lines from the file (the ones with the same domain,ip pair).
I have already reviewed other answers to the same question; none of them addresses my problem with large files.
Does anybody have a clue how I can do this in Perl? Or any suggestion for a more suitable language?
The easiest thing to do is read the file a line at a time and use each line as the key of a hash. You have to have memory to store each unique line once, though. There's no getting around that.
Here's a one-liner as run from the shell:
perl -ne '$lines{$_}++; END { print keys %lines }' filename