Linux shell scripting: How can I remove initial numbers in a word list file?

Question

I have this example word list in a text file (one word per line):

John
J0hn
45John
Smith
Sm1th
Jane
333Jane
555Doe
12345

And I want to obtain:

John
J0hn
Smith
Sm1th
Jane
Doe
12345

That is: I would like to remove the numbers at the beginning of each word (numbers inside words are allowed) and then remove any lines that have become duplicates.
Note that only numbers that come before letters must be deleted, so 12345 will remain in the list. I have some programming experience, so I could write one loop to check for those numbers and another to remove duplicate words, but I think the Linux shell must have some simple commands or parameter expansions that could solve this for me.

Losing the original order of the file is acceptable, but a method that preserves it would be better.
Whitespace is not expected in this kind of "dictionary" text file.

Ideas are welcome. Thank you.

Intended use:

Isolating the words used in password databases (John, 45John, 12345John) to obtain statistics on their diversity.

Note: this very similar question could help anyone trying to answer. I am not sure about the syntax of perl, awk and sed, so I prefer to ask rather than attempt some strange modification myself that could end in disaster.

Answer 1

You can use sed for this:

sed -r 's/^[0-9]+(.*[^0-9].*)$/\1/g'

If I run this on your file, I get:

John
J0hn
John
Smith
Sm1th
Jane
Jane
Doe
12345

You can then use perl to filter out the duplicates:

sed -r 's/^[0-9]+(.*[^0-9].*)$/\1/g' | perl -ne 'print unless $seen{$_}++'

Which gives:

John
J0hn
Smith
Sm1th
Jane
Doe
12345
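The sed expression in this answer leaves 12345 alone because the group `(.*[^0-9].*)` only matches when at least one non-digit character follows the leading digits. A minimal sketch that checks this on three of the sample words is shown here; the function name `strip_with_sed` is made up for illustration, and `-E` is used as the spelling of `-r` that both GNU and BSD sed accept.

```shell
#!/bin/sh
# strip_with_sed: hypothetical helper name, not part of the answer.
# Strips a leading run of digits, but only from lines that also
# contain at least one non-digit character.
strip_with_sed() {
    sed -E 's/^[0-9]+(.*[^0-9].*)$/\1/'
}

# 45John loses its prefix; 12345 and J0hn pass through unchanged.
printf '%s\n' 45John 12345 J0hn | strip_with_sed
```

This prints John, 12345 and J0hn, one per line.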

Answer 2

You should use the sed answer, which is going to be really fast, but just for fun here's an answer in pure POSIX shell, since your question was about shell scripting. It only strips the leading digits; it does not remove duplicates.

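while IFS= read -r i; do
    o="$i"                        # keep the original line
    while :; do
        l=${i#[0-9]}              # drop one leading digit, if there is one
        [ "$l" = "$i" ] && break  # nothing was removed: no leading digit left
        i="$l"
    done
    # An all-digit line is stripped down to nothing, so print the original.
    if [ -z "$i" ]; then echo "$o"; else echo "$i"; fi
done < file.txt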

(Okay, I cheated, [ (aka /bin/test) is not always a built-in command.)
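The question also asked about parameter expansions, and the inner loop can probably be replaced by two of them. The following is only a sketch under that assumption: `${word%%[!0-9]*}` isolates the leading run of digits, and `${word#"$digits"}` removes it. The function name `strip_leading_digits` and the variable names are invented for this example.

```shell
#!/bin/sh
# strip_leading_digits: hypothetical helper name.
# Reads one word per line on stdin and prints each word without its
# leading digits, except for all-digit words, which are kept as they are.
strip_leading_digits() {
    while IFS= read -r word; do
        digits=${word%%[!0-9]*}   # leading run of digits (may be empty)
        rest=${word#"$digits"}    # the word with that run removed
        if [ -n "$rest" ]; then
            printf '%s\n' "$rest"
        else
            # Nothing left after stripping: the word was all digits.
            printf '%s\n' "$word"
        fi
    done
}

printf '%s\n' 45John J0hn 12345 | strip_leading_digits
```

Like the loop version, this does not remove duplicates; its output would still need to go through a filter such as `sort -u`.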

Answer 3

This should do it:

sed -r 's/^[0-9]+([A-Za-z])/\1/g' | sort -u

The regexp matches a sequence of digits at the beginning of the line followed by a letter. The capture group gets the letter, and the whole match is replaced by the letter.

Piping to sort -u gets rid of the duplicates, but it also sorts the output, so the original order of the file is lost.
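Since the question prefers to keep the original order where possible, here is a rough comparison of `sort -u` with an order-preserving filter. The awk idiom `!seen[$0]++` is a substitution made for this sketch and is not part of the answer; the function names are likewise invented, and `LC_ALL=C` is set only to make the sort order predictable.

```shell
#!/bin/sh
# All function names below are hypothetical, chosen for this example.

# The sample word list from the question.
sample() {
    printf '%s\n' John J0hn 45John Smith Sm1th Jane 333Jane 555Doe 12345
}

# The substitution from this answer: digits at the start of a line,
# followed by a letter, are replaced by that letter alone.
strip_digits() {
    sed -E 's/^[0-9]+([A-Za-z])/\1/'
}

# Duplicates removed, output sorted in byte order.
dedupe_sorted() {
    strip_digits | LC_ALL=C sort -u
}

# Duplicates removed, first occurrence kept in its original position.
dedupe_in_order() {
    strip_digits | awk '!seen[$0]++'
}

sample | dedupe_sorted
echo '---'
sample | dedupe_in_order
```

The first list comes out as 12345, Doe, J0hn, Jane, John, Sm1th, Smith; the second keeps the order requested in the question: John, J0hn, Smith, Sm1th, Jane, Doe, 12345.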
