
How to analyze frequency of characters in a text file

I have a text file with approximately 25 million lines. The data on each line looks similar to this:

12ertwrtrdfger
897 erterterte
545ret3w2trewt 345
968587563453 345
897 53647565344553


I want to find the most frequent prefixes and suffixes. In the example above, two lines start with 897 and two lines end with 345. I want to see which prefixes and suffixes are the most frequent, and I would also like the results as a bar or pie chart. Is there a data analysis program that does this kind of analysis?

sed ... <file | sort | uniq -c

The sed arguments need to extract the first or last 3 characters of each line.

uniq -c counts the frequency of each string.

Tack on | sort -nbr if you want to sort by most frequent first.

Tack on | head -10 to see only the top 10.

Then feed the result into LibreOffice Calc to get a spreadsheet with graphing.
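If the spreadsheet import trips over the space-padded counts that uniq -c prints, a small conversion to CSV may help. The following is only a sketch: the function name uniq_counts_to_csv and the column headers are made up for illustration, and it assumes each input line has the form "count, one space, string".

```python
import csv

def uniq_counts_to_csv(lines, out):
    """Turn `uniq -c` style lines ("      2 897") into CSV rows (string, count)."""
    writer = csv.writer(out)
    writer.writerow(["string", "count"])  # hypothetical header names
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        # Drop the padding before the count, then split at the first space only,
        # so a counted string that itself starts with a space is preserved.
        count, _, text = line.lstrip().partition(" ")
        writer.writerow([text, int(count)])
```

It can be called with any two file objects, for example open("pre5.txt") as the input and open("pre5.csv", "w", newline="") as the output.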

sed -E 's/^(.....)(.*)$/\1/' <abc.txt | sort | uniq -c >pre5.txt

Last 5, using a different way to specify exactly 5 characters:

sed -E 's/^(.*)(.{5})$/\2/' <abc.txt | sort | uniq -c >suf5.txt

However, there is a "bug": when the entire line is shorter than 5 characters, the pattern does not match, so the short line is sent to the output unchanged.
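One way to sidestep the short-line issue is to do the counting in a script that simply skips lines shorter than the requested length. This is a minimal sketch rather than a drop-in replacement for the pipeline; count_affixes and its parameter n are names chosen here for illustration.

```python
from collections import Counter

def count_affixes(lines, n=5):
    """Count the first and last n characters of every line that has at least n characters."""
    prefixes, suffixes = Counter(), Counter()
    for raw in lines:
        line = raw.rstrip("\r\n")  # the line ending is not part of the data
        if len(line) < n:
            continue  # short lines are skipped instead of being counted whole
        prefixes[line[:n]] += 1
        suffixes[line[-n:]] += 1
    return prefixes, suffixes
```

Passing an open file object as lines keeps memory use proportional to the number of distinct prefixes and suffixes, not to the file size.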

You can try the Python code below. It ran in 1.5 minutes on a 1 GB file matching your description, which had 922180 different prefixes and 891532 different suffixes.

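import pandas as pd

pre = {}
suf = {}
with open('input.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the newline so the last line is handled like the rest
        p, s = line[:3], line[-3:]  # first and last 3 characters
        pre[p] = pre.get(p, 0) + 1
        suf[s] = suf.get(s, 0) + 1

# One row per distinct prefix/suffix: column 0 is the string, column 1 is its count
df_pre = pd.DataFrame(list(pre.items()))
df_suf = pd.DataFrame(list(suf.items()))

# sort_values returns a new DataFrame, so keep the result
df_pre = df_pre.sort_values(1, ascending=False)
df_suf = df_suf.sort_values(1, ascending=False)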
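For a quick look at the top entries before reaching for a plotting library or a spreadsheet, the counts can be rendered as a plain-text bar chart. The sketch below takes a dict of counts such as pre or suf. The function name text_bar_chart and the top and width parameters are illustrative assumptions, not part of the answer above.

```python
from collections import Counter

def text_bar_chart(counts, top=10, width=40):
    """Return one text row per key for the `top` most frequent keys, longest bar first."""
    ranked = Counter(counts).most_common(top)
    if not ranked:
        return []
    biggest = ranked[0][1]  # the most frequent key gets a full-width bar
    rows = []
    for key, count in ranked:
        bar = "#" * max(1, round(width * count / biggest))
        # repr() makes leading or trailing spaces in a key visible
        rows.append(f"{key!r:>8} {count:>9} {bar}")
    return rows
```

Printing "\n".join(text_bar_chart(pre)) shows the ten most frequent prefixes with bars scaled to the largest count.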

File generation: the lines were built from 98 distinct characters taken from string.printable. The file contained 25 million lines of around 40 characters each.
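The generator script itself is not shown, so the following is only a guess at how a comparable test file could be produced. It assumes the 98 characters are the 100 characters of string.printable minus "\n" and "\r", and the names ALPHABET and random_lines are invented for this sketch.

```python
import random
import string

# Assumption: drop "\n" and "\r" so a generated line never contains a line break.
ALPHABET = [c for c in string.printable if c not in "\r\n"]

def random_lines(count, length=40, seed=0):
    """Yield `count` random strings of `length` characters each, reproducible via `seed`."""
    rng = random.Random(seed)
    for _ in range(count):
        yield "".join(rng.choices(ALPHABET, k=length))
```

Writing each yielded string followed by "\n" to a file, with count set to 25 million, would give a file of roughly the size described.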

I've solved my problem with the command below:

cut -c 1-5 <abc.txt | sort | uniq -cd | sort -nbr > pre5.txt

