I have a text file with approximately 25 million lines. The data on the lines looks like this:
12ertwrtrdfger
897 erterterte
545ret3w2trewt 345
968587563453 345
897 53647565344553
I want to analyze the most frequent prefixes and suffixes. In the example above, two lines start with 897 and two lines end with 345; I want to see which prefixes and suffixes are the most frequent. I would also like to get the results as a bar or pie chart. Is there a data analysis program that does this kind of analysis?
sed ... <file | sort | uniq -c
The sed arguments need to extract the first or last 3 characters of each line; uniq -c then counts the frequency of each string.
Tack on
| sort -nbr
if you want to sort by most frequent first.
Tack on
| head -10
to see only the top 10.
Then feed the result into LibreOffice Calc to get a spreadsheet with graphing.
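If the spreadsheet import trips over the space-padded output of uniq -c, converting it to CSV first may help. The following is a minimal sketch, not part of the original answer; the function names and the file names top10.txt and top10.csv are hypothetical.

```python
import csv


def parse_uniq_counts(lines):
    """Parse `uniq -c` style lines ("   count text") into (text, count) pairs."""
    rows = []
    for line in lines:
        stripped = line.rstrip("\n").lstrip()
        if not stripped:
            continue
        # The count ends at the first space; everything after it is the
        # counted string, including any leading spaces of its own.
        count, _, text = stripped.partition(" ")
        rows.append((text, int(count)))
    return rows


def write_counts_csv(rows, path):
    """Write (text, count) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["string", "count"])
        writer.writerows(rows)


# Hypothetical usage, assuming the pipeline output was saved to top10.txt:
#   with open("top10.txt", encoding="utf-8") as f:
#       write_counts_csv(parse_uniq_counts(f), "top10.csv")
```

The resulting two-column file can then be opened directly in the spreadsheet and charted from there.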
First 5:
sed -E 's/^(.....)(.*)$/\1/' <abc.txt | sort | uniq -c >pre5.txt
Last 5, using a different way to specify exactly 5 characters:
sed -E 's/^(.*)(.{5})$/\2/' <abc.txt | sort | uniq -c >suf5.txt
However, there is a "bug": when an entire line is shorter than 5 characters, the pattern does not match and the short line is sent to the output unchanged.
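One way around the short-line issue is to do the extraction in a small script that checks the line length explicitly. This is only a sketch under assumptions: the function name count_affixes is hypothetical, and it chooses to skip short lines (reporting how many were skipped) rather than count them.

```python
from collections import Counter


def count_affixes(lines, n=5):
    """Count n-character prefixes and suffixes, skipping lines shorter than n."""
    prefixes, suffixes = Counter(), Counter()
    skipped = 0
    for line in lines:
        text = line.rstrip("\r\n")  # drop the line terminator before measuring
        if len(text) < n:
            skipped += 1
            continue
        prefixes[text[:n]] += 1
        suffixes[text[-n:]] += 1
    return prefixes, suffixes, skipped


# Hypothetical usage with the file name from the commands above:
#   with open("abc.txt", encoding="utf-8") as f:
#       pre, suf, skipped = count_affixes(f, n=5)
#   print(pre.most_common(10), suf.most_common(10), skipped)
```

Reading the file object line by line keeps memory use tied to the number of distinct prefixes and suffixes rather than to the file size.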
You can try the Python code below. It ran in about 1.5 minutes on a 1 GB file matching your description, which had 922180 distinct prefixes and 891532 distinct suffixes.
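import pandas as pd

pre = {}
suf = {}
with open('input.txt', 'r') as f:
    for line in f:
        # First 3 characters, and the last 3 before the trailing newline
        p, s = line[:3], line[-4:-1]
        pre[p] = pre.get(p, 0) + 1
        suf[s] = suf.get(s, 0) + 1
# Column 0 holds the prefix/suffix, column 1 its count
df_pre = pd.DataFrame([[e[0], e[1]] for e in pre.items()])
df_suf = pd.DataFrame([[e[0], e[1]] for e in suf.items()])
# sort_values returns a new DataFrame, so keep the result
df_pre = df_pre.sort_values([1], ascending=False)
df_suf = df_suf.sort_values([1], ascending=False)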
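To get from the pre and suf dictionaries to the chart the question asks for, something like the following could work. It is a sketch, not part of the original answer: it assumes matplotlib is installed, the helper names are hypothetical, and folding everything outside the top n into an "other" bucket is a design choice made here so a pie chart stays readable with hundreds of thousands of distinct keys.

```python
def top_n_with_other(counts, n=10):
    """Return (labels, values) for the n most frequent keys plus an 'other' bucket."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    top, rest = ranked[:n], ranked[n:]
    labels = [key for key, _ in top]
    values = [value for _, value in top]
    remainder = sum(value for _, value in rest)
    if remainder:
        labels.append("other")
        values.append(remainder)
    return labels, values


def plot_bar(labels, values, title, path):
    """Save a bar chart of the given counts to an image file."""
    # Imported here so the helper above works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file; no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(labels, values)
    ax.set_title(title)
    ax.set_ylabel("lines")
    fig.savefig(path)
    plt.close(fig)


# Hypothetical usage with the dictionaries built above:
#   labels, values = top_n_with_other(pre, n=10)
#   plot_bar(labels, values, "Most frequent prefixes", "prefixes.png")
```

For a pie chart, replacing ax.bar(labels, values) with ax.pie(values, labels=labels) should be enough.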
File generation: the lines were built from the 98 distinct characters available in string.printable. The file contained 25 million lines of around 40 characters each.
I've solved my problem with the code below:
cut -c 1-5 <abc.txt | sort | uniq -cd | sort -nbr > pre5.txt