Extracting specific data from text file - bash

I have a huge text file (~4.5 GB in size) that holds ~48 million lines. All lines are in the following syntax:

    country01/city01/street01/building01
    country01/city01/street01/building02
    country01/city01/street02/building01
    country01/city01/street02/building02
    country01/city02/street01/building01
    .
    .
    etc...

I'm trying to find a quick way to extract the street names and the number of buildings each one holds. I tried various combinations of sed and awk with wc -l, but it gets messy and I'm definitely missing something.

I'd appreciate any help!

If you just need to know the number of buildings on a street, you can do the following:

$ cut -d'/' -f-3 file | sort | uniq -c

This will give you a sorted list of streets with a count next to each one:

2 country01/city01/street01
2 country01/city01/street02
1 country01/city02/street01
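
To try this out without touching the 4.5 GB file, here is a minimal sketch on a tiny sample. The temporary file, the variable names, and the final awk step that swaps the columns to "street count" are my own additions for illustration, and it assumes the paths contain no spaces. Running sort under LC_ALL=C makes it compare raw bytes, which is usually faster on large inputs.

```shell
# Build a small sample file (hypothetical data in the question's format).
sample=$(mktemp)
printf '%s\n' \
  country01/city01/street01/building01 \
  country01/city01/street01/building02 \
  country01/city01/street02/building01 \
  country01/city01/street02/building02 \
  country01/city02/street01/building01 > "$sample"

# Keep fields 1-3 (country/city/street), sort so equal keys are adjacent,
# count each run with uniq -c, then print "key count" instead of "count key".
counts=$(cut -d'/' -f1-3 "$sample" | LC_ALL=C sort | uniq -c | awk '{print $2, $1}')
printf '%s\n' "$counts"

rm -f "$sample"
```

On the sample this prints each street path followed by its building count, one per line.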

If there might be duplicates in your list, you can do this:

$ sort -u file | cut -d'/' -f-3 | uniq -c

If the file is so large that it might not fit into memory and sort takes too long, you can do the following, which keeps only one counter per distinct street in memory:

$ awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}' file
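
One thing to be aware of: the order in which awk's `for (i in a)` visits the keys is unspecified, so the output is not sorted. Since the result has only one line per street, sorting it afterwards is cheap. A small sketch, again on a made-up sample file (the file and variable names are hypothetical):

```shell
# Small sample in the question's format (hypothetical data).
sample=$(mktemp)
printf '%s\n' \
  country01/city01/street01/building01 \
  country01/city01/street01/building02 \
  country01/city01/street02/building01 \
  country01/city01/street02/building02 \
  country01/city02/street01/building01 > "$sample"

# Single pass: a[$1,$2,$3] joins the three fields with SUBSEP, and because
# SUBSEP is set to "/" the key prints back as country/city/street.
# The per-street summary is then sorted on the path column (field 2).
awk_counts=$(awk 'BEGIN{FS=SUBSEP="/"} {a[$1,$2,$3]++} END{for (k in a) print a[k], k}' "$sample" \
  | LC_ALL=C sort -k2)
printf '%s\n' "$awk_counts"

rm -f "$sample"
```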

or, if you might have duplicates:

$ awk '($0 in a){next}{print; a[$0]}' file | awk 'BEGIN{FS=SUBSEP="/"}{a[$1,$2,$3]++}END{for(i in a) print a[i],i}'
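
As a possible variation (not part of the answer as written), the two awk processes could be merged into one by using the common `!seen[$0]++` idiom to skip lines that were already counted. Note that either way every distinct line is held in memory, so this only helps when the set of unique lines fits. The array name `seen` and the sample file are assumptions for the sketch:

```shell
# Sample with one duplicated line (hypothetical data).
sample=$(mktemp)
printf '%s\n' \
  country01/city01/street01/building01 \
  country01/city01/street01/building01 \
  country01/city01/street01/building02 \
  country01/city02/street01/building01 > "$sample"

# seen[$0]++ is 0 (false) the first time a line appears, so !seen[$0]++
# is true only for first occurrences; only those are counted per street.
dedup_counts=$(awk 'BEGIN{FS=SUBSEP="/"}
  !seen[$0]++ {a[$1,$2,$3]++}
  END{for (k in a) print a[k], k}' "$sample" | LC_ALL=C sort -k2)
printf '%s\n' "$dedup_counts"

rm -f "$sample"
```

Here the duplicated building01 line is counted once, so street01 in city01 reports 2 buildings rather than 3.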
