
Pig script for counting characters

I am trying to write a Pig script that counts all characters (special characters as well as letters) and gives the count of each character separately. I have been trying the following script, but it only counts letters and does not include special characters like , ? and :. Please help!

A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';

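Just use '(.+)' in place of '\\w+' and it will give you a count of the letters and punctuation in the file. The matches operator has to match the whole token, so '\\w+' drops any token that contains a non-word character, while '(.+)' keeps every non-empty token. Characters that TOKENIZE itself treats as delimiters, such as the comma and the double quote, are still not counted, as the output below shows.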

Example:

File: [ cat a.txt ]

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!

Code:

A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';

Output: cat part-r-00000

4       !
1       ;
3       ?
2       H
1       I
2       L
1       W
1       a
1       c
1       d
3       e
1       g
2       h
3       i
1       j
1       m
3       n
4       o
1       p
1       r
7       s
7       t
4       u
1       w
2       y

The reason you are not getting some of the special characters is that TOKENIZE uses space, double quote ("), comma (,), parentheses (()), and star (*) as delimiters.

So when you use TOKENIZE on (chararray)$0, the delimiter characters are discarded and never counted.

Using Ani Menon's sample data, here are the script and its output.

Input

"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!

PigScript

A = LOAD 'test5.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;

Output


Here is one solution:

lines = LOAD 'p.txt' AS (line: chararray);

characters = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS character;

charGroups = GROUP characters BY character;

result = FOREACH charGroups GENERATE group, COUNT($1);

store result into 'charcount.txt';

It would produce output that looks like this:

[Image: screenshot of the output]
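
A variant of the same idea, offered as a minimal sketch rather than a definitive version. It assumes a Pig release that provides STRSPLITTOBAG, reads each line whole with TextLoader (the default loader splits on tabs, so only the text before the first tab would end up in the line field), drops spaces and empty strings, and sorts by count. The aliases (raw_lines, chars, ch, n, and so on) are illustrative names, not taken from the answer above.

```pig
-- Read every input line as a single chararray, tabs included.
raw_lines = LOAD 'p.txt' USING TextLoader() AS (line:chararray);

-- Split each line on the empty pattern, giving one row per character.
chars = FOREACH raw_lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS ch;

-- Drop nulls, empty strings (from blank lines), and plain spaces.
visible = FILTER chars BY ch IS NOT NULL AND ch != '' AND ch != ' ';

-- Count the occurrences of each remaining character.
char_groups = GROUP visible BY ch;
char_counts = FOREACH char_groups GENERATE group AS ch, COUNT(visible) AS n;

-- Most frequent characters first; ties ordered by character.
sorted_counts = ORDER char_counts BY n DESC, ch;
DUMP sorted_counts;
```

Because no TOKENIZE call is involved, characters such as the comma, the double quote, and | are counted like any other character. Remove the ch != ' ' condition if spaces should be counted too.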


 