I am trying to write a Pig script that counts all characters (letters and special characters) and gives the count of each character separately. I have been using the following script, but it only counts letters and does not include special characters such as , ? and :. Please help!
A = load 'pigfiles/p.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\\w+';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Just use '(.+)' in place of '\\w+' in the filter, and it will count punctuation as well as letters in the file.
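To see why the pattern matters, here is a minimal Python sketch (an emulation, not Pig itself). It assumes that Pig's `matches` operator requires the whole string to match, as Java's `String.matches` does, so `re.fullmatch` is used as the closest analogue. The token list and the helper `pig_matches` are hypothetical and exist only for illustration.

```python
import re

# Hypothetical tokens, roughly what TOKENIZE would emit for one sample line.
tokens = ["Lets", "try", "punctuations!?", "How?", "Why!?"]

def pig_matches(value, pattern):
    # Assumption: Pig's `matches` must match the entire string,
    # so fullmatch (not search) is the closest Python equivalent.
    return re.fullmatch(pattern, value) is not None

# '\w+' rejects any token that contains a non-word character.
kept_by_word_pattern = [t for t in tokens if pig_matches(t, r"\w+")]

# '.+' accepts every non-empty token, punctuation included.
kept_by_any_pattern = [t for t in tokens if pig_matches(t, r".+")]

print(kept_by_word_pattern)
print(kept_by_any_pattern)
```

With '\\w+' a token such as "How?" is dropped as a whole, so its letters go uncounted too, not just the question mark.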
Example:
File (cat a.txt):
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
Code:
A = load 'a.txt';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '(.+)';
D = foreach C generate flatten(TOKENIZE(REPLACE(word,'','|'), '|')) as letter;
E = group D by letter;
F = foreach E generate COUNT(D), group;
store F into 'pigfiles/wordcount';
Output (cat part-r-00000):
4 !
1 ;
3 ?
2 H
1 I
2 L
1 W
1 a
1 c
1 d
3 e
1 g
2 h
3 i
1 j
1 m
3 n
4 o
1 p
1 r
7 s
7 t
4 u
1 w
2 y
The reason you are not getting some of the special characters is that TOKENIZE uses space, double quote ("), comma (,), parentheses (()), and star (*) as delimiters.
So when you apply TOKENIZE to (chararray)$0, those delimiter characters are discarded and never counted.
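The effect can be illustrated with a small Python sketch (an emulation under stated assumptions, not Pig code). It assumes the default delimiter set is exactly the one listed above. The names `DELIMS` and `tokenize` are hypothetical stand-ins for Pig's TOKENIZE.

```python
import re
from collections import Counter

# Assumed default TOKENIZE delimiters: space, double quote, comma, parentheses, star.
DELIMS = r'[ ",()*]+'

def tokenize(line):
    # Split on runs of delimiters and drop the empty strings left at the edges.
    return [tok for tok in re.split(DELIMS, line) if tok]

lines = [
    '"HI"',
    'Lets try using some punctuations!? How? Why!?',
    'Lets, just; do this!!',
]

# Count characters after tokenizing (delimiters are already gone).
via_tokens = Counter(ch for line in lines for tok in tokenize(line) for ch in tok)

# Count characters straight from the line, skipping only spaces.
direct = Counter(ch for line in lines for ch in line if ch != ' ')

# Characters that tokenizing silently dropped.
lost = direct - via_tokens
print(dict(lost))
```

For this sample only the double quotes and the comma go missing, which is consistent with the earlier output that lists ! ; and ? but no comma or quote.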
Using Ani Menon's sample data, the script and output are below.
Input
"HI"
Lets try using some punctuations!? How? Why!?
Lets, just; do this!!
PigScript
A = LOAD 'test5.txt';
B = FOREACH A GENERATE FLATTEN(TOKENIZE(REPLACE((chararray)$0,'','|'), '|')) AS letter;
C = FILTER B BY letter != ' ';
D = GROUP C BY letter;
E = FOREACH D GENERATE COUNT(C.letter), group;
DUMP E;
Output
Here is one solution:
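-- Load each line of the file as a single chararray field.
lines = LOAD 'p.txt' AS (line: chararray);
-- Split each line on the empty pattern and flatten to one row per piece.
characters = FOREACH lines GENERATE FLATTEN(STRSPLITTOBAG(line, '')) AS character;
-- Group identical characters and count each group.
charGroups = GROUP characters BY character;
result = FOREACH charGroups GENERATE group, COUNT($1);
STORE result INTO 'charcount.txt';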
It would produce output that looks like this: