
Linux: counting spaces and other characters in file

Problem:

I need to match an exact format for a mailing machine software program, which expects a specific layout. I can count the number of newlines, carriage returns, tabs, etc. using tools like

cat -vte

and

od -c

and

wc -l ( or wc -c )

However, I'd like to know the exact number of leading and trailing spaces between characters and sections of text, and the number of tabs as well.

Question:

How would you go about analyzing and then exactly matching a template using common Unix tools plus Perl or Python? One-liners preferred. Also, what's your advice for matching a DOS-encoded file? Would you convert it to Unix format first and then analyze it, or leave it as is?

UPDATE

Using this to see individual spaces (assumes no '%' chars in file):

sed 's/ /%/g' filename.000

Plan to build a script that analyzes each line's tab and space content.

Using @shiplu's solution with a nod to the anti-cat crowd:

while read l;do echo "$l";echo $((`echo "$l" | wc -c` - `echo "$l" | tr -d ' ' | wc -c`));done<filename.000

Still needs some tweaks for Windows, but it's well on its way.

SAMPLE TEXT

Key for reading:

Newlines are marked with \n

Carriage returns are marked with \r

Unknown space/tab characters are marked with [:space:] (need counts on those)

\r\n
\n
[:space:]Institution Anon LLC\r\n
[:space:]123 Blankety St\r\n
[:space:]Greater Abyss, AK  99999\r\n
\n
\n
[:space:]                                10/27/2011\r\n
[:space:]Requested materials are available for pickup:\r\n
[:space:]e__\r[:space:]                     D_ \r[:space:]   _O\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
[:space:]Bathtime for BonZo[:space:]       45454545454545[:space:]  10/27/2011\r\n
\n
\n
\n
\n
\n
\n
[:space:]                             Pantz McManliss\r\n
[:space:]                             Gibberish Ave\r\n
[:space:]                             Northern Mirkwood, ME  99999\r\n
( untold variable amounts of \n chars go here )

UPDATE 2

Using IFS with read gives results similar to the Ruby one-liner posted in an answer below.

while IFS='' read -r line
do
    printf "%s\n" "$line" | sed 's/ /%/g' | grep -o '%' | wc -w
done < filename.000

perl -nlE'say 0+( () = /\s/g );'

Unlike the currently accepted answer, this doesn't split the input into fields only to discard the result. It also doesn't needlessly create an array just to count the number of values in a list.

Idioms used:

  • 0+( ... ) imposes scalar context like scalar( ... ), but it's clearer because it tells the reader a number is expected.
  • List assignment in scalar context returns the number of elements returned by its RHS, so 0+( () = /.../g ) gives the number of times /.../g matched.
  • -l, when used with -n, causes the input to be "chomped", so this removes line feeds from the count.

If you're just interested in spaces (U+0020) and tabs (U+0009), the following is faster and simpler:

perl -nE'say tr/ \t//;'

In both cases, you can pass the input via STDIN or via a file named by an argument.
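For readers who prefer Python, here is a minimal sketch of the same per-line space-and-tab count. It is not part of the original answer, and the function name count_blanks_per_line is made up for illustration. It works on raw bytes and splits on \n only, so a \r from DOS line endings (or a bare \r inside a line, as in the sample) is left alone and not counted.

```python
import io
from typing import List


def count_blanks_per_line(data: bytes) -> List[int]:
    """Return the number of spaces (0x20) and tabs (0x09) on each line."""
    # Iterating over BytesIO yields lines terminated by b"\n" only,
    # so carriage returns stay inside the line and are simply ignored here.
    return [line.count(b" ") + line.count(b"\t") for line in io.BytesIO(data)]


# Small demo on made-up bytes; for a real file, pass open(path, "rb").read().
print(count_blanks_per_line(b" a\tb  \r\n\nx y\r z\n"))
```

Reading in binary mode matters for the DOS case: text mode with default newline handling would translate \r\n before you ever see it.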

Regular expressions in Perl or Python would be the way to go here.

Yes, it may take an initial time investment to learn "perl, schmerl, zwerl", but once you've gained experience with an extremely powerful tool like regular expressions, it can save you an enormous amount of time down the road.


Counting blanks:

sed 's/[^ ]//g' FILE | tr -d "\n" | wc -c

This counts blanks before, after, and between text. Do you want to count newlines, tabs, etc. in the same pass and sum them up, or as a separate step?
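If a separate tally per character turns out to be what is wanted, one possible sketch in Python follows. The names NAMES and whitespace_tally are invented for this example, and the choice of which bytes to track (space, tab, LF, CR) is an assumption based on the characters mentioned in the question.

```python
from collections import Counter

# Bytes of interest, keyed by their numeric value (assumed set, adjust as needed).
NAMES = {0x20: "space", 0x09: "tab", 0x0A: "LF", 0x0D: "CR"}


def whitespace_tally(data: bytes) -> Counter:
    """Count each tracked whitespace byte in `data` separately."""
    tally = Counter()
    for byte in data:  # iterating over bytes yields integers
        if byte in NAMES:
            tally[NAMES[byte]] += 1
    return tally


# Demo on made-up bytes; for a real file, pass open(path, "rb").read().
print(whitespace_tally(b" a\tb \r\n\n"))
```

Comparing the CR and LF totals also gives a quick hint about line endings: equal counts suggest consistent DOS endings, while a mismatch suggests a mix like the one in the sample text.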

perl -nwE 'print; for my $s (/([\t ]+)/g) { say "Count: ", length $s }' input.txt

This will count individual groups of tabs or spaces, instead of counting all the whitespace in the entire line. For example:

    foo        bar

Will print

    foo        bar
Count: 4
Count: 8

You may wish to skip single spaces (spaces between words), i.e. not count the spaces in "Bathtime for BonZo". If so, replace + with {2,} or whatever minimum you think is appropriate.
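The same idea, including the minimum run length, can be sketched in Python. This is an illustration rather than a translation of the Perl above: blank_runs and its min_len parameter are hypothetical names, and it also reports the column where each run starts, which may help when matching a fixed-width template.

```python
import re


def blank_runs(line: str, min_len: int = 1):
    """Yield (start_column, length) for each run of spaces/tabs >= min_len."""
    # {n,} means "n or more", mirroring the {2,} suggestion above.
    pattern = re.compile(r"[ \t]{%d,}" % min_len)
    for match in pattern.finditer(line):
        yield match.start(), len(match.group())


# Sample row rebuilt piece by piece so the run lengths are explicit.
row = "Bathtime for BonZo" + " " * 7 + "45454545454545" + " " * 2 + "10/27/2011"
for column, length in blank_runs(row, min_len=2):
    print("column", column, "length", length)
```

With min_len=2 the single spaces inside "Bathtime for BonZo" are skipped, and only the padding between fields is reported.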

If you want to count the number of spaces in pm.txt, this command will do:

cat pm.txt | while read l;
do echo $((`echo "$l" | wc -c` - `echo "$l" | tr -d ' ' | wc -c`));
done;

If you want to count the number of spaces, \r, \n, and \t characters, use this:

cat pm.txt | while read l;
do echo $((`echo "$l" | wc -c` - `echo "$l" | tr -d ' \r\n\t' | wc -c`));
done;

read will strip any leading whitespace. If you don't want that, there is a nasty workaround. First split your file so that each output file holds exactly one line:

split -l 1 -d pm.txt

After that there will be a bunch of x* files. Now loop through them:

for x in x*; do echo $((`cat $x |  wc -c` - `cat $x | tr -d ' \r\n\t' | wc -c`)); done;

Remove those files afterwards with rm x*.
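If splitting into one file per line feels too heavy, a Python sketch along the following lines avoids the stripping problem altogether, because nothing is trimmed unless the code asks for it. The function name edge_blanks is hypothetical, and treating only spaces and tabs as "blanks" is an assumption taken from the question.

```python
def edge_blanks(line: bytes):
    """Return (leading, trailing) counts of spaces/tabs for one line.

    The line ending (\\n or \\r\\n) is set aside first, so it is not
    mistaken for trailing whitespace.
    """
    body = line.rstrip(b"\r\n")
    leading = len(body) - len(body.lstrip(b" \t"))
    trailing = len(body) - len(body.rstrip(b" \t"))
    return leading, trailing


# Demo on a made-up DOS-style line; for a file, iterate over open(path, "rb").
print(edge_blanks(b"  \tfoo bar \r\n"))
```

Note that a line consisting only of blanks reports the same number for both leading and trailing, since the whole line is one run.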

In case Ruby counts (it does count :)

ruby -lne 'puts scan(/\s/).size'

and now some Perl (slightly less intuitive, IMHO):

perl -lne 'print scalar(@{[/(\s)/g]})'

If you ask me, I'd write a simple C program to do the counting and formatting all in one go. But that's just me. By the time I got finished fiddle-farting around with perl, schmerl, zwerl, I'd have wasted half a day.
