How to count the number of bold words and italic words in a markdown syntax file
I've read that bold and italic words can be represented in Markdown by **bold_text** and *italic_text*, respectively. To make text both bold and italic at once, you can wrap it in four asterisks (two on each side) for bold and two underscores (one on each side) for italic, or vice versa.
I would like to write a bash script that determines the number of bold words and italic words. I guess that this comes down to counting the number of double asterisks, single asterisks, double underscores and single underscores. My question is how to count the occurrences of specific strings like "**" or "__" in a file, so that I can find out how many bold and italic words there are.
#!/bin/bash
if [ -z "$1" ]; then
echo "No input file specified."
else
ls $1 > /dev/null 2> /dev/null &&
echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "File $1 does not exist."
fi
Example input file:
**This is bold and _italic_** text.
Expected output:
Bold words: 5
Italic words: 1
Bold and italic words: 1
A few assumptions:
- Bold uses __, italic uses * (even though bold might also be ** and italic _)
- There are no stray _ or * characters, or lists with leading *, that throw our count off
Now, to count bold words, we can use
grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l
This looks for anything between two pairs of __. I used the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, something like __bold__ not bold __bold__ would be just one match. -o returns just the matches.
The second grep matches the words: any sequence of one or more non-space characters; and wc -l counts the lines of output.
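To make the difference between greedy and non-greedy matching concrete, here is a minimal sketch. It assumes GNU grep (needed for -P), and the sample line and the variable names greedy and lazy are only illustrative.

```shell
#!/bin/bash
# Hypothetical sample line: two bold spans with plain text in between.
line='__bold__ not bold __bold__'

# Greedy .* runs from the first __ to the last __: a single match.
greedy=$(printf '%s\n' "$line" | grep -Po '__.*__' | wc -l)

# Non-greedy .*? stops at the nearest closing __: two matches.
lazy=$(printf '%s\n' "$line" | grep -Po '__.*?__' | wc -l)

echo "greedy matches: $greedy"
echo "non-greedy matches: $lazy"
```

With GNU grep, the greedy pattern should report one match and the non-greedy pattern two.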
The same works for italics:
grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l
To combine these (for bold and italic), the command lists have to be combined. For italic inside bold:
grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l
and bold inside italic:
grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
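As a quick check of the two nested pipelines, the following sketch runs them on two made-up lines that already follow the assumptions (bold is __, italic is *). The sample text and the variable names are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical input, already using __ for bold and * for italic.
sample='And __bold words with *italics* inside__
And *italics with a __bold__ word inside*'

# Italic words nested inside bold spans.
italic_in_bold=$(printf '%s\n' "$sample" |
    grep -Po '__.*?__' | grep -Po '\*.*?\*' |
    grep -o '[^[:space:]]\+' | wc -l)

# Bold words nested inside italic spans.
bold_in_italic=$(printf '%s\n' "$sample" |
    grep -Po '\*.*?\*' | grep -Po '__.*?__' |
    grep -o '[^[:space:]]\+' | wc -l)

echo "Bold and italic words: $((italic_in_bold + bold_in_italic))"
```

Each pipeline should find one word here, so the sum is 2.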
Now, a real markdown file might have a few extra surprises (see "Assumptions"):
* List item with **bold word**
Line with **bold words and \* an escaped asterisk**
Here is an *italicized* word
And *italics with a **bold** word inside*
And **bold words with *italics* inside**
    Code can have tons of *, ** and _ and we want to ignore them all
Also `inline code can have * and ** and _ to be ignored`, right?
which would render as
- List item with bold word
Line with bold words and * an escaped asterisk
Here is an italicized word
And italics with a bold word inside
And bold words with italics inside
Code can have tons of *, ** and _ and we want to ignore them all
Also inline code can have * and ** and _ to be ignored, right?
One approach to cleaning up something like this would be a sed script:
/^$/d # Delete empty lines
/^    /d # Delete code lines (start with four spaces)
s/`[^`]*`//g # Remove inline code
/^\* /s/^\* (.*)/\1/ # Remove asterisk from list items
s/\\\*//g # Remove escaped asterisks
s/\\_//g # Remove escaped underscores
s/`[^`]*`//g # Remove inline code
s/\*\*/__/g # Make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g # Make sure italics use asterisks
with the following result:
$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and an escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?
Ready for consumption by the commands from the first section.
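To see what the last two substitutions do in isolation, here is a small sketch that normalizes the delimiters on a single made-up line. It uses sed -E, which in GNU sed is equivalent to the -r used above; the sample line and the variable name normalized are assumptions.

```shell
#!/bin/bash
# Hypothetical line mixing both delimiter styles.
line='Some **bold** and _italic_ and __bold__ and *italic* text'

normalized=$(printf '%s\n' "$line" | sed -E '
    s/\*\*/__/g                    # ** becomes __
    s/(^|[^_])_([^_]|$)/\1*\2/g    # a lone _ becomes *
')
echo "$normalized"
```

Note that the second rule only converts an underscore with no underscore on either side, so a closing italic underscore that directly touches a bold delimiter, as in _italic_**, is left unconverted.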
Everything together in a script that takes the markdown file name as an argument:
#!/bin/bash
fname="$1"
tempfile="$(mktemp)"
sed -r '
/^$/d
/^    /d
s/`[^`]*`//g
/^\* /s/^\* (.*)/\1/
s/\\\*//g
s/\\_//g
s/`[^`]*`//g
s/\*\*/__/g
s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"
bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
both=$((
$(grep -Po '__.*?__' "$tempfile" |
grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)
+
$(grep -Po '\*.*?\*' "$tempfile" |
grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))
rm -f "$tempfile"
echo "Bold words: $bold"
echo "Italic words: $italic"
echo "Bold and italic words: $both"
Which can be used like this:
$ ./wordcount infile.md
Bold words: 14
Italic words: 8
Bold and italic words: 2
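The word-counting tail of each pipeline is repeated several times in the script. One possible way to reduce the repetition is a small helper function; count_words below is a hypothetical name, not part of the original script, and the input is a made-up, already cleaned sample.

```shell
#!/bin/bash
# count_words is a hypothetical helper: it reads matched spans on stdin
# and prints how many whitespace-separated words they contain.
count_words() {
    grep -o '[^[:space:]]\+' | wc -l
}

# Hypothetical pre-cleaned input (bold is __, italic is *).
cleaned='List item with __bold word__
Here is an *italicized* word'

bold=$(printf '%s\n' "$cleaned" | grep -Po '__.*?__' | count_words)
italic=$(printf '%s\n' "$cleaned" | grep -Po '\*.*?\*' | count_words)

echo "Bold words: $bold"
echo "Italic words: $italic"
```

On this sample the helper should count two bold words and one italic word.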
My solution is to replace ** with something else to make the problem easier.
I chose ~; you can replace it with something else.
$ cat test
**bold**
*italic*
**bold**
$ sed 's/\*\*/~/g' test
~bold~
*italic*
~bold~
Now, for the bold ones, count the number of ~ characters and divide the result by 2. The sed command above only prints its result and leaves the file unchanged, so pipe its output into tr:
$ sed 's/\*\*/~/g' test | tr -d -c '~'
~~~~
$ sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c
4
Now divide it by 2; first save the output in a variable.
$ bold=$(sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c)
$ expr $bold / 2
2
Do something similar for the italic ones.
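For the italic case, a sketch along the same lines could look like this. It assumes the same three-line sample file, recreated here in a temporary file, and it replaces ** first so that only the italic asterisks remain as *. The variable names are illustrative.

```shell
#!/bin/bash
# Hypothetical copy of the sample file from above.
tmp=$(mktemp)
printf '%s\n' '**bold**' '*italic*' '**bold**' > "$tmp"

# Replace ** with ~ first, so only italic asterisks remain as *.
stars=$(sed 's/\*\*/~/g' "$tmp" | tr -d -c '*' | wc -c)
rm -f "$tmp"

# Each italic span has one opening and one closing asterisk.
italic=$((stars / 2))
echo "Italic spans: $italic"
```

Like the bold version, this counts delimited spans rather than individual words, and it does not account for escaped asterisks or list markers.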