How to count the number of bold and italic words in a Markdown file

Question

I've read that bold and italic words can be represented in Markdown by **bold_text** and *italic_text*, respectively. To make text both bold and italic at once, you can wrap it in four asterisks (two on each side) for bold and two underscores (one on each side) for italic, or vice versa.

I would like to write a bash script which determines the number of bold words and italic words. I guess that this comes down to counting the number of double asterisks, single asterisks, double underscores and single underscores. My question is how to count the occurrences of specific strings like "**" or "__" in a file, so I can know how many bold and italic words there are.

#!/bin/bash

if [ -z "$1" ]; then
    echo "No input file specified."
else 
    ls $1 > /dev/null 2> /dev/null && 
    echo $(cat $1 | grep -o '\<**>\' | wc -c) || echo "File $1 does not exist."
fi

Example input file:

**This is bold and _italic_** text.

Expected output:

Bold words: 5
Italic words: 1
Bold and italic words: 1

Answer 1

Simple approach

A few assumptions:

Bold uses __, italic uses * (even though they might also be ** and _)
No "funny stuff" such as (inline) code containing these characters, escaped _ or *, or lists with a leading * that would throw our count off

Now, to count bold words, we can use

grep -Po '__.*?__' infile.md | grep -o '[^[:space:]]\+' | wc -l

This looks for anything between two pairs of __. I used the Perl regex engine (-P) to enable non-greedy matching (.*?); otherwise, something like __bold__ not bold __bold__ would be just one match. -o returns just the matches.

The second grep matches the words: any sequence of one or more non-space characters, each printed on its own line; and wc -l counts the lines of output.
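To see why the non-greedy quantifier matters, here is a minimal sketch. It assumes GNU grep (for -P); the function name count_bold_words and the sample line are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: count the words inside __...__ spans read from stdin.
# Assumes GNU grep, since -P (Perl regex) is not available in every grep.
count_bold_words() {
    grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
}

# Non-greedy: two separate matches, "__bold__" and "__also bold__",
# so three words are counted.
printf '%s\n' '__bold__ not bold __also bold__' | count_bold_words

# Greedy, for comparison: a single match spanning the whole line, so the
# two words in between are counted as well (five in total).
printf '%s\n' '__bold__ not bold __also bold__' |
    grep -o '__.*__' | grep -o '[^[:space:]]\+' | wc -l
```

With GNU tools the first pipeline should print 3 and the second 5.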

The same works for italics:

grep -Po '\*.*?\*' infile.md | grep -o '[^[:space:]]\+' | wc -l

To count words that are both bold and italic, the two pipelines have to be chained. For italic inside bold:

grep -Po '__.*?__' infile.md | grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l

and bold inside italic:

grep -Po '\*.*?\*' infile.md | grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l
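The two pipelines above differ only in the order of the patterns, so they can be folded into one helper. This is just a sketch, assuming GNU grep; count_nested and the two sample lines are illustrative, not part of the original commands.

```shell
#!/bin/bash
# Hypothetical helper: count words matched by the inner pattern inside
# spans matched by the outer pattern. Assumes GNU grep for -P.
count_nested() {
    local outer="$1" inner="$2" file="$3"
    grep -Po "$outer" "$file" | grep -Po "$inner" |
        grep -o '[^[:space:]]\+' | wc -l
}

bold='__.*?__'
italic='\*.*?\*'

sample=$(mktemp)
printf '%s\n' \
    'And *italics with a __bold__ word inside*' \
    'And __bold words with *italics* inside__' > "$sample"

in_bold=$(count_nested "$bold" "$italic" "$sample")    # italic inside bold
in_italic=$(count_nested "$italic" "$bold" "$sample")  # bold inside italic
echo "Bold and italic words: $((in_bold + in_italic))"

rm -f "$sample"
```

For the two sample lines, each direction contributes one word, so the sum should be 2.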

Cleaning up a more realistic file

Now, a real Markdown file might have a few extra surprises (see the assumptions above):

* List item with **bold word**

Line with **bold words and \* an escaped asterisk**

Here is an *italicized* word

And *italics with a **bold** word inside*

And **bold words with *italics* inside**

    Code can have tons of *, ** and _ and we want to ignore them all

Also `inline code can have * and ** and _ to be ignored`, right?

which would render as

List item with bold word

Line with bold words and * an escaped asterisk

Here is an italicized word

And italics with a bold word inside

And bold words with italics inside

    Code can have tons of *, ** and _ and we want to ignore them all

Also inline code can have * and ** and _ to be ignored, right?

One approach to cleaning up something like this would be a sed script:

/^$/d                           # Delete empty lines
/^    /d                        # Delete code lines (start with four spaces)
s/`[^`]*`//g                    # Remove inline code
/^\* /s/^\* (.*)/\1/            # Remove asterisk from list items
s/\\\*//g                       # Remove escaped asterisks
s/\\_//g                        # Remove escaped underscores
s/\*\*/__/g                     # Make sure bold uses underscores
s/(^|[^_])_([^_]|$)/\1\*\2/g    # Make sure italics use asterisks

with the following result:

$ sed -rf md.sed infile.md
List item with __bold word__
Line with __bold words and  an escaped asterisk__
Here is an *italicized* word
And *italics with a __bold__ word inside*
And __bold words with *italics* inside__
Also , right?

The output is now ready for consumption by the commands from the first section.

Putting it all together

Everything together in a script that takes the Markdown file name as an argument:

#!/bin/bash

fname="$1"
tempfile="$(mktemp)"

sed -r '
    /^$/d
    /^    /d
    s/`[^`]*`//g
    /^\* /s/^\* (.*)/\1/
    s/\\\*//g
    s/\\_//g
    s/\*\*/__/g
    s/(^|[^_])_([^_]|$)/\1\*\2/g
' "$fname" > "$tempfile"

bold=$(grep -Po '__.*?__' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
italic=$(grep -Po '\*.*?\*' "$tempfile" | grep -o '[^[:space:]]\+' | wc -l)
both=$((
    $(grep -Po '__.*?__' "$tempfile" |
        grep -Po '\*.*?\*' | grep -o '[^[:space:]]\+' | wc -l)
    +
    $(grep -Po '\*.*?\*' "$tempfile" |
        grep -Po '__.*?__' | grep -o '[^[:space:]]\+' | wc -l)
))

rm -f "$tempfile"

echo "Bold words: $bold"
echo "Italic words: $italic"
echo "Bold and italic words: $both"

Which can be used like this:

$ ./wordcount infile.md
Bold words: 14
Italic words: 8
Bold and italic words: 2

Shortcomings

This can be tripped up by words containing underscores. Some Markdown flavours ignore these and assume they're part of the word.
I'm sure I missed a few edge cases in the cleanup.
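The first shortcoming is easy to reproduce. The sketch below applies only the "italics use asterisks" rule from the cleanup script and assumes GNU sed (for -r); the function name and the sample sentence are made up.

```shell
#!/bin/bash
# Apply only the rule that turns single underscores into asterisks.
# Assumes GNU sed for -r (extended regular expressions).
to_asterisks() {
    sed -r 's/(^|[^_])_([^_]|$)/\1\*\2/g'
}

# The underscores in snake_case_name are not emphasis markers, but they
# are converted anyway, so the italics count would pick up a false match.
printf '%s\n' 'Call snake_case_name here' | to_asterisks
```

This should print "Call snake*case*name here", which the italics pipeline would then count as one italic word.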

Answer 2

My solution is to replace the ** with some other character to make the problem easier.
I chose ~, but you can replace it with something else.

$ cat test
**bold**
*italic*
**bold**

$ sed 's/\*\*/~/g' test
~bold~
*italic*
~bold~

Now, for the bold ones, count the number of ~ in the substituted output; each bold span contributes two of them.

$ sed 's/\*\*/~/g' test | tr -d -c '~'
~~~~
$ sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c
4

Now divide it by 2. First, save the output in a variable.

$ bold=$(sed 's/\*\*/~/g' test | tr -d -c '~' | wc -c)
$ expr $bold / 2
2

Do the same for the italics. Note that this counts bold spans rather than individual words.
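The same count can be sketched without the substitution step: grep -o prints every match on its own line, so wc -l gives the number of occurrences of a string. The helper name count_spans is hypothetical, and -F is used so that '**' is matched literally rather than as a regex.

```shell
#!/bin/bash
# Hypothetical helper: count delimited spans by counting occurrences of a
# fixed delimiter string and halving the result.
count_spans() {
    local delim="$1" file="$2"
    local n
    # -o prints each match on its own line; -F disables regex
    # interpretation, so '**' is taken as a literal string.
    n=$(grep -oF -- "$delim" "$file" | wc -l)
    echo $((n / 2))
}

sample=$(mktemp)
printf '%s\n' '**bold**' '*italic*' '**bold**' > "$sample"
count_spans '**' "$sample"   # number of bold spans
rm -f "$sample"
```

For the three sample lines this should print 2. Counting a single * the same way would also pick up the asterisks that belong to **, which is why the bold markers are replaced with another character first.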

Answer 1: 1, 2016-01-17 02:45:55
Answer 2: 0, 2016-01-16 18:25:19