How can I get a pandoc Lua filter to avoid counting words in code blocks that use this pattern inside an R Markdown file?

This is a follow-up question to this post. What I want to achieve is to avoid counting words in headers and inside code blocks that use this pattern:

```{r label-name}
all code words not to be counted.
```

Rather than this pattern:

```
{r label-name}
all code words not to be counted.
```

When I use the latter pattern I lose font-lock fontification in the R Markdown buffer in Emacs, so I always use the first one.

Consider this MWE:

MWE (MWE-wordcount.Rmd)

# Results {-}

## Topic 1 {-}

This is just a random text with a citation in markdown \@ref(fig:pca-scree)).
Below is a code block.

```{r pca-scree, echo = FALSE, fig.align = "left", out.width = "80%", fig.cap = "Scree plot with parallel analysis using simulated data of 100 iterations (red line) suggests retaining only the first 2 components. Observed dimensions with their eigenvalues are shown in green."}
    
   knitr::include_graphics("./plots/PCA_scree_parallel_analysis.png")
```

## Topic 2 {-}

<!-- todo: a comment that needs to be avoided by word count hopefully-->

The result should be only 17 words, not counting words in code blocks, comments, or Markdown markup (such as the headers).
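
As a rough cross-check of that target number, here is a minimal shell sketch that does not go through pandoc at all. It is only a plain-text approximation, not the filter the question asks about: it assumes chunk fences start at the beginning of a line, headers are ATX-style (`#` followed by a space), and HTML comments fit on a single line. The function name `count_body_words` is hypothetical.

```shell
# Hypothetical helper: approximate body word count of an R Markdown file.
# Assumptions: fences start at column 0, ATX headers, single-line HTML comments.
count_body_words() {
  awk '
    /^```/   { in_chunk = !in_chunk; next }   # a fence line opens or closes a chunk
    in_chunk { next }                         # skip everything inside a chunk
    /^#+ /   { next }                         # skip header lines
    {
      gsub(/<!--.*-->/, "")                   # drop single-line HTML comments
      words += NF                             # NF is recomputed after gsub edits the line
    }
    END { print words + 0 }
  ' "$1"
}
```

Run on the MWE above (from `# Results {-}` through the HTML comment), `count_body_words MWE-wordcount.Rmd` should print 17: 12 whitespace-separated tokens in the first sentence and 5 in the second. Unlike the Lua filter, this counts tokens by whitespace only, so it can disagree with pandoc on text containing other markup.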

I followed the method explained here to get pandoc to count the words using a Lua filter. In short, I did these steps:

  1. From the command line:

    mkdir -p ~/.local/share/pandoc/filters

  2. Then created a file there named wordcount.lua with this content:

     -- counts words in a document
     words = 0

     wordcount = {
       Str = function(el)
         -- we don't count a word if it's entirely punctuation:
         if el.text:match("%P") then
           words = words + 1
         end
       end,

       Code = function(el)
         _,n = el.text:gsub("%S+","")
         words = words + n
       end,
     }

     function Pandoc(el)
       -- skip metadata, just count body:
       pandoc.walk_block(pandoc.Div(el.blocks), wordcount)
       print(words .. " words in body")
       os.exit(0)
     end

  3. I put the following elisp code in scratch buffer and evaluated it:

     (defun pandoc-count-words ()
       (interactive)
       (shell-command-on-region
        (point-min) (point-max)
        "pandoc --lua-filter wordcount.lua"))

  4. From inside the MWE Markdown file (MWE-wordcount.Rmd) I ran M-x pandoc-count-words, and I get the count in the minibuffer.

Using the first pattern I get 62 words.

Using the second pattern I get 22 words, which is more reasonable. This method successfully avoids counting words inside a comment.

Questions

  1. How can I get the Lua filter to avoid counting words when I use the first pattern rather than the second?

  2. How can I get the Lua filter to avoid counting words in the headers (lines starting with ##)?

I would also appreciate it if the answer explained how the Lua code works.

This is a fun question; it combines quite a few technologies. The most important one here is R Markdown, and we need to look under the hood to understand what's going on.

One of the first steps in R Markdown processing is to parse the document, find all R code blocks (marked by the {r...} pattern), execute those blocks, and replace the blocks with the evaluation results. The modified input text is then passed to pandoc, which parses it into an abstract syntax tree (AST). That AST can be examined or modified with a filter before pandoc writes the document in the target format.

This is relevant because it is R Markdown, not pandoc, that recognizes input of the form

``` {r ...}
# code
```

as code blocks, while pandoc parses them as inline code that is identical to ` {r...} # code `, i.e., all newlines in the code are ignored. The reason for this lies in pandoc's attribute parsing and the overloading of the ` character in Markdown syntax.¹

This gives us the answer to your first question: we can't! The two code snippets look exactly the same by the time they reach the filter in pandoc's AST; they cannot be distinguished. However, we get proper code blocks with newlines if we run R Markdown's knitr step to execute the code.

So one solution could be to make the wordcount.lua filter part of the R Markdown processing step, but to run the filter only when the COUNT_WORDS environment variable is set. We can do that by adding this snippet to the top of the filter file:

if not os.getenv 'COUNT_WORDS' then
  return {}
end

See the R Markdown cookbook on how to integrate the filter.

I'm leaving out the second question, because this answer is already quite long and that subquestion is worth a separate post.


¹: pandoc would recognize this as a code block if the r was preceded by a dot, as in

``` {.r}
# code
```
