User:Riviera/Granular document production with free software

From XPUB & Lens-Based wiki

This wiki post details how I am harmonising the shell, Pandoc and ConTeXt to design typeset PDFs.

Through the implementation of metadata files and extensible templates, Pandoc offers a streamlined means of writing ConTeXt documents. The following .yml file:

title: Some Document

can be passed to Pandoc for use with a ConTeXt template. For example, the title of a PDF document is set in ConTeXt with the \setupinteraction command. Pandoc's template includes this command and inserts the value of title from the .yml file. Furthermore, it does this conditionally: only when title is present in the YAML metadata passed to Pandoc. There are more complex options than title alone. Here’s another example:

layout: |
  backspace=0.15\paperwidth,
  width=0.5\paperwidth,
  rightmargin=0.3\paperwidth,
  topspace=0.05\paperheight,
  height=0.9\paperheight

In this scenario, the layout block is mapped to the content of the \setuplayout ConTeXt command. The command appears in the Pandoc template as follows:

$if(layout)$
\setuplayout[$for(layout)$$layout$$sep$,$endfor$]
$endif$

This sort of interface is very convenient for designing ConTeXt documents. However, Pandoc overwrites existing output files by default, so a level of caution is needed when running it. Here is where shell scripts and filesystem design come into play.
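As a small illustration of that caution, here is a minimal sketch of a wrapper that passes a metadata file to Pandoc but declines to replace an output file that already exists. The function name and the file names in the usage comment are hypothetical; the Pandoc options themselves (--standalone, --metadata-file, --output) are standard ones. Without --template, Pandoc falls back to its default ConTeXt template.

```shell
#!/usr/bin/env bash

# Hypothetical wrapper: render one markdown file to ConTeXt using a YAML
# metadata file, but refuse to replace an output file that already exists.
render_with_metadata() {
    local metadata_file="$1" input_file="$2" output_file="$3"
    if [ -e "${output_file}" ]; then
        echo "refusing to overwrite ${output_file}" >&2
        return 1
    fi
    pandoc --from markdown --to context \
           --standalone \
           --metadata-file="${metadata_file}" \
           --output="${output_file}" \
           "${input_file}"
}

# Usage (placeholder file names):
#   render_with_metadata metadata.yml document.md document.tex
```

Deleting or renaming the old output becomes a deliberate step, which is the point of the guard.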

${project_dir}

The diagram below illustrates the directory structure for a project with a Pandoc-based, markdown-to-context workflow. It could be adapted for different input and output formats.

.
├── bib/
├── img/ -> /home/riviera/Pictures/project/
├── markdown/
├── script
└── tex/
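A skeleton like this could be created by hand or with a few lines of shell. The following is only a sketch: the function name scaffold_project is hypothetical, and the location of the image directory is passed in as an argument rather than assumed.

```shell
#!/usr/bin/env bash

# Hypothetical helper: create the project skeleton shown in the diagram.
# $1 is the project directory, $2 the existing directory that img links to.
scaffold_project() {
    local project_dir="$1" image_source="$2"
    mkdir -p "${project_dir}/bib" "${project_dir}/markdown" "${project_dir}/tex"
    # img is a symbolic link to a directory kept outside the project;
    # leave it alone if something is already there.
    [ -L "${project_dir}/img" ] || [ -e "${project_dir}/img" ] \
        || ln -s "${image_source}" "${project_dir}/img"
    # script is a single file at the top level of the project.
    touch "${project_dir}/script"
}

# Usage (placeholder paths):
#   scaffold_project "${HOME}/project" "${HOME}/Pictures/project"
```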

${project_dir}/bib/
The bib directory contains two files.

.
├── project.bib -> /home/riviera/Bibliography/project.bib
└── project.csl -> /home/riviera/Bibliography/csl/OU-harvard-annotated.csl

These files are symbolic links to a BibTeX file and a Citation Style Language (CSL) file. I keep the BibTeX file up to date automatically with Zotero. CSL files are XML documents which describe how citations and bibliography entries should appear in the text. Pandoc uses Citeproc to format bibliographic data. In markdown documents, for example, Pandoc will recognise [@smith2007] as a citation of the corresponding item in the BibTeX file. The bibliography and CSL files can be specified in the .yml metadata file. This allows for a flexible design. For example, if a metadata file were provided for each text, a bibliography could appear at the end of each section of a book.
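To make the metadata route concrete, here is a minimal sketch that prints the two relevant YAML fields. bibliography and csl are metadata fields Pandoc reads when --citeproc is given; the function name and the paths in the usage comment are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the YAML fields that point Pandoc's citeproc
# at a bibliography file and a CSL style. Output goes to stdout so the
# caller decides where (and whether) to write it.
print_bibliography_metadata() {
    local bibliography_file="$1" csl_file="$2"
    printf 'bibliography: "%s"\n' "${bibliography_file}"
    printf 'csl: "%s"\n' "${csl_file}"
}

# Usage (placeholder paths):
#   print_bibliography_metadata bib/project.bib bib/project.csl >> metadata.yml
#   pandoc --citeproc --metadata-file=metadata.yml -t context text.md
```

Keeping these fields in a per-text metadata file is what would allow each text to carry its own bibliography.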

${project_dir}/markdown/

Depending on the project there are different ways to structure the markdown directory to get the most out of Pandoc. In speculative structure #1 each text is called text.md and appears in a dedicated directory. This is to avoid naming conflicts and streamline shell scripting. I will return to shell scripting in more detail later.

Speculative Structure #1

.
├── all-texts-below-concatenated-in-one-document.md
├── text-one
│   └── text.md
├── text-two
│   └── text.md
└── text-three
    └── text.md

It should be noted that this is not the structure I am implementing in the project. Rather, I am implementing a second structure where each text has its own metadata.yml file associated with it. This is useful for implementing various section-based typographical features. For example, running headers corresponding to the title of the section and the names of the authors can be automatically generated in this way. Moreover, I suspect the footnote counter could be more readily reset at the end of each section instead of running continuously throughout the document. Thirdly, section-based bibliographies can be implemented in this way. To achieve this, the structure is complemented by a section.context Pandoc template.

Speculative Structure #2

.
├── section.context
├── text-one
│   ├── metadata.yml
│   └── text.md
├── text-two
│   ├── metadata.yml
│   └── text.md
└── text-three
    ├── metadata.yml
    └── text.md
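Because this structure relies on every text directory holding both files, a quick check before a batch run can save some confusion. This is a sketch only, with a hypothetical function name; it reports any directory that lacks metadata.yml or text.md.

```shell
#!/usr/bin/env bash

# Hypothetical check: report text directories that are missing either
# metadata.yml or text.md. Returns 0 when every directory is complete.
list_incomplete_texts() {
    local markdown_dir="$1" dir incomplete=0
    for dir in "${markdown_dir}"/*/; do
        [ -d "${dir}" ] || continue   # no subdirectories at all
        if [ ! -f "${dir}metadata.yml" ] || [ ! -f "${dir}text.md" ]; then
            echo "incomplete: ${dir%/}" >&2
            incomplete=1
        fi
    done
    return "${incomplete}"
}

# Usage (placeholder path):
#   list_incomplete_texts "${project_dir}/markdown" || exit 1
```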

metadata.yml

Let’s take the example of text one; here’s what the metadata.yml file could look like:

lang: "en"
title: &title A document
subtitle: &subtitle with a subtitle
author: &au
  - &rt Riviera Taylor
date: &date 2012
headertext:
  - *title
  - *au
synopsis: |
  Here is some text

The & expressions in the YAML file are anchors whilst the * expressions are aliases: ways of referencing the anchors. This data is passed to Pandoc, but more significantly a template can be written to influence what Pandoc does with this data.

section.context

I alluded to the Pandoc template file previously; I will not reproduce the entire document here. Instead I highlight how the template handles one field of the metadata file, synopsis, to illustrate how extensible Pandoc templates are:

$if(synopsis)$
\midaligned{\it Synopsis}
\startnarrower[2*middle]
$synopsis$
\stopnarrower
\blank[big]
$endif$

This code block implements a conditional statement. When working with this template file, Pandoc will check if there is a synopsis block in the YAML metadata. If so, it inserts specific ConTeXt commands into the output file. These commands print the word Synopsis centred and in italics, narrow the margins and print the contents of the synopsis block.

${project_dir}/tex/
project.mkxl looks like this:

\environment{../env/project-env.mkxl}
\starttext
\input{text-one.mkxl}
\pagebreak
\input{text-two.mkxl}
\pagebreak
\input{text-three.mkxl}
\pagebreak
\input{colophon.mkxl}
\stoptext

First, ConTeXt is instructed to import a variety of environment settings: commands pertaining to layout, typography and PDF metadata. Then, each section of the text is typeset. The aim here is to write the least amount of ConTeXt possible by designing a DRY, harmonious and functional structure.
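In the same DRY spirit, a master file of this shape could be generated rather than typed. The sketch below is one possible way to do that and is not part of the project as described: the function name is hypothetical, and the titles are passed as arguments so that their order is chosen explicitly.

```shell
#!/usr/bin/env bash

# Hypothetical generator: print a master file like project.mkxl for the
# titles given as arguments, in the order given.
print_master_file() {
    local title
    printf '%s\n' '\environment{../env/project-env.mkxl}' '\starttext'
    for title in "$@"; do
        printf '\\input{%s.mkxl}\n' "${title}"
        printf '%s\n' '\pagebreak'
    done
    printf '%s\n' '\input{colophon.mkxl}' '\stoptext'
}

# Usage:
#   print_master_file text-one text-two text-three
```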

Executing Pandoc via Shell Scripting

I’m using a shell script to run batch jobs. First, some variables need to be initialised:

#!/usr/bin/env bash

project_name="reader"
draft=3
project_dir="/home/riviera/XPUB/trimester-3/${project_name}/draft_${draft}"
bibliography_file="${project_dir}/bib/${project_name}.bib"
csl_file="${project_dir}/bib/${project_name}.csl"
input_format="markdown"
output_format="context"

Then functions are defined which perform particular tasks, such as running Pandoc on each markdown file to generate the ConTeXt source for each text. It’s advisable to replace hard-coded values with variables when writing scripts. This enhances flexibility, as the value of a variable can be changed in one place and referred to consistently throughout the script.

generate_context_for_each_text() {
    # Work from the markdown directory so that relative paths resolve.
    pushd "${project_dir}/${input_format}/" 1> /dev/null || return 1
    local dir title
    for dir in */; do
        [ -d "${dir}" ] || continue # skip if there are no subdirectories
        title="${dir%/}" # remove trailing /
        mkdir -p "${project_dir}/tex/raw/${title}/"
        pandoc -f "${input_format}" -t "${output_format}" \
               --metadata-file="${title}/metadata.yml" \
               --template=./section.context \
               --bibliography="${bibliography_file}" \
               --csl="${csl_file}" \
               --citeproc \
               -o "${project_dir}/tex/raw/${title}/tex.mkxl" \
               "${title}/text.md"
        # -o ${project_dir}/${output_format}/raw/${title}.mkxl
    done
    popd 1> /dev/null
}

$1

As this script only contains one function the function declaration is somewhat redundant.
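Should the script grow to hold several functions, the bare $1 could be replaced by a small dispatcher that checks the argument first. The following is a sketch under that assumption; run_task is a hypothetical name, and declare -F is the bash builtin used to test whether a function of the given name exists.

```shell
#!/usr/bin/env bash

# Hypothetical dispatcher: run the function named by the first argument,
# passing any remaining arguments on to it.
run_task() {
    local task="${1:-}"
    if [ -z "${task}" ] || ! declare -F "${task}" > /dev/null; then
        echo "usage: ${0##*/} <function-name> [arguments]" >&2
        return 64
    fi
    shift
    "${task}" "$@"
}

# At the bottom of the script, in place of the bare $1:
#   run_task "$@"
```

An unknown or missing function name then produces a usage message instead of an attempt to run an arbitrary command.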