Pandoc + Markdown for Conversation Analytic Transcripts

7 Comments / blog / March 6, 2015 / bibtex, geek, howto, Markdown, Pandoc

A while back I wrote a blog post detailing why I chose Pandoc and Markdown to write papers including Jeffersonian Conversation Analytic transcripts. It wasn’t very detailed though, because a full explanation of how to set up a compatible text-based writing workflow was an onerous task – one happily now completed beautifully by Dennis Tenen and Grant Wythoff’s guide to Sustainable Authorship in Plain Text using Pandoc and Markdown.

So, I decided to update this how-to for anyone using Pandoc and Markdown to start including CA style transcriptions quickly and easily.

To go along with this how-to, there is also a set of demo files you can download to try out this approach. However, before you do that you probably want to get a pandoc + markdown setup installed.

The Problem

There are great software tools out there for CA-style transcription, my favourite is CLAN for a number of reasons. However, I can’t find any resources online about how to publish CA-style transcriptions without being forced through some eye-bleeding LaTeX diddling every time.

Of course I could just use a WYSIWYG text editor like LibreOffice – but now I’ve experienced the power of LaTeX for document preparation and publication, I really can’t see myself going back.

When doing CA it seems particularly important to have transcriptions legibly in the body of the paper and visible during the writing process, because many of the analytical observations come, or get significantly modified at the point of writing about them, double and triple checking assumptions, and cross-referencing with the CA literature while tweaking citations.

The Simplest Solution: Markdown + Pandoc

Markdown is my favourite lightweight markup language, a highly readable format with which you can write a visually pleasing text file, which you can then convert into almost any other format – HTML, OpenOffice, LaTeX, RTF, etc. using Pandoc. There are many similar systems, notably reStructuredText and Textile, all of which you can use to write your text file, and other conversion tools/toolsets, but in my experience, Markdown and Pandoc are the most useful combination in an academic context ¹.

There are lots of great things about markdown:

Just edit simple text files – no weird file formats to get corrupted or mangled.
Less verbose and complicated-looking than LaTeX.
Small files are easy to share/collaborate on with others (everyone gets to use their favourite editor).
There are some great pandoc plugins for my favourite text editor vim.

However, the best thing is that, used along with the XeTeX typesetting engine, it solves the problem with CA transcriptions being unreadable in LaTeX/pdflatex.

For example, in my first CA-laced paper, my transcriptions looked like this in my LaTeX source:

\begin{table*}[!ht]
\hfill{}
\texttt{
  \begin{tabular}{@{}p{2mm}p{2mm}p{150mm}@{}}
 & D: &  0:h (I k-)= \\
 & A: &  =Dz  that  make any sense  to  you?  \\
 & C: &  Mn mh. I don' even know who she is.  \\
 & A: &  She's that's, the Sister Kerrida, \hspace{.3mm} who, \\
 & D: &  \hspace{76mm}\raisebox{0pt}[0pt][0pt]{ \raisebox{2.5mm}{[}}'hhh  \\
 & D: &  Oh \underline{that's} the one you to:ld me you bou:ght.= \\
 & C: &  \hspace{2mm}\raisebox{0pt}[0pt][0pt]{ \raisebox{2.5mm}{[}} Oh-- \hspace{42mm}\raisebox{0pt}[0pt][0pt]{             \raisebox{2mm}{\lceil}} \\
 & A: &  \hspace{60.2mm}\raisebox{0pt}[0pt][0pt]{ \raisebox{3.1mm}{\lfloor}}\underline{Ye:h} \\
  \end{tabular}
\hfill{}
}
\caption{ Evaluation of a new artwork from (JS:I. -1) \cite[p.78]{Pomerantz1984} .}
\label{ohprefix}
\end{table*}

which renders this:

A simpler way to do this in Markdown (with none of the fancy stuff) is to use Markdown’s ‘verbatim’ environment – you do this by putting four spaces or one tab before each line in your transcript (including blank lines). Here’s the messy LaTeX above re-done in simple Markdown.

(3)

STE:        U̲o̲:̲h̲ oh ugly things [he paints.] 
KAT:                            [Really?] 
        (3.0) 
STE:        (°I think s[o-])°
KAT:                   [So you wouldn't sell any?] 
STE:        U̲u̲h̲ n[o] 
KAT:              [No?] 
        (1.7)

which renders like this:

Overall, I think the Markdown version represents a significant improvement in legibility while writing. I think it might be possible to do the same in LaTeX using the {verbatim} environment, but the fact that Markdown also lets me concentrate on writing without throwing errors or refusing to compile lets me spend longer on the writing than on endless text-fiddling procrastination.

When it comes to rendering, I feed my markdown file to pandoc:

$ pandoc --latex-engine xelatex --bibliography library.bib --csl default.csl -N -o  paper_title.pdf paper_title.markdown

If you want to use the nicely stretched ceiling characters for overlap marking, or the raised full stop / bullet operator for inbreaths, you can do so, but you’ll need to run Pandoc (see below) referencing a font that has those characters. For example, you could use CAfont and add:

--variable monofont=CAfont

to the pandoc command above.

The default.csl file is a citation style language file to customise how bibliographical references are rendered.

If you’re only adding a few examples to your document, this will probably work fine. If you are writing a thesis or a longer document – read on.

For Longer Texts: Markdown + Pandoc + LaTeX

The above approach may work for writing a short paper with one or two examples, for a thesis or a longer piece where you may have many examples, you’re going to have to take this a step further and use some LaTeX within your Markdown document. The bad news, you will have to use LaTeX, templates and some code to deal with:

Example Layout: you probably want your examples to be graphically separated from your text in a consistent way.
Document layout: you may need to make some stylistic tweaks to how your document prints out.
Referencing: you will want to use labels for your examples so you can cross-reference them automatically within the text and not have to re-label them every time you make a change.
Audio/video links: you may want to include links to audio/video examples in your files.

The good news: your CA transcript examples will still be easy to read/edit, and actually this is all pretty straight forward once you’ve got it set up.

What you will need

First, you need a working Pandoc + Markdown setup installed. You also need a nice monospaced font installed – I use CAfont by the amazing CHILDES project.

I’ve made a downloadable archive of the three files I use every time I create a new document. Download those. There is also a working demo (README.md) and some image files that you can use to edit/test things, or modify them to create your own.

Along with these examples inside the camarkdown_files folder you will find:

template.txt: a LaTeX template that Pandoc uses when it renders PDFs – with macros etc.
apa.csl: a citation style language file describing how I want my APA citations rendered.
margins.sty: a little margins file I canuse to tweak the overall page layout separately (US Letter vs. A4 etc.)

Whenever you start a new document, these three files into the same folder.

A little explanation

Without getting too geeky about it, here’s a little explanation of how I use this setup:

Whenever I convert my Mardown to PDF using Pandoc, I add:

--template template.txt

to the pandoc command to make sure it uses this template. The template is based on the default LaTeX template Pandoc always uses to convert Markdown to PDF via LaTeX, but I’ve added a macro: caextract.

Basically the caextract environment sets the default monospaced font, and (optionally) creates a to an online media file referenced in the Markdown file (see working example below), it also formats the paragraph containing the example as a framed float to divide it from the body of the text, and changes the listings name to ‘Extract’, so references list it as ‘Extract 1’ rather than ‘Figure 1’.

Here’s the relevant bits from the header section of template.txt

    \newcommand{\medialink}[2] { \begin{flushright} \href{#1}{#2}\\ \end{flushright} 
    } 
    $if(highlighting-macros)$
    $highlighting-macros$
    $endif$
    $if(verbatim-in-note)$
    \usepackage{fancyvrb}
    $endif$
    \usepackage{listings}
    \lstnewenvironment{extract}[1][]{
        \renewcommand*{\lstlistingname}{Extract}
        \lstset{frame=single,basicstyle=\small\ttfamily,keepspaces=true,#1}
    }{}

And this bit goes into the main section of the template:

    \usepackage{float}
    \floatstyle{ruled}
    \newfloat{caextract}{htp}{lop}
    \floatname{caextract}{Extract}

A working example

Here is a full example from a paper I’m writing at the moment that you can tweak and play with. It’s all done in simple markdown, using a little bit of LaTeX embedded within the Markdown file to call the macro.

So where I want my extract to appear in my Markdown file, I add:

![Different stopping postures between dancers \label{stopping-postures}](images/stopping-postures.png)

\begin{caextract}[H]
\caption{See https://www.dropbox.com/s/jnpf5pnxcy4dg8m/lexical-features.mov}
\label{lexical-features}
\begin{small}
\begin{verbatim}

1  JIM:   ∙hhh ⌈opps sorry Hh hyeh °hyour head°, ∙HHh Hmhmhmhmhm hehheh
2  TEA:        ⌊YE::AH! KAY >>LET's TRY it AGAIN< FIve, (.) s⌈ix? (.)
3  TEA:   ↓⌈five six se::v⌉en eight? Rock st⌈ep. (.) tri:ple, (.) tri:ple.  ⌉
4  JIM:    ⌊°five six shh°⌋                 ⌊°ep (.) tri:ple, (.) tri:ple.°]⌋
5  TEA:   G O :̲ ̲:̲ ̲O̲ :⌈ : d! L̲o̲v̲e̲l̲y̲⌈̲::.    (.)    ⌉ OKA::Y!
6  JIM:              ⌊O:hhkay:̲:̲? °Hm ↑hmhmhmhmhm°⌋
7          (1.3)
8  TEA:   LETS ROTATE PA:RTNERS!

\end{verbatim}
\medialink{https://www.dropbox.com/s/e960eu94ji7ncn3/lexical-features.mov}{Watch}

\end{small}
\end{caextract}

That should render something like this:

A later paragraph refers to the figure like so:

By contrast, Sara, Paul and Anne - marked in red in figure 
\ref{stopping-postures} - step back, split their weight and 
stop dancing together with the onset of Teacher's 
"\verb|G O :̲ ̲:̲ ̲O̲ : : d!|". Without having space to analyse 
this method, it is worth noting in closing that the regularity 
of these methods and their interactional contingencies are 
shown in the [slow-motion sections of the video](https://www.dropbox.com/s/jnpf5pnxcy4dg8m/lexical-features.mov) 
by how dancers who stop like Jim are all pulled off balance 
by dancers who stop like Paul, Sara and Anne.

It should look something like this:

A few notes on how this works:

The main reason for the macro is to enable cross-referencing. In the Markdown file, within each caextract I use \label{my-label} to label my examples. Then I can reference them anywhere in my Markdown file with something like “See extract \ref{my-label}”.
If you don’t have any media, just leave out the \medialink line.
You can put anything in the \caption section – your example name if you have a set naming schema for your corpus.
Note the neat Markdown trick in the paragraph above: I use “\verb|This comes out verbatim|” for a short inline bit of monospaced text.

Rendering your CA extracts using Pandoc

Finally, making sure you have your csl file (apa.csl), your images, your template.txt file and your margins.sty file all in the same folder with your example (I find that convenient), and making sure you have a nice monospaced font to use (CAfont is great) in place, run something like this:

pandoc --latex-engine xelatex --csl apa.csl --variable monofont=CAfont --variable mainfont=Arial --variable fontsize=12pt -H margins.sty --template template.txt --bibliography /path/to/library.bib -o README.pdf README.md

You can, of course, run this command from the terminal – swapping out the relevant variables as needed, but I use vim-pandoc’s PandocRegisterExecutor function to run this whenever I type the local leader character twice (,,) followed by pdf. See https://github.com/vim-pandoc/vim-pandoc for documentation of that kind of thing.

I’m happy to answer any questions here or on @saul.

Notes:

Not all of these systems support bibliographical references with BibTeX – Markdown + Pandoc does this quite elegantly ↩

7 thoughts on “Pandoc + Markdown for Conversation Analytic Transcripts”

Adam
November 25, 2015 at 7:59 pm

Hi Saul,
Thanks so much for developing these resources and passing them on! I’m currently writing a CA thesis and using the template you’ve provided here. So great. I wonder if you might be able to help with one adjustment I’m trying to make?

I have a section were I present multiple short extracts to show variation in practices used for propositions. (Think: Gail Jefferson, and how she sometimes presented numerous segments at one time) I’m trying to adjust the caextract environment in your template to not put in the lines that frame the extract and the caption. I found that one can change the ‘\floatstyle’ from ‘ruled’ to ‘plaintop’, but this centers the caption. I wonder if you might have any ideas about how I might specify a parameter in the latex call to either a) adjust the plain caption back to the left, or b) remove or replace the lines in some other fashion.

If you don’t happen to have any ideas for a workaround on this one, that’s fine, of course. Happy to share whatever solution I find, if you care for it.

Many thanks.
saul
November 25, 2015 at 8:29 pm

Hey Adam,

You’re so welcome. I’m thrilled that someone else is using this approach.

I understand what you’re trying to do, but I’m not sure I have a solution for you I’m afraid.

I’ve not had this issue – where I’ve had multiple segments to present, I’ve just added them all to one caextract environment, and just numbered them within the verbatim text (1) (2) (3) – which is kind of old-school anyway, then made a combined caption for all of them.

I can have a look at this tomorrow and see what I can do – but for the moment you might want to try the advice in this tex stackexchange:

http://tex.stackexchange.com/questions/91282/algorithm-caption-should-be-on-top-and-and-flushed-left

Let me know if that works – if not – I’ll have a look when I can get to my workstation tomorrow.

And best of luck with the thesis. I’d be really curious to see/hear more about it.

Best,

Saul.
Adam
November 26, 2015 at 4:30 pm

Thanks so much for the prompt reply, Saul!

There might be something helpful in the thread that you linked me to (I actually came across it before I posted here), but I’m very new to laTex and I’m not sure how to incorporate the code into the template you developed. Along these same lines, I found out that you can define a new float style, and I tried to do this with your template but wasn’t sure where I should add the code so that it works with the macro. I experimented with this strategy a bit, trying to alter the framing used by the environment, but pandoc kept throwing me error codes.

I’m trying the multiple-segments-in-a-single solution now. This works, but now I have two remaining issues. 1) change or remove (for floats containing multiple segments only) the “Excerpt #” that prefaces the caption. 2) basic formatting for the introductory lines to each excerpt in the section. Really, the only thing here is to make some text bold so that these lines are consistent with the single excerpts elsewhere in the text.
Another feature that would be nice, but is not necessary, is 3) is there a way to maintain indexing for each separate excerpt?

**Another potential solution:** Pandoc will now keep track of examples if you use the syntax, ‘(@).’ This means there might be a way for us to keep it all in markdown, while still utilizing an reference system for examples. But, I can’t get the ‘four space rule’ for verbatim environments to kick in when coupled with this reference system. One other thing for this is that it would be nice to use a full horizontal rule to break things up a bit, but markdown doesn’t seem to be able to do that as well.

The thesis is drawing from Doug Maynard’s project analyzing the testing and diagnosis of Autism Spectrum Disorder with children. I’m looking at practices for proposing the test as ‘play’ and ‘games’ in service of transitioning into mutual engagement in testing.

I’m likewise very interested in your work on interactional practices for arriving at aesthetic appraisals (is that a fair summary?). I’m still steeped in readings on transitions, games, and directive response sequences right now, but I look forward to reading your work when I’ve worked my way through the queue!

Adam
saul
November 26, 2015 at 11:55 pm

Hi Adam,

Great to know you’re doing such interesting work with Doug. I look forward to reading it!

If you can send me a small example that compiles and demonstrates your original problem, and any related BibTex I can see what I can do with it. I’m no expert but I think I can fiddle with the restylecaption option in template.txt.

However, I’m away from my desk for a few days so won’t be quick with that unless it just works (and it never does) :) In response to your workaround ideas:

Your potential solution won’t work as described I’m afraid, the verbatim environment won’t play nicely with any internal text styling. However, you can still work around this in a few different ways.

> 1) To change or remove (for floats containing multiple segments only) the “Excerpt #” that prefaces the caption.

If you define a second environment copying the definition of caextract but called something else in template.txt, you can change the label preface (check how I defined it as Extract in the template), including just removing it by defining it as whitespace (a tab or space character I think). However, this will start a new reference structure that will begin numbering at 1. I doubt you want that so the best way to deal with this may be to fudge it by just changing the caextract label “Extract” to “Example” or something else that sounds non committal about whether it’s a single or multiple source example.

> 2) basic formatting for the introductory lines to each excerpt in the section. Really, the only thing here is to make some text bold so that these lines are consistent with the single excerpts elsewhere in the text.

You can fudge emphasis and underlines using special ASCII characters (e.g. use the underline character easily available in CLAN to underscore text that you can then copy and paste into the verbatim environment). This won’t look ideal but it will provide emphasis of sorts. Again, you could change the basic definition of caextract to use underline instead of bold in caption headings for consistency.

> 3) Another feature that would be nice, but is not necessary, is 3) is there a way to maintain indexing for each separate excerpt?

The practical work around I’ve used is to just fudge it with a single caption which you can then reference and manually add a subreference to each one like so: “see extract \ref{yourextractlabel}(1) and \ref{yourextractlabel}(2)”

Good luck with this, let me know how you get on – you can email me the example markdown/BibTex on saul.albert@eecs.qmul.ac.uk
Hayden
March 5, 2019 at 6:30 am

Hi Saul, thanks so much for this post! You’re really convincing!

I have a question about LaTeX that you may be able to answer… I am thinking of using LaTeX to write up my CA excerpts (which will have three lines: Chinese pinyin, Chinese characters and English translation, the latter two which I have been adding in Word post-CLAN). However I want to keep using Word for the bulk of my writing because my supervisors don’t use LaTeX (and I feel Word is good for commenting/track changing?) and some of the good journals I want to try and submit to (e.g. ROLSI) only accept Word doc submissions from what I can see. As someone more familiar with LaTeX though, please let me know if I’m wrong – I think it would be better to get everything in LaTeX in the end!

My question is this: is it possible (or worthwhile) using LaTeX just to format and gloss data excerpts initially produced in CLAN (of which I will probably only have 10-20 long ones of in the whole thesis) and then paste back into Word using the LaTeX source code (I’m not sure how I would do this but I am guessing it’s somehow possible). My concern is that Word would muddle up the original LaTeX formatting and so I would have to end up playing around with it in Word even after producing the excerpts in LaTeX!

Looking forward to your reply,
Hayden
saul
March 5, 2019 at 10:11 am

Hi Hayden,

Unfortunately I’ve never found a good solution to collaborating on a LaTeX document with people who are committed to using track changes in Word (which are, I accept, very useful). I have used https://www.overleaf.com/ successfully with someone who wasn’t very used to LaTeX – but they were themselves very advanced programmers and found it easy to learn. My experience with trying to force non LaTeX users to use it for the sake of a collaboration – or swapping back and forth using PanDoc between Word and LaTeX versions is that you get lots of confusion and friction… it’s just not worth it. Especially if it’s your supervisor, I’d recommend going with whatever works for them – and adopting their workflow. When it comes to writing up a longer manuscript after each chapter has been signed off by your supervisors, that would be the time to use a LaTeX workflow.

The important thing is to keep your transcripts in CLAN (using Unicode underlines etc.) so that you can copy/paste them into Word or Scrivener or a LaTeX document at a later stage.

All the best,

Saul.
Hayden
March 7, 2019 at 11:47 pm

Hi Saul

That’s really helpful advice – thanks so much :).

Hayden