Converting LaTeX to HTML: technical notes

I just posted on the OLP that forall x: Calgary now has an HTML version for reading online. Here are some technical notes in case that’s helpful for anyone.

First, LaTeX to HTML conversion has long been tricky. No solution is perfect. There are basically three workable approaches:

  • Assume that your LaTeX code is basically just LaTeX-flavored markup, and use a converter that reads such “vanilla” LaTeX, like pandoc. If your input is simple (or can easily be made simple), that works very well. I use it a lot, e.g., to produce my CV. Pandoc is amazing, but it is intended to be an all-purpose converter between dozens of markup formats, so complex LaTeX projects are beyond its scope. If you are starting from scratch and don’t need to rely on special LaTeX packages, then pandoc would be my tool of choice. (See, e.g., Jonathan Weisberg’s Odds & Ends for a wonderful example: the source is even in Markdown, with just the formulas in plain TeX code.)
  • Use a package that compiles your LaTeX using LaTeX itself but provides added info in the resulting file, then turn that file into HTML. That’s the approach that TeX4HT and lwarp use. It has the advantage that more LaTeX commands and packages are supported, but (afaict) neither produces good mathematical formulas in the final result: either you get images, or you get the source LaTeX code and rely on MathJax to render it. (Caveat: I have not actually tried these!)
  • The solution I used is LaTeXML. It is basically a reimplementation of the LaTeX kernel, but it outputs to XML instead of to DVI. Because it simulates what LaTeX is actually doing with your code, it can (to a large extent) deal with packages and LaTeX programming directly. It natively supports a large number of popular packages and classes, and packages it does not support can be loaded and “compiled” using the --includestyles flag. And because it directly outputs XML, it can compile formulas directly to MathML. (LaTeXML is what ar5iv uses: a project to compile everything on the arXiv to HTML.)

I just ran LaTeXML on the forall x source code and it almost worked. I only had to comment out a few lines: it wasn’t quite happy with some of the layout commands of the memoir class, and a sidewaystable and a rotated \iota made it stumble. But after maybe an hour of trial and error it produced something without a major error, and the result was passable. I was especially impressed by the fact that natural deduction proofs produced using Peter Selinger’s fitch package (which is 25 years old and not even on CTAN) were turned into functioning MathML code and came out looking almost the way they were supposed to.

The standard output of LaTeXML doesn’t have any fancy styling. Without further work it looks like a webpage from the early (1990s) WWW. Fortunately, someone else built a wonderful wrapper around LaTeXML that applies a modified Gitbook style to the result: BookML.

Ok, now I was so close that I didn’t really want to stop at “almost good”. There were a few things that I still wanted to fix (and did):

P.D.’s original LaTeX code for forallx included a few custom environments to typeset lists, arguments, and symbolization keys, and some of these did multiple duty in the text (e.g., both numbered lists and unnumbered arguments were coded as the earg environment). I decided to redo these definitions using the enumitem package, which is supported by LaTeXML. That required a bit of search-and-replace of \begin{earg} and \end{earg} throughout the entire source code. The original example environments needed some extra care: they are numbered lists where we refer to the numbers in the text, and the numbering is consecutive within a chapter. The labels were generated using the original forallx.sty command, like so: \item[\ex{label}]. Those didn’t quite work: the numbers were incremented by 2. At first I wanted to just use enumitem to make a new enumerate list with the option resume, but LaTeXML’s implementation of the enumitem package is a bit buggy (\newlist doesn’t quite work right). So I decided to just replace all example environments with enumerate, and use two new environments, compactlist and numberedlist, for enumerated lists that should restart the numbering (i.e., are not example environments). Labelled items now use the more standard \item\label{label}.
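To make the change to labelled items concrete, here is a minimal sketch of the before and after. The label keys (ex:rain, ex:cold) and the sentences are made up for illustration, and I am assuming the old environment is literally called example. The sketch uses a plain enumerate, so it does not show how the numbering is kept consecutive within a chapter.

```latex
\documentclass{article}
\begin{document}

% Before (needs forallx.sty, shown in comments only for comparison):
%   \begin{example}
%     \item[\ex{ex:rain}] It is raining.
%   \end{example}

% After: a standard list with a standard \label on each item.
% The label keys are hypothetical.
\begin{enumerate}
  \item\label{ex:rain} It is raining.
  \item\label{ex:cold} It is cold.
\end{enumerate}

% References in the running text resolve to the item numbers.
Sentence~\ref{ex:cold} makes a different claim from sentence~\ref{ex:rain}.

\end{document}
```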

References (with hyperlinks) were working already, but it’s nice if a link to, say, a section has all of “section V.2” as the active link, not just the “V.2”: it’s easier to click or tap, and it’s clearer for screen readers what the link goes to. So I replaced every \ref with \cref and used the cleveref package, which is also supported by LaTeXML.
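A minimal sketch of the \ref to \cref switch, with a made-up label key and section title. In ordinary pdfLaTeX, it is cleveref’s nameinlink option that pulls the word “section” into the link; I am only assuming here that the option is harmless under LaTeXML, and it may not be needed there at all.

```latex
\documentclass{article}
\usepackage{hyperref}
% cleveref must be loaded after hyperref.
% nameinlink: the name ("section") becomes part of the hyperlink.
\usepackage[nameinlink]{cleveref}
\begin{document}

\section{Truth tables}\label{sec:truthtables} % hypothetical label key

% Old style: only the number is a link, and the word is typed by hand.
See section~\ref{sec:truthtables}.

% New style: cleveref supplies the word, and the whole phrase is one link.
See \cref{sec:truthtables}.

\end{document}
```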

I wanted the proofs to display better and, most importantly, to work for blind users relying on screen readers. (The specific impetus for trying this was a request from the accessibility office at the University of Cincinnati to provide an HTML version for a blind student.) The solution I implemented is described in the accessibility notes. Technically, it required some work, though: the fitch package actually produces proofs as a LaTeX array environment, which LaTeXML turned into a MathML array (as it should). But MathML arrays are hard to navigate: it would be better to produce an HTML table instead, where the MathML (i.e., the formulas) just appears in the table cells, so you can navigate from line to line (and, e.g., let your screen reader read out line numbers, formulas, and justifications to you).

So I redefined the relevant bits of fitch so that, instead of LaTeX’s \begin{array}, & and \\ to indicate array starts, cells and line breaks, they directly produce HTML <table>, <tr> and <td> tags. A bit of CSS then provides the styling of the result (i.e., the scope lines and bars under assumptions). It wasn’t too much more work to also produce the invisible extra info for screen readers (count scope lines, insert markers for “begin subproof” and “end subproof”). This is done by adding some cells (the subproof level counters) and some empty <div> elements with the aria-label attribute: aria-label="some text" tells a screen reader to say “some text” when it comes across that element on the page, but “some text” doesn’t appear anywhere visually. Warning: the ability to insert raw HTML into the LaTeX source code is actually a feature of BookML, so it doesn’t work if you’re using just LaTeXML. [Update: since the subproof level text is visually hidden anyway, I might just include the subproof start/end announcements directly rather than as aria-labels.]

The insert-raw-HTML-with-aria-label strategy also helped with another challenge: the “iff” that’s pronounced by screen readers the same way as “if”. I defined a command that produces <span aria-label="if-f">iff</span>: it looks just like “iff” on the page but a screen reader (should) pronounce it “if-eff”. [Update: it might be easier/better to just put in if&NoBreak;f, which renders as “iff” with an invisible non-breaking space between the two ‘f’s but should also be pronounced “if-eff”. But you could still use the aria-label solution if you wanted it to be read out always as “if and only if”. Update^2: I’ll probably remove this in favor of instructions on how to customize the screen reader; see this post on overriding screen reader pronunciation.]
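Roughly, the shape of such a command is as follows. This is a sketch only: \rawhtmlor and \Iff are made-up names, and \rawhtmlor stands in for whatever raw-HTML command BookML actually provides. It is defined here as a dummy that just typesets the print fallback, so the file compiles with plain pdfLaTeX; in the HTML build it would have to be replaced by the real thing.

```latex
\documentclass{article}

% HYPOTHETICAL stand-in for BookML's raw-HTML facility (not its real name).
% #1 = HTML intended for the web version, #2 = text for the print version.
% This dummy definition discards #1 and typesets #2.
\newcommand{\rawhtmlor}[2]{#2}

% \Iff is also a made-up name; \iff is already taken by a math-mode arrow.
\newcommand{\Iff}{\rawhtmlor{<span aria-label="if-f">iff</span>}{iff}}

\begin{document}
A sentence is a tautology \Iff{} it is true on every valuation.
\end{document}
```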

ARIA labels are also useful for figures generated by tikzpicture. LaTeXML turns those into inline SVGs, which don’t have a standard way of providing a text description. (<img> tags take an alt attribute, which LaTeXML generates using the \includegraphics option alt={Some text} of core LaTeX, but nothing like this exists for tikzpicture.) So I defined an arialabel environment that surrounds a tikzpicture with a <div> tag that does take an aria-label attribute. [Update: this is untested in any screen reader except ChromeVox, and not supported according to the spec: aria-label is only allowed on interactive elements.]

What turned out to be a fair bit of work was actually debugging the CSS: there were a few formatting issues on the HTML side where I couldn’t tell whether they were bugs in my code, bugs in LaTeXML, or bugs in BookML. Often it turned out that some bit of CSS provided by BookML conflicted with some other bit, so I had to provide some extra CSS to override some things (e.g., BookML made paragraph headings too large).

The result is good (if I may say so myself) but not perfect. In particular, the MathML output by LaTeXML isn’t quite optimal: many of the symbols we logicians use either aren’t the same type as when they’re used in regular math (e.g., for us an arrow is an operator while for mathematicians it’s a relation), or we would like them pronounced differently (e.g., no one says “left tack” for the provability single turnstile ⊢). This requires another layer: LaTeXML outputs MathML, but then we use MathJax to display the MathML and produce MathML that can be read out with a screen reader. So it’s not clear where to fix things. But that’s a project for future work.
