This post is about defining the area of text preprocessing and text-embedded programming, or, a bit shorter, Preprocessing and Text-Embedded Programming (PTEP). This problem is not recognized as a unified area, but it exists in different forms and in different contexts. We will first conceptually define the problem domain, and then present some ideas about addressing it in a more universal way.
We will start by defining text preprocessing as an operation that takes text as input and produces similar text as output, serving as a way to automate manual editing of text. We call it "preprocessing" since there is normally some standard use of that type of text that would be called "processing", such as compilation or interpretation of a language, rendering of an HTML page, and similar. A typical example of a preprocessor is the C programming language preprocessor, which is mostly used for simple text inclusions or exclusions based on some configuration parameters, and for simple text replacements, before the C program code is passed to the compiler.
Text-embedded programming is a related but generally different concept from text preprocessing. We define text-embedded programming as any form of computer programming where code is embedded in an arbitrary text and can be executed in place. One of the first examples of text-embedded programming is the TeX typesetting system by Donald Knuth, released in 1978. TeX has its own language for annotating text to prepare it for typesetting and printing in the form of papers, books, and similar; but this language also includes a "macro" language for transforming text in place before the final preparation of the output pages. This macro part of the language is a form of preprocessing, but also of text-embedded programming, because it is Turing-complete and one could write a general-purpose program in it. There is a famous example, created by Knuth, of a table of prime numbers written in the TeX macro language.
The second, more obvious example of text-embedded programming is the PHP programming language. A PHP program file is usually an HTML file with snippets of PHP code inserted in the file. The file is processed, or we could say preprocessed, before being delivered to the web browser, in such a way that the PHP snippets are replaced with their output, produced using the command echo. The snippets are delimited with the strings <?php and ?>, or simply with <? and ?>. This model is particularly convenient for fast development of web apps, where we can start with a static (pure) HTML page and incrementally replace pieces with dynamic, PHP-generated content. A similar approach was used in the ASP (Active Server Pages) and JSP (JavaServer Pages) engines, both of which use the <% and %> delimiters for code snippets.
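To make this processing model concrete, here is a minimal sketch of a replace-mode preprocessor written in Perl. It is not how the PHP engine is implemented; it only illustrates the idea of finding delimited snippets, running them, and substituting each snippet with whatever it printed. The bare <? and ?> delimiters are assumed, and the snippet language here is Perl rather than PHP.

    #!/usr/bin/perl
    # minimal replace-mode sketch: substitute each <? ... ?> snippet
    # in the input with the output that the snippet prints
    use strict;
    use warnings;

    local $/;                  # slurp the whole input
    my $text = <>;
    $text =~ s/<\?(.*?)\?>/run_snippet($1)/ges;
    print $text;

    sub run_snippet {
        my ($code) = @_;
        my $out = '';
        open(my $fh, '>', \$out) or die $!;   # capture prints into $out
        my $old = select($fh);
        eval $code;                           # run the embedded Perl code
        warn "snippet error: $@" if $@;
        select($old);
        close($fh);
        return $out;
    }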
The third example of text-embedded programming is the project Jupyter, formed in 2015, which supports inclusion of Python, Julia, or R snippets in a file called a Jupyter Notebook. Although this example of text-embedded programming is not transparent, in the sense that a Jupyter Notebook is not a plain text file, it is still close to a text file: the text is marked up in a language called Markdown, which gets translated into HTML, and the notebook allows inclusion of arbitrary code in Python (or another supported language) that can be executed. The result of the execution is shown in the notebook itself. This is a novelty compared to PHP, for example; we will call it the update mode, as opposed to the replace mode used in PHP, ASP, JSP, and similar template languages.
I would like to propose here a universal, i.e., general-purpose, system for preprocessing and text-embedded programming (PTEP). It is implemented to some extent in the system Starfish. The following are the main goals of this universal approach, and of the Starfish system in particular:
1. Universal Preprocessing and Text-Embedded Programming: The goal is to create a universal system that can be used for PTEP on various types of text files (e.g., HTML, LaTeX, Java, Makefile, etc.).
2. Update and Replace Modes: We want to support two modes of operation: replace mode, similar to PHP or the C preprocessor, where the snippets are replaced with their output and the complete result is saved in an output file or written to the standard output; and update mode, similar to Jupyter, where the snippet output is appended to the snippet in the updated source file (see the first sketch after this list).
3. Flexible PTEP: A goal is to have a flexible system in the sense that the notation for marking the embedded code is customizable by the user. We call any string or string pattern that can be used to identify the code a hook. For example, the PHP hooks are <? and ?>, as a prefix and a suffix of the embedded code. Changing prefix-suffix hooks is only one form of flexibility: we can have hooks of other forms, such as single-string hooks and regular-expression hooks, and we also want to be able to modify the way snippets are evaluated (see the second sketch after this list).
4. Configurable PTEP: To make the system more usable in projects, we need a way to customize PTEP at the directory level, and this customization should follow the directory hierarchy to provide recursive sub-directory customization (a hypothetical configuration sketch follows the list).
5. Transparent PTEP: Our goal is to keep PTEP files transparent in the sense that when a file is processed, in either update or replace mode, it can still be used directly by its primary system. For example, an HTML file is still an HTML file viewable by a browser, a LaTeX file is still a LaTeX file processable by LaTeX, and so on. We call this transparency, since a user can open the file with a simple editor and directly view and edit it. As a comparison, a Jupyter notebook is a special-format JSON file, which needs to be processed to produce HTML, LaTeX, or other forms usable by a user.
6. Embedded Perl: The main principles described here could be implemented in many languages, but I find Perl particularly convenient for the Starfish implementation, for use in the code snippets, and for configuration. As a comparison, TeX uses its own language for PTEP, and it is difficult to use since its paradigm and notation style are so different from mainstream programming languages. The C preprocessor works well for C, but attempts to use it in other contexts, such as Imake for Makefiles, were not very successful, and it proved difficult to use because it was not meant for that kind of context. Using a general-purpose language that is also used for other purposes has a clear advantage, and Perl's succinctness and expressiveness in working with strings in particular make it an excellent candidate (an example snippet is sketched after this list).
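To make the two modes in goal 2 concrete, suppose an HTML source file contains the line below. The <? and ?> hooks and the output markers are purely illustrative here, not necessarily the actual Starfish notation. In replace mode the generated output file contains only the snippet's output; in update mode the source file itself is rewritten so that the output is kept next to the snippet, between markers, and can be refreshed by a later run:

    Source line with an embedded Perl snippet:
        <p>The answer is <? print 6*7 ?>.</p>

    Replace mode, the same line in the generated output file:
        <p>The answer is 42.</p>

    Update mode, the same line in the updated source file:
        <p>The answer is <? print 6*7 ?><!--out-->42<!--/out-->.</p>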
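As a sketch of what the hook flexibility in goal 3 could look like, the prefix and suffix used to recognize snippets can simply be data, chosen per file type or overridden by the user. The sketch below is in Perl; the names and the particular hook patterns are hypothetical, not the actual Starfish interface.

    # hypothetical table of prefix/suffix hooks per file type
    my %hooks = (
        html => [ qr/<\?/,  qr/\?>/ ],   # PHP-style hooks
        tex  => [ qr/%<\?/, qr/!>/  ],   # snippets hidden in LaTeX comments
        make => [ qr/#<\?/, qr/!>/  ],   # snippets hidden in Makefile comments
    );

    # collect the code of all snippets found in text of a given file type
    sub find_snippets {
        my ($text, $type) = @_;
        my ($pre, $suf) = @{ $hooks{$type} };
        my @code;
        while ($text =~ /$pre(.*?)$suf/gs) { push @code, $1 }
        return @code;
    }

Because the hooks in this representation are ordinary regular expressions, regular-expression hooks come for free, and a single-string hook can be represented as a pattern that matches that string literally.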
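For the directory-level configuration in goal 4, one natural design, sketched here with entirely hypothetical file and variable names, is a small per-directory file that is itself written in Perl and is evaluated before the files in that directory and, recursively, its subdirectories are processed; a deeper directory could override it with its own file:

    # hypothetical per-directory configuration file (say, ptep-config.pl)
    our ($Mode, %Hooks, %Var);                # names here are only illustrative
    $Mode = 'update';                         # default mode for this subtree
    $Hooks{html} = [ qr/<\?perl/, qr/\?>/ ];  # override hooks for HTML files
    $Var{project} = 'Starfish examples';      # a value visible to snippets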
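Finally, as an example of the kind of snippet that goal 6 enables (the hooks are again only illustrative), a few lines of embedded Perl can generate repetitive HTML that would be tedious to write and maintain by hand:

    <ul>
    <? for my $i (1 .. 5) { print "<li>Item $i</li>\n" } ?>
    </ul>

In replace mode the delivered file would contain the five <li> lines in place of the snippet; in update mode the generated lines would be kept next to the snippet in the source file and refreshed on every run.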