This post is about defining the area of text preprocessing and text-embedded programming, or, a bit shorter, Preprocessing and Text-Embedded Programming (PTEP). This problem is not recognized as a unified area, but it exists in different forms and in different contexts. We will first conceptually define the problem domain, and then present some ideas about addressing it in a more universal way.
We will start by defining text preprocessing as an operation that takes text as input and produces similar text as output, serving as a way to automate manual editing of text. We call it "preprocessing" since there is normally some standard use of that type of text that would be called "processing", such as compilation or interpretation of a language, rendering of an HTML page, and similar. A typical example of a preprocessor is the C programming language preprocessor, which is mostly used for simple text inclusions or exclusions based on some configuration parameters, and for simple text replacements, before the C program code is passed to the compiler.
Text-embedded programming is a related but generally different concept from text preprocessing. We define text-embedded programming as any form of computer programming where code is embedded in an arbitrary text and can be executed in place. One of the first examples of text-embedded programming is the TeX typesetting system by Donald Knuth, released in 1978. TeX has its own language for annotating text to prepare it for typesetting and printing in the form of papers, books, and similar; but this language also includes a "macro" language for transforming text in place before the final preparation of the output pages. This macro part of the language is a form of preprocessing, but also of text-embedded programming, because it is Turing-complete and one could write a general-purpose program in it. There is a famous example, created by Knuth, of a table of prime numbers written in the TeX macro language.
The second, more obvious example of text-embedded programming is the PHP programming language. A PHP program file is usually an HTML file with snippets of PHP code inserted in the file. The file is processed, or we could say preprocessed, before being delivered to the web browser, in such a way that the PHP snippets are replaced with their output, produced using the command echo. The snippets are delimited with the strings <?php and ?>, or simply with <? and ?>. This model is particularly convenient for fast development of web apps, where we can start with a static (pure) HTML page and incrementally replace pieces with dynamic, PHP-generated content. A similar approach was used in the ASP (Active Server Pages) and JSP (JavaServer Pages) engines, both of which use the <% and %> delimiters for code snippets.
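To make this processing model concrete, here is a minimal sketch of a replace-mode preprocessor written in Perl. It is not how the PHP engine is implemented; it only illustrates the idea of finding delimited snippets, running them, and substituting each snippet with whatever it printed. The bare <? and ?> delimiters are assumed, and the snippet language here is Perl rather than PHP.

    #!/usr/bin/perl
    # minimal replace-mode sketch: substitute each <? ... ?> snippet
    # in the input with the output that the snippet prints
    use strict;
    use warnings;

    local $/;                  # slurp the whole input
    my $text = <>;
    $text =~ s/<\?(.*?)\?>/run_snippet($1)/ges;
    print $text;

    sub run_snippet {
        my ($code) = @_;
        my $out = '';
        open(my $fh, '>', \$out) or die $!;   # capture prints into $out
        my $old = select($fh);
        eval $code;                           # run the embedded Perl code
        warn "snippet error: $@" if $@;
        select($old);
        close($fh);
        return $out;
    }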
The third example of text-embedded programming is the project Jupyter, formed in 2015, which supports inclusion of Python, Julia, or R snippets in a file called a Jupyter Notebook. Although this example of text-embedded programming is not transparent, in the sense that a Jupyter Notebook is not a plain text file, it is still close to a text file: the text is marked up in a language called Markdown, which gets translated into HTML, and the notebook allows inclusion of arbitrary code in Python (or another supported language) that can be executed. The result of the execution is shown in the notebook itself. This is a novelty compared to PHP, for example; we will call it the update mode, as opposed to the replace mode used in PHP, ASP, JSP, and similar template languages.
I would like to propose here a universal, i.e., general-purpose, system for preprocessing and text-embedded programming (PTEP). It is implemented to some extent in the system Starfish. The following are the main goals of this universal approach, and of the Starfish system in particular:
1. Universal Preprocessing and Text-Embedded Programming: The goal is to create a universal system that can be used for PTEP on various types of text files (e.g., HTML, LaTeX, Java, Makefile, etc.).
2. Update and Replace Modes: We want to support two modes of operation: replace mode, similar to PHP or the C preprocessor, where the snippets are replaced with their output and the complete result is saved in an output file or written to the standard output; and update mode, similar to Jupyter, where the snippet output is appended to the snippet in the updated source file (see the first sketch after this list).
3. Flexible PTEP: A goal is to have a flexible system in the sense that the notation for marking the embedded code is customizable by the user. We call any string or string pattern that can be used to identify the code a hook. For example, the PHP hooks are <? and ?>, as a prefix and a suffix of the embedded code. Changing prefix-suffix hooks is only one form of flexibility: we can have hooks of other forms, such as single-string hooks and regular-expression hooks, and we also want to be able to modify the way snippets are evaluated (see the second sketch after this list).
4. Configurable PTEP: To make the system more usable in projects, we need a way to customize PTEP at the directory level, and this customization should follow the directory hierarchy to provide recursive sub-directory customization (a hypothetical configuration sketch follows the list).
5. Transparent PTEP: Our goal is to keep PTEP files transparent in the sense that when a file is processed, in either update or replace mode, it can still be used directly by its primary system. For example, an HTML file is still an HTML file viewable by a browser, a LaTeX file is still a LaTeX file processable by LaTeX, and so on. We call this transparency, since a user can open the file with a simple editor and directly view and edit it. As a comparison, a Jupyter notebook is a special-format JSON file, which needs to be processed to produce HTML, LaTeX, or other forms usable by a user.
6. Embedded Perl: The main principles described here could be implemented in many languages, but I find Perl particularly convenient for the Starfish implementation, for use in the code snippets, and for configuration. As a comparison, TeX uses its own language for PTEP, and it is difficult to use since its paradigm and notation style are so different from mainstream programming languages. The C preprocessor works well for C, but attempts to use it in other contexts, such as Imake for Makefiles, were not very successful, and it proved difficult to use because it was not meant for that kind of context. Using a general-purpose language that is also used for other purposes has a clear advantage, and Perl's succinctness and expressiveness in working with strings in particular make it an excellent candidate (an example snippet is sketched after this list).
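To make the two modes in goal 2 concrete, suppose an HTML source file contains the line below. The <? and ?> hooks and the output markers are purely illustrative here, not necessarily the actual Starfish notation. In replace mode the generated output file contains only the snippet's output; in update mode the source file itself is rewritten so that the output is kept next to the snippet, between markers, and can be refreshed by a later run:

    Source line with an embedded Perl snippet:
        <p>The answer is <? print 6*7 ?>.</p>

    Replace mode, the same line in the generated output file:
        <p>The answer is 42.</p>

    Update mode, the same line in the updated source file:
        <p>The answer is <? print 6*7 ?><!--out-->42<!--/out-->.</p>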
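As a sketch of what the hook flexibility in goal 3 could look like, the prefix and suffix used to recognize snippets can simply be data, chosen per file type or overridden by the user. The sketch below is in Perl; the names and the particular hook patterns are hypothetical, not the actual Starfish interface.

    # hypothetical table of prefix/suffix hooks per file type
    my %hooks = (
        html => [ qr/<\?/,  qr/\?>/ ],   # PHP-style hooks
        tex  => [ qr/%<\?/, qr/!>/  ],   # snippets hidden in LaTeX comments
        make => [ qr/#<\?/, qr/!>/  ],   # snippets hidden in Makefile comments
    );

    # collect the code of all snippets found in text of a given file type
    sub find_snippets {
        my ($text, $type) = @_;
        my ($pre, $suf) = @{ $hooks{$type} };
        my @code;
        while ($text =~ /$pre(.*?)$suf/gs) { push @code, $1 }
        return @code;
    }

Because the hooks in this representation are ordinary regular expressions, regular-expression hooks come for free, and a single-string hook can be represented as a pattern that matches that string literally.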
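For the directory-level configuration in goal 4, one natural design, sketched here with entirely hypothetical file and variable names, is a small per-directory file that is itself written in Perl and is evaluated before the files in that directory and, recursively, its subdirectories are processed; a deeper directory could override it with its own file:

    # hypothetical per-directory configuration file (say, ptep-config.pl)
    our ($Mode, %Hooks, %Var);                # names here are only illustrative
    $Mode = 'update';                         # default mode for this subtree
    $Hooks{html} = [ qr/<\?perl/, qr/\?>/ ];  # override hooks for HTML files
    $Var{project} = 'Starfish examples';      # a value visible to snippets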
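Finally, as an example of the kind of snippet that goal 6 enables (the hooks are again only illustrative), a few lines of embedded Perl can generate repetitive HTML that would be tedious to write and maintain by hand:

    <ul>
    <? for my $i (1 .. 5) { print "<li>Item $i</li>\n" } ?>
    </ul>

In replace mode the delivered file would contain the five <li> lines in place of the snippet; in update mode the generated lines would be kept next to the snippet in the source file and refreshed on every run.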