Recipes for Research
October 28, 2019
October 28, 2019
Henry Ford is famous for inventing the assembly line, a place where a complex product could be created in a series of repeatable steps. In programming terms, he described an algorithm for producing a motor vehicle. By doing so, he improved the quality and consistency of the final product while at the same time reducing its cost of production.
100 years later, it is information even more than motor cars which drives the progress of society. We have become dependent on complex electronic documents which, like a car, need to be assembled, edited, and used by others in a consistent and repeatable way. Yet no algorithmic processes have been described for document creation, verification, editing, or review.
In this presentation, we’ll explore the idea of looking at the creation of text narratives as a programming process and describe the activities of creating, editing, verifying, and disseminating written documents in algorithmic form.
If a programming language could be developed which described written documents in this way, the art and science of communication could be improved and simplified in the same way as Ford did for automobiles.
Note: This post contains slides and background concepts on the (msl) language by David Bethune. The (msl) expressions discussed are fully explained in the separate MSL Specifications.
Problems Caused by Current Text Systems
All current systems for working with text have two common ancestors. The first is the printing press, attributed to Gutenberg but actually invented by Koreans several hundred years before him. Our current use of precomposed paper documents which are fixed in their content and disconnected from other papers is the direct descendant of the printing process. This isn’t how our minds work: we retain fragments of documents: words, phrases, and ideas; and we connect them to other fragments in a mental map. We don’t visualize or containerize an entire document as one entity. Every paper we see simply refers to other things we already know and then adds some new context or commentary on top of that.
The other ancestor of all electronic communications is the memory and storage systems used by digital computers. Everyone knows that computers reduce everything to “ones and zeros.” There’s no room in a system like that for contextualizing or differentiating one set of bits from another. The same ones and zeros could represent text about Walt Disney, a photo of him, or a history of his company’s stock performance.
The combination of paper-based communications which have no “smarts” and are only loosely correlated with each other and of digital systems with lots of processing power but no idea of what their digital memories contain has resulted in text processing systems which are far behind our needs today.
Benefits of Looking at Text Programmatically
Computer programs are expected to be predictable in their outcomes, traceable, reproducible, debug-able, and extensible in that they can be easily shared with and reused by others. None of these things is true for written documents today. But viewing text editing as a programmatic process could open the door to a new way to use writing in society. It introduces the possibility of creating a programming language specifically for text editing, one which could enable new ways to write, verify, and use documents.
Let’s examine some key programming concepts and see how they apply to research, writing, and text editing. As we go, we’ll build the requirements for a programming language that can represent all the actions in a typical (human) text editing session.
An algorithm is simply a recipe, a process for producing a specified result. The word came into English in the 13th century and is borrowed from the surname of a Persian mathematician, al-Khwarizmi, who began to codify the “recipes” for arithmetic in the 8th century.
What is the recipe for non-fiction writing? Usually, it looks something like this:
- Read a bunch of stuff.
- Mentally connect ideas and facts across documents.
- Notice facts which appear repeatedly and others which are outliers.
- Use existing text to validate your own new hypotheses and observations.
- Draw ideas and conclusions from some things you read, discarding others.
- Make notes and write new text about what you read.
- Share your writing with others, who fold it in at their own step #1.
Of course, the real recipe includes an unknown number of “loops” as we cycle back through the steps in no particular order until we decide the text is complete. Only by following all the original steps could we write an algorithm that produced the same document, along with its references, annotations, and edits.
In 1945, Vannevar Bush was an engineer and inventor who had worked on the Manhattan project. He saw first hand how people worked with complex documents and used them together with their own notes and writing to produce new “a-ha” moments. By attempting to automate the task, he described an algorithmic process for research, though unintentionally.
After we dropped the bomb in WWII, Bush was deeply concerned about the direction that man’s thinking was taking. He proposed a new tool to help mankind “think better.” Called a memex, this was to be a microfilm-based machine that implemented a research algorithm mechanically. Bush’s description of the machine’s behavior allows us to induce the algorithm it must be “running.”
The memex was intended to:
- Record everything you read, photographing it onto a microfilm reel.
- Record everything you wrote, photographing it in parallel with the pages you read.
- Allow sharing a single microfilm frame or an entire reel of research with others.
A machine with nothing more than these three features would still be revolutionary today. In order to create one, we should look at how specific aspects of reading, writing, and research scenarios could be described in modern programming terms. If we’re to define a programming language for text editing recipes, that language will need to emulate these three features of Bush’s design and enable them through code.
Linearity and Chronology
Because a computer’s digital memory is a “blank slate,” the chronological order of operations determines what kind of output it creates. Early video game systems, for example, used cartridge memories. Yanking out the cartridge and putting in a new one would entirely reset the computer to a new state. You may have been playing Tank a few minutes ago but now you’re playing Pitfall, which has nothing to do with Tank and knows nothing about the operation or conditions of that game.
With the earliest computers, jobs were run in batches with a new computing environment being created for each user. Realizing that the hardware was spending most of its time idle, programmers then developed timesharing systems – but each user’s job would still “start over” with a new fresh slate.
Every user’s program consists of instructions which must be followed in a strict order to produce the expected program result. In older languages, these lines were sometimes numbered. Modern programming languages interpret statements in the order in which they appear in a text document, and this gives us a jumping-off point for looking at text editing as a linear program.
A computer’s activities may be based only on what’s in its memory and storage right now, but a human’s activities always contain references to previous work or information. A programmatic view of research, reading, and writing, therefore, must extend across more than one work session. It must incorporate elements that were viewed or edited earlier in the chronology, for this is how we work as human beings.
One aspect of human behavior that’s shared with computers is particularly relevant here. A computer can only operate on instructions in the order in which they are provided. People, too, only operate on text (by reading, editing, annotating, or writing it), in a strict chronological order.
A document consisting of prose or one made up of tables of numbers can be organized as a chronology, even though the facts and figures may have been composed in an order that differs from the final document. Nearly all writing goes through an editing phase where some text is added and other text removed. While the document itself can be read linearly, the final output doesn’t contain any information about this editing process. It doesn’t show what materials were consulted, added, or discarded in the creation and organization of the final text.
A programming language that described text editing would need to act like Bush’s memex machine and “photograph” every piece of reading and writing. Edits would necessarily appear later in the list of instructions, since they were performed later. Just like the memex’s indelible microfilm records, a program that represented text editing would never actually change any existing text. It would simply record editing (or excision) as separate programming language statements that appear later in the recipe. A language which described reading, writing, and editing in this kind of chronological form – necessitating references to existing materials rather than eschewing them – would be an important first step in approaching written communications as programs.
Such a language would also allow “rewinding” text to any previous state. This would be a huge advancement over simple “undo” functions in a word processor or even file versioning. The same file could simultaneously represent all the states of a document’s creation, from its inception to its final form – along with all the references and notes used in between. This simple idea of recording everything, as suggested by Bush, enables all of the other programming magic in our text editing language.
Imperative & Procedural Programming
A simple list of instructions lends itself well to the idea of imperative programming. The key feature of this programming style is that it depends on the idea of a global state which is updated after each instruction. Future instructions rely on the current values from this system state, not on the literal instructions that came before them. This lends an uncertainly to their interpretation because it can be hard to predict the future state from a list of past instructions.
Imperative programming further developed into procedural programming where repetitive actions could be encapsulated into procedures and passed varying arguments. In the presence of a global system state, however, the output of a procedure is indeterminate, further complicating the debugging of these types of programs.
A language for recording research should not depend on a global state that’s unpredictable because that isn’t how we work with written materials in the real world. We are always extracting, moving, deleting, and adding to actual words that we can see, not a hidden system state.
Functional programming was proposed by John Backus at IBM (where he previously invented FORTRAN) as a way to resolve the ambiguities in imperative programming and its reliance on a system state. In the functional style, the output of a function relies entirely on its inputs and not on hidden values from the system state. This corresponds better to the human way of working with documents, each of which combines fragments from others along with new work.
What if imperative and functional programming could be combined so that the language expressions themselves held the “system state?” That way, a function could collect its arguments from previous language statements in the instruction list. Rather than being hidden and unpredictable, the state would be present and easily examined. Any expression could then be resolved into a functional one with all of its arguments fully coerced into literal values.
An ideal text recording language would retain the command-oriented characteristics of an imperative language (“Do this, then do that.”) while at the same time offering the clarity and reliability of a functional language (“Given this, do that.”).
Operators & Operands
Whatever style of programming we use, there is the question of syntax within the individual expressions themselves. From the highest level language down to machine instructions, programming statements consist of two parts: an operator which tells what to do, and an operand which tells what to do it with. Operators are literally wired into a chip’s design while operands are highly variable, so the traditional approach has been to write operator operand in that order, and this applies all the way down to machine language.
But this is the opposite of the way humans approach text. We do not say to ourselves, “I’d like to make something bold. Which something? This text over here.” We start with the operand. “I’d like to take this text over here and make it bold, or “I’d like to take every occurrence of this text anywhere and change the value.” A programming language that was interpreted this way would have a much more natural reading and writing order when it comes to recording and editing text.
If our text recording language were to start with values and then apply changes to them, the entire language might read as a list of instructions in operand operator form. Working this way means giving up access to traditional languages within the same program because it would be difficult or impossible to interpret the results. The language would need to be entirely built on text operands which are then passed the operations to be done to them.
Variable Assignment and Recall
In narrative text, information is communicated by what would in programming terms be called assignment statements. If we write that Walt Disney was born in 1901, we are essentially assigning the value 1901 to a variable called “Walt Disney’s birthday.” In other words, these two forms are equivalent:
“Walt Disney was born in 1901.” ⇒ WALTDISNEY:birthday = 1901
Other kinds of text can also be reduced to assignment statements. The periodic table, for example, shows the atomic number of each element. We could extract this into an assignment statement.
“Hydrogen 1” ⇒ HYDROGEN:number = 1
While every new piece of writing should contain some unique information or insights from the author, the vast majority of what we read and write consists of known facts like these. What day of the week is May 8th? Which drugs has this person taken and what are their interactions? All writing is peppered with facts which can verified independently from the new materials that surround them.
Our text programming language needs some way to indicate when facts like these appear in a sentence. It’s easy to scan a document for common nouns like people, places, and things. If we include in our language an expression to isolate or denote facts, then we can assign the facts to variables, give each variable a value, and recall that value later. We can also easily determine if a later expression is trying to assign a different value to the same variable.
Traditional text comparison tools show the literal differences between two documents, usually line-by-line. But this kind of comparison doesn’t help when the documents are comprised of entirely different text. Recently, it was discovered that Jupiter has 10 more moons than previously known, bringing the total number to 79. Many published documents and web pages will have the old number. A line-by-line comparison of a new paper on Jupiter with one published in 2018 won’t show where the old paper is wrong. But re-casting both documents into a programming language and isolating the variable assignments (JUPITER:moons = 79) would make it easy to review, update, and re-use any older publication that had assigned a different value (JUPITER:moons = 69).
Thinking of facts in narrative text as variable assignments opens up two new possibilities. First, we can recall the value of a variable that was previously set, regardless of when or where it was set. A student might record the periodic table at the start of a class semester and then recall its variable value in a paper written months later. There is no need to re-consult the periodic table document:
LITHIUM:number ⇒ 3
Secondly, we can test a new provided value against the existing value of a variable and correct any errors. The same periodic table loaded by the student at the start of class shows that the atomic number of lithium is 3. If the student were to write in a future paper:
“Lithium, having an atomic number of 7…”
The system could detect that 7 is not equal to 3, the currently stored value for that variable. Values like this which are known to be correct (or “canonical”) could be marked as such and our text programming language will need an expression for doing so. Even many months later and without referring to the periodic table, errors in the student’s writing could be detected and corrected when the new variable assignment is compared with the existing, canonical value. Of course, the same method used to store the properties of the elements could be used to store maintenance requirements on airline parts, for example, and to check these against future documents.
Thinking of factual statements in writing as variable assignments allows sophisticated methods of future recall, verification, and checking. This the first and most important aspect of viewing text editing as a program. Variables are assigned and re-assigned in a chronological order. Variable values which are different from known good values can then be isolated, highlighted, or corrected automatically.
In all but the simplest programs, version control is a huge issue. Today’s systems store each edit to a file as a full copy of the file itself with only its current text represented. This necessitates separate tools to compare files between versions and detect the changes therein.
Comparing text between versions of a narrative document is even more arduous than comparing versions of program code. In legal documents, for example, strike-through text is used inline to show words or sentences which have been removed from the original text. New text is sometimes shown italicized or in a different font. Despite these visual affordances, there is no easy way to identify all the removed text, all the new text, or all the original text and no way to roll-back or review previous versions of a document.
Even in creative writing, authors may write many paragraphs about a character or scene and then discard them, lacking any way to store those previous versions of writing and refer to them later. A programming language which was specialized for recording research and writing would be able to record all of the versions of a document inside that document itself, including other materials which were consulted during writing, the annotations made, and text which was written but not included or was later moved to a different location in the narrative.
When combined with code for variable assignment and recall, as described previously, a version control system for writing would be even more useful. Such a system could detect changes in factual information over time, helping authors to highlight when critical information in their library has changed and alerting them to new interpretations of materials they’ve already consulted.
A hash serves as a way to fingerprint a piece of text or code (or a graphic in a binary file). A hashing algorithm produces the same hash when given the exact same text and an entirely different hash if the text varies by even one character. If the current state of a document is recorded into a hash value and that hash value is also included in the hash of the next round of edits, a chain of edits can be established. This is exactly the principle used in Bitcoin and other cryptocurrencies. The amount of money held by any one address is hashed together with others into a block, and the block hash is also included in the next block, forming a chain of transactions.
An ideal language for text editing would also be able to record the hash of any document’s state at any time. It should then include that hash into the next round of hashing after the next round of changes. As with a blockchain application, these hashes could be published independently of the materials themselves and used to prove a chain of edits. With linked hashes forming a blockchain, the language could be used to quickly determine if a document, a group of documents, or an entire editing session was fully intact or if it had been edited in any way, including showing which previous version (if any) is matched by an unknown version of a document by comparing hashes without having to actually compare their contents.
Digital computers can only make two kinds of tests, based on the values held in their registers or in storage. A computer can tell you if two values in registers are the same, or if a value in a register is the same as one in storage. This kind of testing mirrors the evaluation that humans do when they encounter facts: Do the facts align with something already stored in our memories?
In a programming system for writing, the values of all factual information (WALTDISNEY:birthday = 1901 and JUPTER:moons = 79) can be tested against future references. If factual information were to be extracted from narrative text and stored in a computer variable, it could be tested against any future reference to that information – regardless of the containing document or the format of the presentation.
Computers can also evaluate “near misses” and this is the basis of features like spelling correction. If a known value were held in storage, such as the name Walter Elias Disney, it could be matched against partial or incomplete text, such as “Walt” or “W.E.D.” Even words like “him” could be matched to an existing variable value, depending on context. Such a system would be able to connect multiple references to a person, place, or thing in narrative text – even if those references don’t include any exact text which matches the subject.
The ability to quickly test variables against each other makes it easy to see not only if a document has changed but in what way it has changed. The health plan section of a corporate HR handbook, for example, is likely to have very different formatting and parameters from year to year. This can make it difficult to compare or even find information in the new version of the document. Written as a text programming language, it would be easy to detect if a user was consulting an out-of-date version not just of the entire document, but of the specific facts inside it which have changed.
Branching & Looping
Branching and looping are essential parts of Turing completeness, something that defines every modern computer. In the example algorithm for research mentioned earlier, a human will repeatedly “branch” based on some discovered value, or “loop” over a set of input values and perform the same action with each one.
In the context of a text editing language, however, these are detrimental concepts because they open the system up to errors in interpretation and to security exploits. A safer way to represent human branches and loops is to simply record them as linear actions performed in a sequence. The rationale behind the actions (branching) or the repeated review or editing of the same object (looping) are essentially inconsequential. Therefore, a programming language for text editing should not include branching or looping instructions.
Recursion is a form of looping: looping over one’s self. A recursive program is one which performs some task on part of its inputs and then calls itself with the remaining inputs to complete the task. While recursion is a powerful programming methodology and useful in the implementation of a programming language for text editing, the actual language itself cannot include recursion due its unpredictability and the potential for exploits with unintended consequences.
Recursion often makes programs shorter, but this is of negligible benefit in a text editing system. Rather than call a routine which calls itself repeatedly, a text editing language should simply perform each individual action and record them in sequence. This simplifies “rewinding” the program’s (and the user’s) behavior to an earlier point and avoids ambiguities in the interpretation of the final result.
Compilation & Dependency Injection
In programming terms, compilation means packaging up an application in the smallest possible form that can run on another system. Research materials could greatly benefit from the same packaging. When an author refers to another person’s works, as most writing does, the original works themselves could be “compiled” into a composite document to allow the reader to access them without searching outside the system. The compiled file is smaller than the total of the materials available in the editing session because not all of the original materials are referenced in the final writing.
Today, we often use web links or citations of printed materials in our own original writing. The problem with these is that they are external to the document we’re writing. There’s no guarantee that the next person will find our links still active or will have access to the original printed documents we consulted. The programming correlate is dependency injection, in which outside code needed to run your program is injected into the source before the program starts.
A programming language for research and writing would need to include the full text of the materials that we consult (with their own facts separated into variable assignments, as previously mentioned) in order to be truly useful to the next researcher. The language should be expressive enough that dependencies can be easily identified, resolved, and injected into any code fragment. Finally, the entire body of research must be able to be compiled into a runnable program which produces the same body of research on another person’s machine. This follows Vannevar Bush’s model of recording the full text of the author’s reading along with his or her writing and sharing both of these with the next researcher.
Serialization & Backup
Unlike the apps we use today, a software program with memex-like capabilities would need to retain our work in between sessions. It should be easy to see what we worked on yesterday, last week, or last year. We should be able to extract any part of that work and use it in something new.
In programming terms, this represents serialization or saving the system state so that the machine can be restarted at the same place. The same programming language features that give us dependency injection also facilitate serialization. The system can take a snapshot of its state at any time and inject any material dependencies into the snapshot file. The system can also remove any text or materials on which the snapshot is not dependent (as in compilation), making it as small as possible.
A snapshot file like this is the ideal form of backup for a text editing session. By running the language statements again (on the same or another machine), the state can be returned with all documents intact. Canon or authoritative facts can also be separated into their own snapshot and used with a variety of editing sessions so that the original corpus doesn’t get repeated in the backup.
Bush’s memex introduced the idea of a storage medium that was also the sharing medium. The same microfilm roll used to document one’s own work could be duplicated, spliced, or printed, giving the next researcher the same path through the materials that you took – or any edited version of those that you chose. By the same token, the next person might inherit an entire reel and only choose to view or print a handful of frames.
A programming language for text editing would need to offer this same flexibility. An author should be able to extract and share any part of his or her work. Likewise, the next person should be able to extract and use only the pieces that are of interest. The dependency resolution and compilation ideas mentioned earlier can also provide this flexibility. The language definition should include an easy way to identify and resolve dependencies in any text fragment, up to and including entire books from which it quotes. The language should also make it evident from any point in text which materials are not being referenced so that the smallest possible useful piece can be extracted.
A file containing a list of programming expressions like these is not data itself; rather, its the instructions on how to create that data. These instructions can be started or stopped at any point. The conditions leading up to any expression can be quickly recreated (by running the expressions before it). Instructions which don’t affect the data we want can be easily ignored or removed. A programming language view of text is like a vector file for an illustration vs a bitmap image. It offers infinite transformation of the data contained in the document.
Extensibility in a programming language comes from two kinds of freedoms: what can you do with a language as its defined today and what you can do to it to evolve or re-purpose the language in the future.
Doing cool things with a language requires that it have a syntax which is easy to interpret in your own head. It should be clear from any language statement exactly what behavior it will have. It should also be clear what dependencies like previous code or documents are implied from any particular expression. It should be hard to make errors but easy to catch them.
A text editing language should simplify much of the “heavy lifting” of working with documents. It should be significantly easier to describe and achieve text editing results with this specialized language than it would be with a traditional programming language. A new language inspires more people if it is well-documented. It needs examples of all code syntax and multiple usage scenarios.
You can’t do too much to a language if it is broken or locked up in someone else’s intellectual property. The language description must be robust and fixed from the beginning such that old code “just works” in future systems. The language can’t be controlled by one company. It’s necessarily open source. It must be easy to incorporate this new language into many types of software, user interfaces, and storage systems and those new programs should be inter-operable by design, even if the language designer never imagined them.
A programming language for research recipes would open up exciting new possibilities for reading, writing, verifying, correcting, and sharing written information, but creating such a language requires new ways of thinking about programming and a radical departure from traditional language design.
Such a language needs to…
- Be linear and non-procedural.
- Offer a “vector” view into editing rather than a “bitmap” of text snapshots.
- Be imperative in declaration and functional in execution.
- Use an operand operator syntax.
- Offer variable assignment and recall across documents and programs.
- Hash documents for verification and version control.
- Use testing to find and correct data in variables.
- Avoid branching, looping, or recursion.
- Simplify debugging.
- Offer compiled and portable versions of any text fragment.
- Include built-in serialization and backup.
- Be easily extensible.
The Mimix Company
The Mimix Company is developing a programming language for text editing, built around the concepts discussed here. The open source (msl), Mimix Stream Language, will be incorporated into the Company’s own (msl) Engine and (msl) Viewer applications and will be available free for use in word processors, research tools, and databases of all kinds. It is the first software product for recording, verifying, correcting, and disseminating factual information based on this new and specialized programming language for text editing and analysis.
– David Bethune