Implementing a toy programming language in R

R
programming
Published

July 25, 2024

In my ever-ongoing endeavor to understand the inner workings of R, I started poking around the source code for R’s read-eval-print loop.1 From there, I did the usual Googling and came across Robert (Bob) Nystrom’s book Crafting Interpreters. Needless to say I was overjoyed and wanted to dive right in, but alas, now is not quite the right time for me to start crafting my own fully-featured programming language from scratch.

What I do have time for, however, is spending a few hours across one or two summer evenings implementing just a tiny little toy language in R. Why on earth would you use R for such a thing, you ask? Good question. Since this whole mess started with me wanting to gain a deeper understanding of R, and since it’s the only language I know well enough to make it happen in a few evenings, here we are. In this post I will show you how I implemented — using only base R — my own very small, very polite and very impractical programming language named please.

1 Design goals

I quickly sketched out the design and basic implementation of please on a sheet of paper. It looked like the deranged scribblings of a madman, but summarized what I wanted to achieve:

  • A handful of functions to print strings, add and multiply numbers, clear the output, and exit
  • A keyword that signals end of input, similar to how C and C-like languages use semicolons; maybe something like please? That would be very polite!
  • A way to run please code, ideally hiding the fact that the whole thing runs on R in the background
  • Make the language as robust as possible, with user-friendly error messages
  • Avoid external dependencies; only base R is allowed

This will not amount to anything even close to Turing complete, but it will require me to whip together the bare essentials of a programming language.

2 What’s in a programming language?

Let’s consider first what a tiny little programming language like please needs to get going. There are three basic requirements:

  1. A lexer, which will make sense out of the input
  2. A parser, which will attach meaning to the input
  3. An interpreter, which will evaluate the input

This is not always the case, but holds true in general. Some languages make use of a compiler to turn code written in one language, typically a high-level language, into code written in another, typically a low-level language. Such fanciness is not on the table here. In fact, please will be purely interpreted; you can’t even put code into a file and run it from there.2

3 The please lexer

Consider the following sample input:

add 1 1 please

Before lexical analysis, this is just meaningless gibberish; a stream of characters with no inherent meaning.3 The job of a lexer, otherwise known as a scanner, is to turn gibberish into syntactically meaningful tokens. In natural language, tokens are nouns, verbs, punctuation and so on. In a programming language, tokens are functions, operators, symbols and everything else we use to make stuff happen. As such, tokens are the smallest meaningful units of a programming language.

To be pedantic, the lexer breaks down the input into lexemes, and then attaches some value or context to them. A lexeme plus a value equals a token. The lexer might say, this is a function; the name of the function is add, or this is a number; the value of the number is 1. Those would be the tokens. A real programming language has quite a few tokens, but please only needs a handful:

please_tokens <- c(
    LETTER = '[[:alpha:]]',
    NUMBER = '[0-9]+',
    SYMBOL = '[[:punct:]]',
    FUNC = 'say',
    FUNC = 'scream',
    FUNC = 'add',
    FUNC = 'multiply',
    QUOTE = '"',
    WHITESPACE = '\\s+',
    EOI = 'please',
    CLEAR = 'clear',
    EXIT = 'exit')

Armed with these tokens the lexer will be able to figure out what in the input is a letter, a number, a symbol, a function, a double quote and a whitespace. (Note that R happily allows duplicate names in a named vector, which is why FUNC can appear four times.) There is also the eponymous please keyword, signalling end of input, the CLEAR keyword used to clear terminal output, and the EXIT keyword for when the user wants to exit.

All right, we have a named vector of tokens. Let’s consider again the sample input above. How can the lexer be made to figure out what is what? One way is to use some clever regex. And not even that clever, to be honest. The R function gregexpr() comes in handy:

gregexpr returns a list of the same length as text each element of which is of the same form as the return value for regexpr, except that the starting positions of every (disjoint) match are given.

Thus, gregexpr() gets the position of each token in the input, if there is one. Here’s an example using the first token, letters, in the please_tokens vector:

input <- "add 1 1 please"
pos <- gregexpr(please_tokens[1], input)[[1]]
pos
[1]  1  2  3  9 10 11 12 13 14
attr(,"match.length")
[1] 1 1 1 1 1 1 1 1 1
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

According to the output, the input string contains letters at positions 1 through 3 and 9 through 14, which is correct. Good job, gregexpr(). The match.length attribute is, you guessed it, the length of each match, which in this case is 1 for every letter. Matching the letters first is necessary: function names also consist of letters, so if LETTER were matched later in the loop, it would overwrite the FUNC labels. As I said, this is not a very clever lexer, but it works well enough.
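
To see the overlap, here’s gregexpr() with the 'add' pattern (element 6 of please_tokens) on the same input. It hits positions 1 through 3, exactly where LETTER matched, which is why the later match wins (output truncated):

gregexpr(please_tokens[6], input)[[1]]
[1] 1
attr(,"match.length")
[1] 3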

In my somewhat rushed and hand-wavy implementation, the please lexer first creates a vector named matches of the same length as the number of characters in input and then loops through each token in please_tokens. Inside the loop, gregexpr() gets the position(s) of the current token and, if at least one is found, assigns the start position and the length of the match to a temporary data frame. Since there may be more than one match, another loop iterates over each row in the temporary data frame, calculates the start and end positions of the match, and finally puts the name of the token into the corresponding position(s) in the matches vector. Easy peasy:

matches <- vector("character", length = nchar(input))

for (i in 1:length(please_tokens)) {
    
    token <- please_tokens[i]
    token_name <- names(token)
    pos <- gregexpr(token, input)[[1]]
    
    if (pos[[1]] != -1) {
      
        tmp <- data.frame(pos_start = pos[pos > 0],
                          pos_len = attr(pos, "match.length"))
        
        for (row in 1:nrow(tmp)) {
            start <- tmp[row, "pos_start"]
            end <- start + tmp[row, "pos_len"] - 1
            matches[start:end] <- token_name
            
        }
    }
}

With add 1 1 please as input, here’s how matches will look:

matches
 [1] "FUNC"       "FUNC"       "FUNC"       "WHITESPACE" "NUMBER"    
 [6] "WHITESPACE" "NUMBER"     "WHITESPACE" "EOI"        "EOI"       
[11] "EOI"        "EOI"        "EOI"        "EOI"       

It seems like the lexer figured it out. But these are just the lexemes. In order to turn them into tokens they need some value or context. Luckily, it’s easy to split input into its constituent characters and place them together with the matches vector in a data frame:

lex_output <- data.frame(character = strsplit(input, "")[[1]], token = matches)
lex_output
   character      token
1          a       FUNC
2          d       FUNC
3          d       FUNC
4            WHITESPACE
5          1     NUMBER
6            WHITESPACE
7          1     NUMBER
8            WHITESPACE
9          p        EOI
10         l        EOI
11         e        EOI
12         a        EOI
13         s        EOI
14         e        EOI

Would you look at that. A working lexer. The code above is shoved into a function named please_lexer that returns lex_output. Next, parsing.
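
In other words, something along these lines (a sketch; the body is just the code shown above, wrapped in a function):

please_lexer <- function(input) {
    matches <- vector("character", length = nchar(input))
    # the token-matching loop from earlier
    for (i in 1:length(please_tokens)) {
        token <- please_tokens[i]
        token_name <- names(token)
        pos <- gregexpr(token, input)[[1]]
        if (pos[[1]] != -1) {
            tmp <- data.frame(pos_start = pos[pos > 0],
                              pos_len = attr(pos, "match.length"))
            for (row in 1:nrow(tmp)) {
                start <- tmp[row, "pos_start"]
                end <- start + tmp[row, "pos_len"] - 1
                matches[start:end] <- token_name
            }
        }
    }
    # lex_output: one row per character, paired with its token
    data.frame(character = strsplit(input, "")[[1]], token = matches)
}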

4 The please parser

Parsing, otherwise known as syntax analysis, is the task of taking the tokens spit out by the lexer and making sure that they conform to the grammar of the language. Yes, just like a real programming language, please too needs a grammar, with rules that must be followed at all times. It’s a simple grammar, to be sure, but I wanted it to be robust enough that the whole thing doesn’t break down on obvious syntax errors, and helpful enough that it gives proper error messages. Note that from now on, I’ll refer to lex_output, which was created by the lexer just a minute ago, as just input, since it’s fed into the parser.

The whole parser is basically a bunch of if/else statements, with each statement corresponding to a grammatical rule. As a general principle, I made it so that statements returning TRUE or matching some specific value are indicative of a syntax error. When a syntax error is found, R’s stop() function is called with an appropriate message, like so:

stop("Syntax error: You can't do that, you silly goose!")
Error in eval(expr, envir, enclos): Syntax error: You can't do that, you silly goose!

4.1 Exiting and clearing

I wanted there to be a quick and clean way of exiting and of clearing the terminal output. Ideally, these two commands should take precedence over everything else, so no resources are wasted.4 Rule #1 therefore states that if the first four characters in the input are exit, the program should exit immediately no matter what comes after, and if the first five characters in the input are clear, all terminal output should be cleared no matter what comes after:

if (length(grep("EXIT", input[1:4, "token"])) == 4) {
  expr <- ("EXIT")
} else if (length(grep("CLEAR", input[1:5, "token"])) == 5) {
  expr <- ("CLEAR")
} else { 
  ...
}

To be specific, it’s the tokens in input that are tested in the if/else statement, but since users typically don’t know or care about tokens or lexemes or the specifics of how the language is implemented, the rules are written to be user-friendly. Anyways. If either of the above happens, an object named expr is created, containing only an EXIT or CLEAR string. I’ll get back to expr later. If not, the parser moves on to create a variable that holds the number of rows in input (shown below) and then starts checking general syntax rules for function calls.
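
That row-count variable shows up in the later rules as input_len; presumably it’s nothing more than:

input_len <- nrow(input)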

4.2 General syntax rules for function calls

A very general thing to check is that there even is a function in the input; if not, there’s not much for please to do at this stage. But to be a bit more precise, I want please to only accept input that begins with a function call. Rule #2 states that the very first characters in the input must form a valid function call:

if (input[1, "token"] != "FUNC") {
  stop("Syntax error: Input must begin with a valid function call")
}

At this point the parser knows there’s a valid function, so it’d be wise to get its name. Notably, it should only get the first instance of a function name, in case there are several. A user might put a function name in a string, like so: say "add" please. In that case, say is the function name of interest. Doing this in base R took longer than it should have. I blame being spoiled by the tidyverse and whatnot:

# lag the token column one step
lag_token <- c(NA, input$token[-nrow(input)])
lag_token[1] <- input$token[1]
# count token changes; rows before the first change form the leading run
cumsum_condition <- cumsum(input$token != lag_token)
input_func <- input[cumsum_condition < 1, ]
# collapse the characters of that leading FUNC run into the function name
func_name <- paste0(input_func$character[input_func$token == "FUNC"], collapse = "")
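
As a quick sanity check of the cumsum trick, consider the lexed input for say "add" please: only the first three rows (the leading FUNC run for s, a, y) pass the cumsum_condition < 1 filter, so func_name comes out as "say" rather than "sayadd":

func_name
[1] "say"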

Even if a valid function call comes first, it would be weird to allow syntax like add1 1 please or say"hello" please. Ugh. Rule #3 states that the first character after a function call must be a whitespace:

if (input[nchar(func_name) + 1, "token"] != "WHITESPACE") {
  stop("Syntax error: Missing whitespace after function call")
}

That’s it, only three rules so far. Next up is handling the please keyword used to signal end of input, or EOI.

4.3 Syntax rules for the please keyword

As mentioned, please is a very polite language, but you can’t just say please willy-nilly. Rule #4 states that every input must end with the please keyword:

if (length(grep("EOI", input[as.integer(nrow(input) - nchar("please") + 1):nrow(input), "token"])) != 6) {
  stop("Syntax error: Keyword 'please' is not at end of input")
}

As with the whitespace following function calls, syntax such as add 1 1please looks ugly and simply cannot be allowed. Rule #5 states that there must be a whitespace before the please keyword:

if (input[nrow(input) - nchar("please"), "token"] != "WHITESPACE") {
  stop("Syntax error: Missing whitespace before 'please' keyword")
}

Now the parser is done with the please keyword, and still only five rules! On to the say and scream functions.

4.4 Syntax rules for the say and scream functions

I could have just gone with print for printing strings, but that wouldn’t be very interesting. It’s much more fun, albeit not very practical, to have a programming language that will say things and SCREAM things. I’m conventional enough, though, to think it makes sense for strings to be enclosed in double quotes. Rule #6 states that the second character after a call to say or scream, following the mandatory whitespace, must be a double quote:

if (input[nchar(func_name) + 2, "token"] != "QUOTE") {
  stop(paste0("Syntax error: Missing double quote after call to '", func_name, "'"))
}

Of course, where there is one double quote, there needs to be another. Rule #7 states that there must be another double quote following the first one, thus enclosing the string:

if (length(grep("QUOTE", input[as.integer(nchar(func_name) + 3):input_len, "token"])) == 0) {
  stop(paste0("Syntax error: Strings following call to '", func_name, "' must be enclosed in double quotes"))
}

I don’t want to bother with escape characters, which would make it possible to print double quotes within strings as well. Rule #8 states that double quotes may only be used to enclose strings:

if (length(grep("QUOTE", input[, "token"])) > 2) {
  stop("Syntax error: Double quotes may only be used to enclose strings")
}

The rest is easy; everything between the two double quotes is treated as a string meant to be output:

str_to_output <- sub('[^\"]+\"([^\"]+).*', '\\1', paste0(input[, "character"], collapse = ""))
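
To make the regex less opaque, here it is applied to a raw sample input (the collapsed character column boils down to the same string): everything before the first double quote is skipped, and the capture group grabs everything up to the second one.

sub('[^\"]+\"([^\"]+).*', '\\1', 'say "hello world" please')
[1] "hello world"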

Then, the parser converts the string either to uppercase for scream or lowercase for say:

if (func_name == "scream") {
  str_to_output <- toupper(str_to_output)
} else {
  str_to_output <- tolower(str_to_output)
}

Finally, an object named expr is created which contains the expression to evaluate later on:

expr <- paste0("print(", deparse(str_to_output), ")")

So far so good, and not even double digits yet with only eight rules. Time for the parser to tackle the add and multiply functions.

4.5 Syntax rules for the add and multiply functions

Calls to the add or multiply functions should be followed immediately by the numbers to add or multiply. Rule #9 states that the second token after a call to add, following the mandatory whitespace of course, must be a number:

if (input[nchar(func_name) + 2, "token"] != "NUMBER") {
  stop(paste0("Syntax error: Missing number after call to '", func_name, "'"))
}

It’s not possible to add a single number, so there must be at least one more. Rule #10 states that there must be at least one more number following the first one:

if (length(grep("NUMBER", input[as.integer(nchar(func_name) + 3):input_len, "token"])) < 1) {
  stop(paste0("Syntax error: At least two numbers are required for '", func_name, "'"))
}

The parser needs to differentiate between groups of numbers, and there’s no room for other characters. Rule #11 states that all numbers must be followed either by another number or a whitespace:

num_vec <- input[as.integer(nchar(func_name) + 1):input_len, "token"]
for (i in seq_along(num_vec)) {
  if (num_vec[i] == "NUMBER" && (i == length(num_vec) || !num_vec[i + 1] %in% c("WHITESPACE", "NUMBER"))) {
    stop("Syntax error: Numbers must be followed by another number or a whitespace")
  }
}
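
For instance, something like add 1x 2 please should trip this rule, since the 1 is immediately followed by a LETTER token:

# stops with: Syntax error: Numbers must be followed by another
# number or a whitespace
please_parser(please_lexer("add 1x 2 please"))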

So far the parser has mostly dealt with fairly straightforward if/else statements. For the next part, a bit more effort is required. The goal is to gather all the numbers supplied in the input and add or multiply them. The problem is that if the user wants to, for instance, add 99 and 1, the parser needs to understand that 99 is a single number, not two separate nines:

num_to_add <- as.numeric(input[grep("NUMBER", input[, "token"]), "character"])
grouped_numbers <- list()

current_group <- NULL
previous_token <- NULL

for (i in 1:nrow(input)) {
  if (input$token[i] == "NUMBER") {
    if (is.null(current_group) || previous_token != "NUMBER") {
      current_group <- input$character[i]
    } else {
      current_group <- paste0(current_group, input$character[i])
    }
  } else {
    if (!is.null(current_group)) {
      grouped_numbers <- c(grouped_numbers, as.numeric(current_group))
      current_group <- NULL
    }
  }
  previous_token <- input$token[i]
}

if (!is.null(current_group)) {
  grouped_numbers <- c(grouped_numbers, as.numeric(current_group))
}

grouped_numbers <- unlist(grouped_numbers)

The above code — not the most elegant, I admit — will loop through input and keep track of each number, and whether that number was preceded by a token other than NUMBER. If it was, then it’s the start of a new number; if it wasn’t, it’s the continuation of an existing number. Once that is done, all that is left for the parser to do is create an expr object:

if (func_name == "add") {
  expr <- paste0("print(sum(", deparse(grouped_numbers), "))")
} else {
  expr <- paste0("print(prod(", deparse(grouped_numbers), "))")
}
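
As an illustration, add 99 1 please gives a grouped_numbers of c(99, 1), so expr ends up as:

paste0("print(sum(", deparse(c(99, 1)), "))")
[1] "print(sum(c(99, 1)))"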

We’re done! Just eleven rules to create a — hopefully — working parser. The code is put into a function named please_parser that returns expr. The final step is to write the interpreter.

5 The please interpreter

Writing a barebones interpreter is fairly easy once the lexer and parser work, at least in the case of please. The parser dutifully returns the expression to be evaluated — the expr object — so all that is left for the interpreter to do is to (1) read and evaluate it, (2) print whatever needs to be printed, and (3) loop back to reading again. A REPL if there ever was one.

I first want the REPL to clear any existing terminal output and print a message of the day, because it seems professional:

system("clear")
message("This is please version 1.0.0\n",
        "See documentation for details:\n",
        "https://github.com/carldelfin/please\n")

After that a never-ending loop — while (TRUE) { ... } — is started. The first thing that needs to happen inside the loop is the read part, which starts a prompt that waits for and accepts user input. At first I wanted to go with R’s readline() function, but that only works in interactive mode.5 Instead, the prompt is a two-step procedure:

cat("~ ")
please_prompt <- readLines("stdin", n = 1)

The first line just prints a tilde and a whitespace; this is the prompt. I could have gone with R’s good old >, but I wanted to differentiate the two. Then readLines() is used to create a connection to stdin and wait for input.6 Whatever the user types is stored in please_prompt. Note the n = 1; it ensures that only one line of input, ended by hitting the Return key, is read.

Then comes the eval part. It’s just a nested call to the lexer and parser, but the important thing is that it’s wrapped in a tryCatch(). That way, if there are any syntax errors the error message will be printed and pars_output will contain NA, but the never-ending loop won’t exit. If there aren’t any errors, pars_output will contain the expr object that was returned from the parser:

pars_output <- tryCatch(
  please_parser(please_lexer(please_prompt)),
  error = function(cond) {
    message(conditionMessage(cond))    
    NA
  }
)

Last but not least, it’s time to print something. If pars_output is EXIT, exit R; if it’s CLEAR, clear the terminal output; otherwise, as long as it isn’t NA, parse and evaluate whatever is in pars_output:

if (!is.na(pars_output) & pars_output == "EXIT") {
  quit(save = "no")
} else if (!is.na(pars_output) & pars_output == "CLEAR") {
  system("clear")
} else if (!is.na(pars_output)) {
  eval(parse(text = pars_output))
}

That should be enough for a working interpreter. The code is put into a function named initiate_repl.
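
For completeness, here’s roughly what initiate_repl() looks like with the pieces above glued together (a sketch assembled from the snippets in this section, not the verbatim source):

initiate_repl <- function() {
    system("clear")
    message("This is please version 1.0.0\n",
            "See documentation for details:\n",
            "https://github.com/carldelfin/please\n")
    while (TRUE) {
        # read
        cat("~ ")
        please_prompt <- readLines("stdin", n = 1)
        # eval: any syntax error is printed, and pars_output becomes NA
        pars_output <- tryCatch(
            please_parser(please_lexer(please_prompt)),
            error = function(cond) {
                message(conditionMessage(cond))
                NA
            })
        # print (or exit, or clear)
        if (!is.na(pars_output) & pars_output == "EXIT") {
            quit(save = "no")
        } else if (!is.na(pars_output) & pars_output == "CLEAR") {
            system("clear")
        } else if (!is.na(pars_output)) {
            eval(parse(text = pars_output))
        }
    }
}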

6 Wrapping up

Everything I’ve shown above is put into a file named please.R, with initiate_repl() at the very bottom. Just put an alias in your shell config (i.e., alias please='Rscript /path/to/please.R'), source it, and that’s all there is to it. A lexer, parser and interpreter for a tiny toy programming language implemented using a bit of regex and eleven simple grammatical rules, all in about 250 lines of R code. Does it have a lot of features? Quite the contrary. Is it practical? Not at all. Does it work? Yes, just look:

Figure 1: Showcasing please in all its (very limited) glory.

What about syntax errors, you say? Here’s a few of them:

Figure 2: Syntax errors in please.

The final question is, did it take just one or two summer evenings to do this? Well. In a sense, yes. I’d say I wrote 90% of it in just a few hours, lounging in a cozy armchair at my mother-in-law’s place. The remaining 10% — fixing minor bugs and annoyances, refactoring, getting the screen recording to work, typing out this blog post, and so on — took quite a lot more time than anticipated. That’s on me; I even whipped up a logo to show off at the required GitHub repository.

At this point, please is indeed tiny. Maybe too tiny to even be called a programming language. In the future it would be fun to explore adding operators, thus replacing or extending add and multiply with +, *, -, / and so on, with proper precedence rules and everything. I’d also like to implement variables at some point, and a way to run please code from text files (something like please script.pls). A call stack would come in handy, which would be a fun challenge. But at that point I might as well go through Crafting Interpreters for real.

Footnotes

  1. It’s available in main.c, if you want to have a look, starting at line 122.↩︎

  2. But one day, perhaps, you might.↩︎

  3. You, dear reader, can hopefully figure out what the goal is, but the computer can’t.↩︎

  4. Not that please is likely to spin up your CPU fans anytime soon, but still.↩︎

  5. You can check if you’re in interactive mode using the suitably named interactive() function.↩︎

  6. Why is readline written with lowercase only, while readLines is written using camel case? One of R’s many idiosyncrasies.↩︎