I really didn't want to have to write a tokenizer and parser by hand for lispBM. Writing something like this is not something I feel entirely comfortable with. But it is important to step out of the comfort zone now and then.
For a long time the lispBM parser was implemented using the MPC library. This worked very well and all I had to do was come up with a set of regular expressions that expressed what kind of syntactical objects the parser should be able to recognize, and specify how these were to be related to each other in a tree structure (the parse tree). With that little bit of code in place, MPC provided a parser that read source code and returned a tree data structure. Then a little bit of code is needed to traverse the tree and generate the heap structure. On the microcontrollers that lispBM targets, memory is a valuable resource, and MPC was a bit hungry on that resource. So that is why this code came to be.
The code for parsing lispBM source into heap representations is located in the files tokpar.c and tokpar.h.
The parser can handle both plain text source code and compressed source data. This text will, however, just show the plain text case, saving compression and parsing of compressed code for later.
The parsing is done by reading tokens from the source code. The tokens represent, in a way, complete syntactical objects. In the lispBM tokenizer that means that, for example, ( is one token and 3.14159 is another. That is, a token can be one or more consecutive characters. The parser then uses these tokens, read out from the text, to produce the corresponding heap representation.
If you are interested in more of an overview of what lispBM is, read here.
The file tokpar.c starts off by defining a few names for the different tokens that make up valid lispBM programs. A value to represent errors is also defined, as well as a value that signals that the end of the source code has been reached.
#define TOKOPENPAR 0
#define TOKCLOSEPAR 1
#define TOKQUOTE 2
#define TOKSYMBOL 3
#define TOKINT 4
#define TOKUINT 5
#define TOKBOXEDINT 6
#define TOKBOXEDUINT 7
#define TOKBOXEDFLOAT 8
#define TOKSTRING 9
#define TOKCHAR 10
#define TOKENIZER_ERROR 1024
#define TOKENIZER_END 2048
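As an illustrative sketch (not part of tokpar.c), a small expression breaks down into these token values like this:

// The source text "(+ 1 2)" is read as the token sequence:
//   TOKOPENPAR, TOKSYMBOL "+", TOKINT 1, TOKINT 2, TOKCLOSEPAR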
There is a function called next_token that returns the next token in the source string. The return value is of the following type:
typedef struct {
  unsigned int type;
  unsigned int text_len;
  union {
    char  c;
    char *text;
    INT   i;
    UINT  u;
    FLOAT f;
  } data;
} token;
The token type can represent all the valid tokens using the union field in the struct. There is a type field to differentiate between kinds of tokens; it will be set to one of the values defined above. If there are no more tokens, the type field is set to TOKENIZER_END, and if there is an error it will be TOKENIZER_ERROR.
We also need to keep track of where in the source text we are currently reading tokens from. This state is managed by a struct called tokenizer_state, holding a pointer to the string and an index (pos).
typedef struct {
  char *str;
  unsigned int pos;
} tokenizer_state;
The next struct, called tokenizer_char_stream, exists to enable code reuse between the tokenizers for plain strings and compressed strings. It is an abstract representation of a stream of characters that we can peek into, drop from, and so on.
typedef struct tcs {
  void *state;
  bool (*more)(struct tcs);
  char (*get)(struct tcs);
  char (*peek)(struct tcs, int);
  void (*drop)(struct tcs, int);
} tokenizer_char_stream;
The void *state can be instantiated with either the tokenizer_state seen above or another variant of state for compressed source that also contains, for example, a decompression buffer.
Then a set of functions that work on any tokenizer_char_stream are defined to make the code more readable.
bool more(tokenizer_char_stream str) {
  return str.more(str);
}

char get(tokenizer_char_stream str) {
  return str.get(str);
}

char peek(tokenizer_char_stream str, int n) {
  return str.peek(str, n);
}

void drop(tokenizer_char_stream str, int n) {
  str.drop(str, n);
}
And the implementations of the different stream functions for the plain string case are defined as follows. When creating a tokenizer_char_stream for use on plain strings, the function pointers in the struct have to be pointed at these (or similar) functions.
bool more_string(tokenizer_char_stream str) {
  tokenizer_state *s = (tokenizer_state*)str.state;
  return s->str[s->pos] != 0;
}

char get_string(tokenizer_char_stream str) {
  tokenizer_state *s = (tokenizer_state*)str.state;
  char c = s->str[s->pos];
  s->pos = s->pos + 1;
  return c;
}

char peek_string(tokenizer_char_stream str, int n) {
  tokenizer_state *s = (tokenizer_state*)str.state;
  // TODO error checking ?? how ?
  char c = s->str[s->pos + n];
  return c;
}

void drop_string(tokenizer_char_stream str, int n) {
  tokenizer_state *s = (tokenizer_state*)str.state;
  s->pos = s->pos + n;
}
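To make the wiring concrete, here is a minimal sketch (not from tokpar.c; the real setup appears in tokpar_parse further down) of how a stream over a plain string is constructed and drained using the functions above:

// Sketch: build a character stream over a plain string and consume it.
tokenizer_state ts = { .str = "(+ 1 2)", .pos = 0 };

tokenizer_char_stream str;
str.state = &ts;
str.more  = more_string;
str.get   = get_string;
str.peek  = peek_string;
str.drop  = drop_string;

// Prints the source verbatim, one character at a time.
while (more(str)) putchar(get(str));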
That is what is needed when it comes to plumbing (and helper functions), and it is time to implement the tokenizer. First off there is one function per token. These functions can be thought of as try to tokenize functions. They all return an integer that is 0 in case the dedicated kind of token could not be read from the head of the stream. If the token can be read from the stream, the number of consumed characters is returned.
In the case of the (, ) and ' tokens, these functions return 1 if the sought token is there. They produce no other data.
int tok_openpar(tokenizer_char_stream str) {
  if (peek(str,0) == '(') {
    drop(str,1);
    return 1;
  }
  return 0;
}

int tok_closepar(tokenizer_char_stream str) {
  if (peek(str,0) == ')') {
    drop(str,1);
    return 1;
  }
  return 0;
}

int tok_quote(tokenizer_char_stream str) {
  if (peek(str,0) == '\'') {
    drop(str,1);
    return 1;
  }
  return 0;
}
Symbols are made up of a set of allowed characters. The characters that are allowed in the first position of a symbol are different from those allowed in any of the other positions. The following two boolean functions return true for valid symbol characters and false otherwise, of course.
bool symchar0(char c) {
  const char *allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+-*/=<>";

  int i = 0;
  while (allowed[i] != 0) {
    if (c == allowed[i++]) return true;
  }
  return false;
}

bool symchar(char c) {
  const char *allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-*/=<>";

  int i = 0;
  while (allowed[i] != 0) {
    if (c == allowed[i++]) return true;
  }
  return false;
}
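For example, these are the results to expect from the two predicates:

symchar0('a'); // true:  a letter may start a symbol
symchar0('7'); // false: a digit may not start a symbol
symchar('7');  // true:  digits are fine inside a symbol
symchar0('+'); // true:  operators like + are ordinary symbols in lisp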
In addition to returning the number of characters consumed, the rest of the try to tokenize functions also provide a value in the case of success.
int tok_symbol(tokenizer_char_stream str, char** res) {
  if (!symchar0(peek(str,0))) return 0;

  int i = 0;
  int len = 1;
  int n = 0;

  while (symchar((peek(str,len)))) {
    len++;
  }
  *res = malloc(len+1);
  memset(*res, 0, len+1);

  for (i = 0; i < len; i ++) {
    (*res)[i] = tolower(get(str));
    n++;
  }
  return n;
}
The tok_symbol function checks if the next character in the stream is a valid beginning character for a symbol. If it is, the function loops through the string for as long as each character it looks at is a valid inner symbol character. As soon as a non-symbol character is found, the loop exits and the valid characters are taken out of the stream (using get) and added to the result via the res pointer provided. From this code you can also see that tolower is applied to symbols. This means that the strings APA and apa refer to the same symbol in lispBM.
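A small usage sketch (assuming a stream str set up over the text "Apa bepa" in the same way as earlier): tok_symbol consumes the first three characters and hands back a lowercased copy.

char *sym = NULL;
int n = tok_symbol(str, &sym);
// n == 3 and sym contains "apa"; the stream now points at the space.
free(sym);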
int tok_string(tokenizer_char_stream str, char **res) {

  int i = 0;
  int n = 0;
  int len = 0;
  if (!(peek(str,0) == '\"')) return 0;

  get(str); // remove the " char
  n++;

  // compute length of string
  while (peek(str,len) != 0 &&
         peek(str,len) != '\"') {
    len++;
  }

  // str ends before tokenized string is closed.
  if ((peek(str,len)) != '\"') {
    return 0;
  }

  // allocate memory for result string
  *res = malloc(len+1);
  memset(*res, 0, len+1);

  for (i = 0; i < len; i ++) {
    (*res)[i] = get(str);
    n++;
  }
  get(str); // throw away the "
  return (n+1);
}
The tok_string function is very similar to tok_symbol. If there is a " character at the head of the stream, everything onwards (until there is another ") is read into the result. The tok_string function needs to be improved a bit to recognize some kind of escaped characters (like newline), which it currently doesn't. If you enter the string "hello \n", or even (print "hello \n"), into the REPL, it will reply with hello \n and not hello followed by a line break. As we will see below, character literals are given to the REPL as \#a for the character a, and the newline character is expressed as \#newline. Maybe it would make sense to also escape the newline character in the same way in a string? I have noticed that emacs-lisp uses the syntax ?a for the character a. Maybe going over to the same representation here makes sense.
So, currently, the tok_char function below looks for the character syntax, that is \#, followed by a character.
int tok_char(tokenizer_char_stream str, char *res) {

  int count = 0;
  if (peek(str,0) == '\\' &&
      peek(str,1) == '#' &&
      peek(str,2) == 'n' &&
      peek(str,3) == 'e' &&
      peek(str,4) == 'w' &&
      peek(str,5) == 'l' &&
      peek(str,6) == 'i' &&
      peek(str,7) == 'n' &&
      peek(str,8) == 'e') {
    *res = '\n';
    drop(str,9);
    count = 9;
  } else if (peek(str,0) == '\\' &&
             peek(str,1) == '#' &&
             isgraph(peek(str,2))) {
    *res = peek(str,2);
    drop(str,3);
    count = 3;
  }
  return count;
}
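A sketch of the expected behaviour, assuming a stream str wired up as before:

char c;
int consumed = tok_char(str, &c);
// For input \#a:       consumed == 3 and c == 'a'
// For input \#newline: consumed == 9 and c == '\n'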
The tokenizers for 28 and 32 bit signed and unsigned integers are all very similar. Some of them (28-bit unsigned, 32-bit signed and 32-bit unsigned) require that the number is followed by a type qualifier, such as u28, i32 or u32, so that the appropriate type can be set and the suitable cons cell structure can be allocated.
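To summarize (my reading of the tokenizers below), the literal forms map onto tokens like this:

// 42     -> TOKINT        28-bit signed   (tok_i, no qualifier)
// 42u28  -> TOKUINT       28-bit unsigned (tok_u)
// 42i32  -> TOKBOXEDINT   32-bit signed   (tok_I)
// 42u32  -> TOKBOXEDUINT  32-bit unsigned (tok_U)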
int tok_i(tokenizer_char_stream str, INT *res) {

  INT acc = 0;
  int n = 0;

  while ( peek(str,n) >= '0' && peek(str,n) <= '9' ){
    acc = (acc*10) + (peek(str,n) - '0');
    n++;
  }

  // Not needed if strict adherence to ordering of calls to tokenizers.
  if (peek(str,n) == 'U' ||
      peek(str,n) == 'u' ||
      peek(str,n) == '.' ||
      peek(str,n) == 'I') return 0;

  drop(str,n);
  *res = acc;
  return n;
}
int tok_I(tokenizer_char_stream str, INT *res) {

  INT acc = 0;
  int n = 0;

  while ( peek(str,n) >= '0' && peek(str,n) <= '9' ){
    acc = (acc*10) + (peek(str,n) - '0');
    n++;
  }

  if (peek(str,n) == 'i' &&
      peek(str,n+1) == '3' &&
      peek(str,n+2) == '2') {
    *res = acc;
    drop(str,n+3);
    return n+3;
  }
  return 0;
}
int tok_u(tokenizer_char_stream str, UINT *res) {

  UINT acc = 0;
  int n = 0;

  while ( peek(str,n) >= '0' && peek(str,n) <= '9' ){
    acc = (acc*10) + (peek(str,n) - '0');
    n++;
  }

  if (peek(str,n) == 'u' &&
      peek(str,n+1) == '2' &&
      peek(str,n+2) == '8' ) {
    *res = acc;
    drop(str,n+3);
    return n+3;
  }
  return 0;
}
32-bit values can also be entered using hexadecimal notation. The tokenizer for this case is a little bit larger since it also handles reading the hexadecimal notation.
int tok_U(tokenizer_char_stream str, UINT *res) {

  UINT acc = 0;
  int n = 0;

  // Check if hex notation is used
  if (peek(str,0) == '0' &&
      (peek(str,1) == 'x' || peek(str,1) == 'X')) {
    n += 2;
    while ( (peek(str,n) >= '0' && peek(str,n) <= '9') ||
            (peek(str,n) >= 'a' && peek(str,n) <= 'f') ||
            (peek(str,n) >= 'A' && peek(str,n) <= 'F')){
      UINT val;
      if (peek(str,n) >= 'a' && peek(str,n) <= 'f') {
        val = 10 + (peek(str,n) - 'a');
      } else if (peek(str,n) >= 'A' && peek(str,n) <= 'F') {
        val = 10 + (peek(str,n) - 'A');
      } else {
        val = peek(str,n) - '0';
      }
      acc = (acc * 0x10) + val;
      n++;
    }
    *res = acc;
    drop(str,n);
    return n;
  }

  // check if nonhex
  while ( peek(str,n) >= '0' && peek(str,n) <= '9' ){
    acc = (acc*10) + (peek(str,n) - '0');
    n++;
  }

  if (peek(str,n) == 'u' &&
      peek(str,n+1) == '3' &&
      peek(str,n+2) == '2') {
    *res = acc;
    drop(str,n+3);
    return n+3;
  }
  return 0;
}
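As a worked example of the hex branch: on the input 0x1A the loop starts at n = 2 and accumulates acc = 0*16 + 1 = 1, then acc = 1*16 + 10 = 26, after which the four characters are dropped from the stream.

UINT u;
int consumed = tok_U(str, &u);
// For a stream positioned at "0x1A": consumed == 4 and u == 26.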
Floating point numbers are identified by there being a decimal point somewhere within the number. So the tokenizer reads digits for as long as possible; at some point it should find a . and some following digits. Once the string has been identified as a float value, the characters involved are extracted into a buffer and converted to a float using strtod.
int tok_F(tokenizer_char_stream str, FLOAT *res) {

  int n = 0;
  int m = 0;
  char fbuf[256];

  while ( peek(str,n) >= '0' && peek(str,n) <= '9') n++;

  if ( peek(str,n) == '.') n++;
  else return 0;

  if ( !(peek(str,n) >= '0' && peek(str,n) <= '9')) return 0;
  while ( peek(str,n) >= '0' && peek(str,n) <= '9') n++;

  if (n > 255) m = 255;
  else m = n;

  int i;
  for (i = 0; i < m; i ++) {
    fbuf[i] = get(str);
  }
  fbuf[i] = 0;

  *res = strtod(fbuf, NULL);
  return n;
}
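A sketch of the expected behaviour: on the input 3.14 the function consumes four characters, while a bare 3. with no following digit is rejected, since at least one digit must come after the decimal point.

FLOAT f;
int consumed = tok_F(str, &f);
// For a stream positioned at "3.14": consumed == 4 and f == 3.14.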
The next_token function extracts a token from the stream if one is available. If the end of the stream is reached, a TOKENIZER_END token is returned.
Between two tokens, at most one whitespace character is ever required as a separator; there is never any need for more than one. So, before trying to fetch the next token, all heading whitespace on the stream is read out and discarded. Comments in the source are also treated just like whitespace.
Then each of the tokenizer functions (the try to tokenize functions) is applied, one after the other, in order of the kind and size of token they match. The first of these to return a value larger than 0 signals what token was available, and next_token returns. The order, from larger to smaller tokens, is important in the cases where different tokens share a valid initial sub-string. In these cases it is important to get the longest, most specific, match. For example, 3.14 would be mis-tokenized as the integer 3 if tok_i were tried before tok_F.
token next_token(tokenizer_char_stream str) {
  token t;

  INT i_val;
  UINT u_val;
  char c_val;
  FLOAT f_val;
  int n = 0;

  if (!more(str)) {
    t.type = TOKENIZER_END;
    return t;
  }

  // Eat whitespace and comments.
  bool clean_whitespace = true;
  while ( clean_whitespace ){
    if ( peek(str,0) == ';' ) {
      while ( more(str) && peek(str, 0) != '\n') {
        drop(str,1);
      }
    } else if ( isspace(peek(str,0))) {
      drop(str,1);
    } else {
      clean_whitespace = false;
    }
  }

  // Check for end of string again
  if (!more(str)) {
    t.type = TOKENIZER_END;
    return t;
  }

  n = 0;

  if ((n = tok_quote(str))) {
    t.type = TOKQUOTE;
    return t;
  }

  if ((n = tok_openpar(str))) {
    t.type = TOKOPENPAR;
    return t;
  }

  if ((n = tok_closepar(str))) {
    t.type = TOKCLOSEPAR;
    return t;
  }

  if ((n = tok_symbol(str, &t.data.text))) {
    t.text_len = n;
    t.type = TOKSYMBOL;
    return t;
  }

  if ((n = tok_char(str, &c_val))) {
    t.data.c = c_val;
    t.type = TOKCHAR;
    return t;
  }

  if ((n = tok_string(str, &t.data.text))) {
    t.text_len = n - 2;
    t.type = TOKSTRING;
    return t;
  }

  if ((n = tok_F(str, &f_val))) {
    t.data.f = f_val;
    t.type = TOKBOXEDFLOAT;
    return t;
  }

  if ((n = tok_U(str, &u_val))) {
    t.data.u = u_val;
    t.type = TOKBOXEDUINT;
    return t;
  }

  if ((n = tok_u(str, &u_val))) {
    t.data.u = u_val;
    t.type = TOKUINT;
    return t;
  }

  if ((n = tok_I(str, &i_val))) {
    t.data.i = i_val;
    t.type = TOKBOXEDINT;
    return t;
  }

  // Shortest form of integer match. Move to last in chain of numerical tokens.
  if ((n = tok_i(str, &i_val))) {
    t.data.i = i_val;
    t.type = TOKINT;
    return t;
  }

  t.type = TOKENIZER_ERROR;
  return t;
}
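Putting the pieces together, here is a minimal sketch (not part of tokpar.c; it assumes the definitions above plus stdio.h) that drives next_token over a plain string and prints the type of each token until the stream is exhausted:

void dump_tokens(char *source) {
  tokenizer_state ts = { .str = source, .pos = 0 };

  tokenizer_char_stream str;
  str.state = &ts;
  str.more = more_string;
  str.get  = get_string;
  str.peek = peek_string;
  str.drop = drop_string;

  token t = next_token(str);
  while (t.type != TOKENIZER_END && t.type != TOKENIZER_ERROR) {
    printf("token type: %u\n", t.type);
    // Symbol and string tokens carry heap-allocated text that must be freed.
    if (t.type == TOKSYMBOL || t.type == TOKSTRING) free(t.data.text);
    t = next_token(str);
  }
}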
The parser is, I believe, an example of a recursive descent parser. It is split up into a number of mutually recursive functions. The functions involved are: tokpar_parse (the entry point), parse_program, parse_sexp and parse_sexp_list. The sexp in these function names comes from s-expression, which is the name for the kind of nested, tree-structured, list-based expressions that lisps use.
The tokpar_parse function sets up a tokenizer state and a tokenizer character stream and then calls parse_program on that stream.
VALUE tokpar_parse(char *string) {

  tokenizer_state ts;
  ts.str = string;
  ts.pos = 0;

  tokenizer_char_stream str;
  str.state = &ts;
  str.more = more_string;
  str.peek = peek_string;
  str.drop = drop_string;
  str.get = get_string;

  return parse_program(str);
}
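Usage then looks something like the following sketch (assuming the symbol table and heap subsystems have been initialized elsewhere):

VALUE prg = tokpar_parse("(define apa 1) (+ apa 2)");
// prg is now a heap list holding the two parsed expressions.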
parse_program parses a sequence of expressions from the stream. It does this by calling parse_sexp on the tokenizer character stream, which, if successful, creates the heap representation of the first expression. It then recurses on the rest of the character stream and allocates a cons cell on the heap to combine the result of the recursive call with the result of the call to parse_sexp. If an error occurs or the stream ends, the function returns. Conceptually, the two-expression program 1 2 ends up on the heap as cons(1, cons(2, nil)).
VALUE parse_program(tokenizer_char_stream str) {
  token tok = next_token(str);
  VALUE head;
  VALUE tail;

  if (tok.type == TOKENIZER_ERROR) {
    return enc_sym(symrepr_rerror());
  }

  if (tok.type == TOKENIZER_END) {
    return enc_sym(symrepr_nil());
  }

  head = parse_sexp(tok, str);
  tail = parse_program(str);

  return cons(head, tail);
}
The parse_sexp function gets a token and a tokenizer character stream as input. It checks what token it got as input and then selects a case in a switch statement. One interesting case is when the token is an opening parenthesis, in which case we are dealing with a lisp list and the parse_sexp_list function should take over. In the other cases parse_sexp creates heap representations for the tokens.
VALUE parse_sexp(token tok, tokenizer_char_stream str) {

  VALUE v;
  token t;

  switch (tok.type) {
  case TOKENIZER_END:
    return enc_sym(symrepr_rerror());
  case TOKENIZER_ERROR:
    return enc_sym(symrepr_rerror());
  case TOKOPENPAR:
    t = next_token(str);
    return parse_sexp_list(t,str);
  case TOKSYMBOL: {
    UINT symbol_id;

    if (symrepr_lookup(tok.data.text, &symbol_id)) {
      v = enc_sym(symbol_id);
    }
    else if (symrepr_addsym(tok.data.text, &symbol_id)) {
      v = enc_sym(symbol_id);
    } else {
      v = enc_sym(symrepr_rerror());
    }
    free(tok.data.text);
    return v;
  }
  case TOKSTRING: {
    heap_allocate_array(&v, tok.text_len+1, VAL_TYPE_CHAR);
    array_t *arr = (array_t*)car(v);
    memset(arr->data.c, 0, (tok.text_len+1) * sizeof(char));
    memcpy(arr->data.c, tok.data.text, tok.text_len * sizeof(char));
    free(tok.data.text);
    return v;
  }
  case TOKINT:
    return enc_i(tok.data.i);
  case TOKUINT:
    return enc_u(tok.data.u);
  case TOKCHAR:
    return enc_char(tok.data.c);
  case TOKBOXEDINT:
    return set_ptr_type(cons(tok.data.i, enc_sym(DEF_REPR_BOXED_I_TYPE)), PTR_TYPE_BOXED_I);
  case TOKBOXEDUINT:
    return set_ptr_type(cons(tok.data.u, enc_sym(DEF_REPR_BOXED_U_TYPE)), PTR_TYPE_BOXED_U);
  case TOKBOXEDFLOAT:
    return set_ptr_type(cons(tok.data.u, enc_sym(DEF_REPR_BOXED_F_TYPE)), PTR_TYPE_BOXED_F);
  case TOKQUOTE: {
    t = next_token(str);
    VALUE quoted = parse_sexp(t, str);
    if (type_of(quoted) == VAL_TYPE_SYMBOL &&
        dec_sym(quoted) == symrepr_rerror()) return quoted;
    return cons(enc_sym(symrepr_quote()), cons (quoted, enc_sym(symrepr_nil())));
  }
  }
  return enc_sym(symrepr_rerror());
}
The parse_sexp_list function is conceptually similar to parse_program; it also parses the head part using parse_sexp and then parses the rest using parse_sexp_list. It generates cons cells on the heap to tie the results together. Differently from parse_program, parse_sexp_list is done when it encounters a closing parenthesis, in which case it returns a nil to close the list.
VALUE parse_sexp_list(token tok, tokenizer_char_stream str) {

  token t;
  VALUE head;
  VALUE tail;

  switch (tok.type) {
  case TOKENIZER_END:
    return enc_sym(symrepr_rerror());
  case TOKENIZER_ERROR:
    return enc_sym(symrepr_rerror());
  case TOKCLOSEPAR:
    return enc_sym(symrepr_nil());
  default:
    head = parse_sexp(tok, str);
    t = next_token(str);
    tail = parse_sexp_list(t, str);
    if ((type_of(head) == VAL_TYPE_SYMBOL &&
         dec_sym(head) == symrepr_rerror() ) ||
        (type_of(tail) == VAL_TYPE_SYMBOL &&
         dec_sym(tail) == symrepr_rerror() )) return enc_sym(symrepr_rerror());
    return cons(head, tail);
  }
  return enc_sym(symrepr_rerror());
}
That is it for parsing as it is done in lispBM. If you spot problems, I am very thankful to hear about them.