Lexical Structure

I wrote a lexer yesterday. It’s not very big.

slex1 :: Char -> String -> (STok, String)
slex1 c s =
  case (elem c soloists, lookup c openers, lookup c closers) of
    (True, _, _) -> (Solo c, s)
    (_, Just b, _)
      | (t, ‘|’ : u) <- span boring s -> (Open (b, Just t), u)
      | otherwise -> (Open (b, Nothing), s)
    (_, _, Just b) -> (Close (b, Nothing), s)
    _ | c == ‘ ‘ -> let (t, u) = span (== ‘ ‘) s in (Spc (1 + length t), u)
    _ | c == ‘|’ , (t, d : u) <- span boring s , Just b <- lookup d closers
      -> (Close (b, Just t), u)
    _ | c == ‘|’ -> (Solo c, s)
    _ -> let (t, u) = span boring s in (Sym (c : t), u)

Let me translate that into something a bit more human. I guess I should say that slex1 takes the first character of the input (which it will surely consume) separately from the rest of the input (which it may choose to consume) and returns the unconsumed input. Some characters, the soloists, form tokens by themselves. They are:

soloists = “\n\r\t,;!.”

Brackets are opened by (, [ or { and closed by }, ], or ). The soloists and bracket characters, together with space and | are considered ‘special’. Other characters are boring. Composite open brackets may be formed (foo|, [foo| or {foo|where foo is a possibly empty sequence of boring characters. The corresponding close brackets are |foo}, |foo], |foo). All other uses of | stand alone (so when an open is encountered, we look ahead for a |, and when a | is encountered, we look ahead for a close).

Contiguous boring characters form an individual token.

Doubtless, this wil become more complex in time (did somebody say ‘literals’?) but I reckon it’s good to go with more space and less fuss.

It’s possible that, in time, unicode will be tolerated. But let there be no presumption of unicode by the language definition. It’s just a supply of more boring characters. Of course, it would be nice to allow → for -> and that sort of thing. (We could even make the elaborator respond with unicode if it’s fed some, but keep ascii otherwise.) I routinely fail to read Agda files, to say nothing of the barfs that can happen when raw unicode shows up in LaTeX source. We cannot afford to teach only the early adopters.

By the way, lexing is total and invertible. We keep all layout information, as befits a transducer. The ‘annoying coauthor’ model does not extend to reformatting all your code for you by parsing and prettyprinting. In that respect, it will be a less annoying coauthor than me.

2D nonsense rides again!

Besides, if we want to weird people out, we can do it perfectly well in ascii. I’ve just rebuilt Epigram 1 style bracketing, except that this time it isn’t compulsory. After lexing, we scan the document, collapsing bracketed regions to a single document element with another document inside, moving from a sequence of tokens to a tree. That is, a document is a sequence of lines; a line is a sequence of elements; an element is a simple token or a bracket containing a document.

The special character ! may be used to suspend an open bracket, which may then be resumed, also by !, on the next line. So


    ( x, y : Nat  !
let !-------------!
    ! x + y : Nat )

is a bold way of saying

let (x, y : Nat --- x + y : Nat)

It’s quite easy to read the stream and generate the document. At any given point, there’s a tree of currently active brackets with text regions in between and the cursor is at such a region: yes, it’s the zipper, again. Ordinary tokens get added to the current region. An open bracket creates a new subnode and moves inside. A close bracket collapses the current region to an element and inserts it like an ordinary token in the enclosing region. A ! moves right into the next region if there is one, or out into the enclosing region if not. A newline inserts a newline at the cursor, then moves left past any suspended brackets in the current region (but not out of any unsuspended brackets).

We’ll support rule-style declarations (with or without 2D notation). Please note that horizontal alignment plays no role in this process. Regions are delimited by brackets and !s, however scraggily arranged. In Epigram 1, it was compulsory to close or suspend all brackets by the end of the physical line (and my dodgy editor enforced this policy): this is not the case here. A physical newline inserts a logical newline into whichever region is open at the time: we remain inside a bracket which has not been suspended.

So, that’s document structure and lexical structure…whatever shall we put in the middle?

By the way, the source for this lot is here, weighing in at 128 lines of totally undocumented code. The basic method is to build a structure editor for document-zippers, then drive it from the token stream.

One Response to “Lexical Structure”

  1. Adam M says:

    “2D nonsense rides again!” — wow, that is a simple and nifty idea for improving legibility! And without resorting to whitespace-sensitivity. Too bad it’s buried half-way in the middle of the blog post (which might cause people to skip it).

    If Coq supported that, I’d use it. But I won’t hold my breath :)

    It would also be nice if you could have multiple rules lined up horizontally, though that would necessitate using a different character for the left and right sides (rather than “!” for both).

    I know it sounds silly, but there’s something productivity-enhancing about being able to fit all of your type system’s rules on the screen at once.

Leave a Reply