Fork me on GitHub
Documentation: Index > Hacking > Writing A Language… > Complex Scanner

Complex hand-written scanners

Hand written scanners should subclass LumiousScanner.

Scanning occurs in the main() method. There are two things you have to worry about:

  1. advancing the scan pointer, which is done by calls to scan(), get(), etc
  2. recording the string segments you're matching as their relevant tokens. This is done by calling record($string, $token_name, $escaped?=false)
By the time you exit main(), the string should have been fully recorded. main() doesn't return anything.

Imagine in your langauge you need to keep track of a context (state) by tracking curly braces. The basic workflow looks something like this:

class MyScanner extends LuminousScanner {

  function init() {
    // set up any last-minute stuff in here

  function main() {
    while (!$this->eos()) {
      if ($this->scan('/some_pattern/') !== null) {
        $this->record($this->match(), 'TOKEN_NAME');
      elseif($this->scan('/some_other_pattern/') !== null ) {
        $this->record($this->match(), 'SOME_OTHER_TOKEN');
      else { // ensure we advance the scan pointer
        $this->record($this->get(), null);

Obviously, to make it worthwhile to use an explictly written scanner, you will be evaluating quite a lot of logic inbetween the calls to scan().

Examples: languages/json.php is a fairly simple scanner which implements its own loop explicitly and records stack based state information.