← All docs

The Lexer

How raw text becomes tokens.

Source: src/lexer/mod.rs, scan.rs, cursor.rs, token.rs, literals.rs

The lexer (also called tokenizer or scanner) is the first phase of compilation. It takes raw source text and breaks it into tokens — the smallest meaningful units of the language.

What is a token?

A token is a categorized piece of text. The lexer doesn’t care about structure (that’s the parser’s job) — it just identifies what each piece of text is:

<?php $x = 42 + $y;
OpenTag         "<?php"
Variable("x")   "$x"
Assign           "="
IntLiteral(42)   "42"
Plus             "+"
Variable("y")   "$y"
Semicolon        ";"
Eof

Some tokens carry data (the variable name, the number value). Others are just markers (semicolon, plus sign).

The Cursor

File: src/lexer/cursor.rs

The cursor is the lowest-level component. It walks through the source text one character at a time, tracking the current position:

pub struct Cursor<'a> {
    bytes: &'a [u8],    // the source text as bytes
    pos: usize,         // current byte offset
    line: usize,        // current line number (1-indexed)
    col: usize,         // current column (1-indexed)
}

It provides three essential operations:

MethodWhat it does
peek()Look at the current character without moving
advance()Move to the next character, return the one we just passed
remaining()Get the rest of the source as a string slice

The cursor automatically tracks line and column — when it sees a \n, it increments line and resets col to 1. This information is stored in a Span and attached to every token, so error messages can say “error at line 5, column 12”.

The Scanner

File: src/lexer/scan.rs

The scanner is the main loop. It uses the cursor to read characters and decides what token each sequence represents:

pub fn scan_tokens(source: &str) -> Result<Vec<(Token, Span)>, CompileError> {
    // 1. Skip whitespace
    // 2. Must start with <?php
    // 3. Loop: skip whitespace, look at next char, produce a token
    // 4. When EOF reached, push Eof token and return
}

The public lexer entry point used by the rest of the compiler is tokenize() in src/lexer/mod.rs, which wraps scan_tokens() and returns the final Vec<(Token, Span)>.

The scanning algorithm

For each token, the scanner looks at the current character and decides:

  1. " → Start of a double-quoted string. May contain interpolation ($var). Delegates to literals::scan_double_string_interpolated().

  2. ' → Start of a single-quoted string. No interpolation, only \\ and \' escapes.

  3. Digit → Start of a number. 0x / 0X introduces a hexadecimal integer literal; 0o / 0O an explicit octal (PHP 8.1+, digits 07); 0b / 0B a binary (PHP 5.4+, digits 01). Otherwise read decimal digits, with . followed by a digit promoting to a float, plus e / E for scientific notation. Integer literals that start with 0 and do not become floats are PHP legacy octal literals, so 0777 and 0_777 both produce decimal 511, while 078 is rejected. A single _ between digits acts as a visual separator (PHP 7.4+) and is stripped before parsing — separators may not appear at the start, end, or doubled. After a literal, any trailing alphanumeric character or _ is rejected (catches 0o78, 078, 0xfg, 0b12, 1_, 1__0).

  4. $ → Start of a variable. Read the name (letters, digits, underscores).

  5. Letter or _ → Start of an identifier or keyword. Read the full word, then check if it’s a keyword (if, while, echo, function, etc.) or a plain identifier (function name).

  6. Operator characters (+, -, *, /, =, <, >, !, ., %, &, |, ^, ~, ?, @) → Look ahead to handle multi-character operators (==, ===, !=, !==, <=, >=, <=>, <<, >>, &&, ||, **, ++, --, ?->, ??, ??=, |>, +=, -=, *=, **=, /=, .=, %=, &=, |=, ^=, <<=, >>=). Note that < may lead to <=, <=>, <<, <<=, or <<< (heredoc/nowdoc — see below).

  7. Structural characters ((, ), {, }, [, ], ;, ,, ?, :, \) → Single-character tokens. Note that ? followed by another ? produces ??, and ?? followed by = produces ??=. \ is tokenized separately so the parser can build qualified and fully-qualified namespace names.

Whitespace and comments

Before each token, the scanner skips:

  • Whitespace: spaces, tabs, newlines
  • Line comments: // through end of line
  • Block comments: /* through */

These are discarded entirely — they don’t produce tokens.

Token types

File: src/lexer/token.rs

The full set of tokens elephc recognizes:

Complete Token enum variants:

OpenTag  Semicolon  LParen  RParen  LBrace  RBrace  StringLiteral  IntLiteral
FloatLiteral  Variable  Identifier  Echo  If  IfDef  Else  ElseIf
While  For  Break  Continue  Function  Return  True  False
Null  Do  Foreach  As  Try  Catch  Finally  Throw
Extends  Implements  Interface  Abstract  Final  Inf  Nan  PhpIntMax
PhpIntMin  PhpFloatMax  MPi  ME  MSqrt2  MPi2  MPi4  MLog2e
MLog10e  PhpFloatMin  PhpFloatEpsilon  Print  Switch  Case  Default  Match
Include  IncludeOnce  Require  RequireOnce  Stdin  Stdout  Stderr  Fn
Use  Namespace  Const  Global  Static  Self_  Trait  Parent
InsteadOf  InstanceOf  PhpEol  PhpOs  DirectorySeparator  DunderDir  DunderFile  DunderLine  DunderFunction
DunderClass  DunderMethod  DunderNamespace  DunderTrait  Class  Enum  New  Public
Protected  Private  ReadOnly  This  Extern  Packed  Yield  AttrOpen  Assign  DoubleArrow
Plus  Minus  Star  StarStar  Slash  Percent  Dot  Comma
Backslash  LBracket  RBracket  Question  Colon  PlusAssign  MinusAssign  StarAssign
StarStarAssign  SlashAssign  DotAssign  PercentAssign  AmpAssign  PipeAssign  CaretAssign  LessLessAssign
GreaterGreaterAssign  PlusPlus  MinusMinus  AndAnd  OrOr  And  Or  Xor
Bang  EqualEqual  EqualEqualEqual  NotEqual  NotEqualEqual  Less  Greater  LessEqual
GreaterEqual  Spaceship  Ampersand  Pipe  Caret  Tilde  At  LessLess  GreaterGreater
Arrow  QuestionArrow  DoubleColon  QuestionQuestion  QuestionQuestionAssign  PipeArrow  Ellipsis  Eof

Literals

TokenExampleCarries
IntLiteral42, 0xFF, 0755, 0o755, 0b1010, 1_000_000i64 value (decimal, hex 0x/0X, legacy octal with leading 0, explicit octal 0o/0O, binary 0b/0B; _ allowed between digits)
FloatLiteral3.14, .5, 1e3, 1_000.5, 1e1_0f64 value (_ allowed between digits in mantissa or exponent)
StringLiteral"hello", 'world'String content (escapes resolved)

Variables and identifiers

TokenExampleCarries
Variable$x, $name, $argcName without the $
Identifierstrlen, my_funcThe name
TokenExampleCarries
This$thisSelf-reference inside a class method

Keywords

echo  if  else  elseif  while  do  for  foreach  as
break  continue  function  return  include  require
include_once  require_once  true  false  null  print
switch  case  default  match  try  catch  finally  throw  yield  fn  use  namespace  ifdef  extern  const
global  static  self  class  abstract  final  interface  trait  extends  implements  new
public  protected  private  readonly  parent  insteadof  instanceof  enum  packed

Each keyword is a distinct token variant (e.g., Token::If, Token::While, Token::Switch). Multi-word keyword spellings use camel-cased variants such as Token::IncludeOnce and Token::RequireOnce; readonly is Token::ReadOnly.

Constants (keyword tokens)

INF  NAN  PHP_INT_MAX  PHP_INT_MIN  PHP_FLOAT_MAX  PHP_FLOAT_MIN  PHP_FLOAT_EPSILON
M_PI  M_E  M_SQRT2  M_PI_2  M_PI_4  M_LOG2E  M_LOG10E
PHP_EOL  PHP_OS  DIRECTORY_SEPARATOR
STDIN  STDOUT  STDERR

These are recognized as distinct tokens by the lexer, not as identifiers. Their variants include forms such as Token::Inf, Token::Nan, Token::PhpIntMax, Token::PhpFloatEpsilon, and Token::DirectorySeparator.

Magic constants

__DIR__  __FILE__  __LINE__  __FUNCTION__
__CLASS__  __METHOD__  __NAMESPACE__  __TRAIT__

PHP magic constants are tokenized as Token::DunderDir, Token::DunderFile, Token::DunderLine, Token::DunderFunction, Token::DunderClass, Token::DunderMethod, Token::DunderNamespace, and Token::DunderTrait. They are matched case-insensitively, so __dir__, __DIR__, and __DiR__ all produce the same token.

Operators

+  -  *  **  /  %  .
=  =>  +=  -=  *=  **=  /=  .=  %=  &=  |=  ^=  <<=  >>=  ??=
==  ===  !=  !==  <  >  <=  >=  <=>
&&  ||  and  or  xor  !
instanceof
&  |  ^  ~  @  <<  >>
??  ->  ::
|>
++  --
...

Structural

(  )  {  }  [  ]  ;  ,  ?  :  \

Special

TokenMeaning
OpenTag<?php — required at the start of every file
AttrOpen#[ — starts a PHP 8 attribute group; bare # is a line comment
EofEnd of file — signals the parser to stop

String interpolation

Double-quoted strings can contain variables:

"Hello, $name!"

The lexer doesn’t produce a single string token for this. Instead, it emits a sequence that the parser can assemble:

StringLiteral("Hello, ")
Dot                        (implicit concatenation)
Variable("name")
Dot
StringLiteral("!")

This is handled by literals::scan_double_string_interpolated(), which walks through the string character by character, splitting it whenever it encounters a $ followed by a valid identifier.

Escape sequences

In double-quoted strings, the lexer resolves escape sequences during scanning:

EscapeBecomes
\nNewline (0x0A)
\tTab (0x09)
\rCarriage return (0x0D)
\\Literal backslash
\"Literal double quote
\$Literal dollar sign
\0Null byte (0x00)
\eEscape (0x1B)

Single-quoted strings only support \\ and \' — everything else is literal.

Heredoc and nowdoc

File: src/lexer/literals.rsscan_heredoc()

Heredoc and nowdoc are multi-line string syntaxes from PHP. The lexer recognizes <<< followed by a label:

$s = <<<EOT
Hello, $name!
Multiple lines here.
EOT;

The closing label must appear on its own line (optionally indented). The lexer handles three forms:

  • Heredoc (<<<LABEL): processes escape sequences (like double-quoted strings) and produces a StringLiteral token
  • Quoted heredoc (<<<"LABEL"): identical behavior to unquoted heredoc
  • Nowdoc (<<<'LABEL'): no escape processing (like single-quoted strings)

The scanner reads everything between the opening label line and the closing label, strips any common leading indentation (matching the closing label’s indent), and returns a single StringLiteral token. String interpolation within heredoc strings is handled the same way as in double-quoted strings.

Note that <<< is distinguished from << (shift left) by checking whether the third character is also <.

Error handling

If the scanner encounters something it can’t tokenize — like an unterminated string or an invalid character — it returns a CompileError with the exact position (line and column). The error message is formatted by src/errors/report.rs.

How it connects

The lexer’s output — Vec<(Token, Span)> — is the input to the parser. Every token carries its position in the source, so later phases can point errors back to the exact line and column.

Source text → Lexer → [(Token, Span), (Token, Span), ...] → Parser