Source: src/lexer/ — mod.rs, scan.rs, cursor.rs, token.rs, literals.rs
The lexer (also called tokenizer or scanner) is the first phase of compilation. It takes raw source text and breaks it into tokens — the smallest meaningful units of the language.
What is a token?
A token is a categorized piece of text. The lexer doesn’t care about structure (that’s the parser’s job) — it just identifies what each piece of text is:
<?php $x = 42 + $y;
OpenTag "<?php"
Variable("x") "$x"
Assign "="
IntLiteral(42) "42"
Plus "+"
Variable("y") "$y"
Semicolon ";"
Eof
Some tokens carry data (the variable name, the number value). Others are just markers (semicolon, plus sign).
The Cursor
File: src/lexer/cursor.rs
The cursor is the lowest-level component. It walks through the source text one character at a time, tracking the current position:
pub struct Cursor<'a> {
bytes: &'a [u8], // the source text as bytes
pos: usize, // current byte offset
line: usize, // current line number (1-indexed)
col: usize, // current column (1-indexed)
}
It provides three essential operations:
| Method | What it does |
|---|---|
peek() | Look at the current character without moving |
advance() | Move to the next character, return the one we just passed |
remaining() | Get the rest of the source as a string slice |
The cursor automatically tracks line and column — when it sees a \n, it increments line and resets col to 1. This information is stored in a Span and attached to every token, so error messages can say “error at line 5, column 12”.
The Scanner
File: src/lexer/scan.rs
The scanner is the main loop. It uses the cursor to read characters and decides what token each sequence represents:
pub fn scan_tokens(source: &str) -> Result<Vec<(Token, Span)>, CompileError> {
// 1. Skip whitespace
// 2. Must start with <?php
// 3. Loop: skip whitespace, look at next char, produce a token
// 4. When EOF reached, push Eof token and return
}
The public lexer entry point used by the rest of the compiler is tokenize() in src/lexer/mod.rs, which wraps scan_tokens() and returns the final Vec<(Token, Span)>.
The scanning algorithm
For each token, the scanner looks at the current character and decides:
-
"→ Start of a double-quoted string. May contain interpolation ($var). Delegates toliterals::scan_double_string_interpolated(). -
'→ Start of a single-quoted string. No interpolation, only\\and\'escapes. -
Digit → Start of a number.
0x/0Xintroduces a hexadecimal integer literal;0o/0Oan explicit octal (PHP 8.1+, digits0–7);0b/0Ba binary (PHP 5.4+, digits0–1). Otherwise read decimal digits, with.followed by a digit promoting to a float, pluse/Efor scientific notation. Integer literals that start with0and do not become floats are PHP legacy octal literals, so0777and0_777both produce decimal511, while078is rejected. A single_between digits acts as a visual separator (PHP 7.4+) and is stripped before parsing — separators may not appear at the start, end, or doubled. After a literal, any trailing alphanumeric character or_is rejected (catches0o78,078,0xfg,0b12,1_,1__0). -
$→ Start of a variable. Read the name (letters, digits, underscores). -
Letter or
_→ Start of an identifier or keyword. Read the full word, then check if it’s a keyword (if,while,echo,function, etc.) or a plain identifier (function name). -
Operator characters (
+,-,*,/,=,<,>,!,.,%,&,|,^,~,?,@) → Look ahead to handle multi-character operators (==,===,!=,!==,<=,>=,<=>,<<,>>,&&,||,**,++,--,?->,??,??=,|>,+=,-=,*=,**=,/=,.=,%=,&=,|=,^=,<<=,>>=). Note that<may lead to<=,<=>,<<,<<=, or<<<(heredoc/nowdoc — see below). -
Structural characters (
(,),{,},[,],;,,,?,:,\) → Single-character tokens. Note that?followed by another?produces??, and??followed by=produces??=.\is tokenized separately so the parser can build qualified and fully-qualified namespace names.
Whitespace and comments
Before each token, the scanner skips:
- Whitespace: spaces, tabs, newlines
- Line comments:
//through end of line - Block comments:
/*through*/
These are discarded entirely — they don’t produce tokens.
Token types
File: src/lexer/token.rs
The full set of tokens elephc recognizes:
Complete Token enum variants:
OpenTag Semicolon LParen RParen LBrace RBrace StringLiteral IntLiteral
FloatLiteral Variable Identifier Echo If IfDef Else ElseIf
While For Break Continue Function Return True False
Null Do Foreach As Try Catch Finally Throw
Extends Implements Interface Abstract Final Inf Nan PhpIntMax
PhpIntMin PhpFloatMax MPi ME MSqrt2 MPi2 MPi4 MLog2e
MLog10e PhpFloatMin PhpFloatEpsilon Print Switch Case Default Match
Include IncludeOnce Require RequireOnce Stdin Stdout Stderr Fn
Use Namespace Const Global Static Self_ Trait Parent
InsteadOf InstanceOf PhpEol PhpOs DirectorySeparator DunderDir DunderFile DunderLine DunderFunction
DunderClass DunderMethod DunderNamespace DunderTrait Class Enum New Public
Protected Private ReadOnly This Extern Packed Yield AttrOpen Assign DoubleArrow
Plus Minus Star StarStar Slash Percent Dot Comma
Backslash LBracket RBracket Question Colon PlusAssign MinusAssign StarAssign
StarStarAssign SlashAssign DotAssign PercentAssign AmpAssign PipeAssign CaretAssign LessLessAssign
GreaterGreaterAssign PlusPlus MinusMinus AndAnd OrOr And Or Xor
Bang EqualEqual EqualEqualEqual NotEqual NotEqualEqual Less Greater LessEqual
GreaterEqual Spaceship Ampersand Pipe Caret Tilde At LessLess GreaterGreater
Arrow QuestionArrow DoubleColon QuestionQuestion QuestionQuestionAssign PipeArrow Ellipsis Eof
Literals
| Token | Example | Carries |
|---|---|---|
IntLiteral | 42, 0xFF, 0755, 0o755, 0b1010, 1_000_000 | i64 value (decimal, hex 0x/0X, legacy octal with leading 0, explicit octal 0o/0O, binary 0b/0B; _ allowed between digits) |
FloatLiteral | 3.14, .5, 1e3, 1_000.5, 1e1_0 | f64 value (_ allowed between digits in mantissa or exponent) |
StringLiteral | "hello", 'world' | String content (escapes resolved) |
Variables and identifiers
| Token | Example | Carries |
|---|---|---|
Variable | $x, $name, $argc | Name without the $ |
Identifier | strlen, my_func | The name |
Class-related keywords and special names
| Token | Example | Carries |
|---|---|---|
This | $this | Self-reference inside a class method |
Keywords
echo if else elseif while do for foreach as
break continue function return include require
include_once require_once true false null print
switch case default match try catch finally throw yield fn use namespace ifdef extern const
global static self class abstract final interface trait extends implements new
public protected private readonly parent insteadof instanceof enum packed
Each keyword is a distinct token variant (e.g., Token::If, Token::While, Token::Switch). Multi-word keyword spellings use camel-cased variants such as Token::IncludeOnce and Token::RequireOnce; readonly is Token::ReadOnly.
Constants (keyword tokens)
INF NAN PHP_INT_MAX PHP_INT_MIN PHP_FLOAT_MAX PHP_FLOAT_MIN PHP_FLOAT_EPSILON
M_PI M_E M_SQRT2 M_PI_2 M_PI_4 M_LOG2E M_LOG10E
PHP_EOL PHP_OS DIRECTORY_SEPARATOR
STDIN STDOUT STDERR
These are recognized as distinct tokens by the lexer, not as identifiers. Their variants include forms such as Token::Inf, Token::Nan, Token::PhpIntMax, Token::PhpFloatEpsilon, and Token::DirectorySeparator.
Magic constants
__DIR__ __FILE__ __LINE__ __FUNCTION__
__CLASS__ __METHOD__ __NAMESPACE__ __TRAIT__
PHP magic constants are tokenized as Token::DunderDir, Token::DunderFile, Token::DunderLine, Token::DunderFunction, Token::DunderClass, Token::DunderMethod, Token::DunderNamespace, and Token::DunderTrait. They are matched case-insensitively, so __dir__, __DIR__, and __DiR__ all produce the same token.
Operators
+ - * ** / % .
= => += -= *= **= /= .= %= &= |= ^= <<= >>= ??=
== === != !== < > <= >= <=>
&& || and or xor !
instanceof
& | ^ ~ @ << >>
?? -> ::
|>
++ --
...
Structural
( ) { } [ ] ; , ? : \
Special
| Token | Meaning |
|---|---|
OpenTag | <?php — required at the start of every file |
AttrOpen | #[ — starts a PHP 8 attribute group; bare # is a line comment |
Eof | End of file — signals the parser to stop |
String interpolation
Double-quoted strings can contain variables:
"Hello, $name!"
The lexer doesn’t produce a single string token for this. Instead, it emits a sequence that the parser can assemble:
StringLiteral("Hello, ")
Dot (implicit concatenation)
Variable("name")
Dot
StringLiteral("!")
This is handled by literals::scan_double_string_interpolated(), which walks through the string character by character, splitting it whenever it encounters a $ followed by a valid identifier.
Escape sequences
In double-quoted strings, the lexer resolves escape sequences during scanning:
| Escape | Becomes |
|---|---|
\n | Newline (0x0A) |
\t | Tab (0x09) |
\r | Carriage return (0x0D) |
\\ | Literal backslash |
\" | Literal double quote |
\$ | Literal dollar sign |
\0 | Null byte (0x00) |
\e | Escape (0x1B) |
Single-quoted strings only support \\ and \' — everything else is literal.
Heredoc and nowdoc
File: src/lexer/literals.rs — scan_heredoc()
Heredoc and nowdoc are multi-line string syntaxes from PHP. The lexer recognizes <<< followed by a label:
$s = <<<EOT
Hello, $name!
Multiple lines here.
EOT;
The closing label must appear on its own line (optionally indented). The lexer handles three forms:
- Heredoc (
<<<LABEL): processes escape sequences (like double-quoted strings) and produces aStringLiteraltoken - Quoted heredoc (
<<<"LABEL"): identical behavior to unquoted heredoc - Nowdoc (
<<<'LABEL'): no escape processing (like single-quoted strings)
The scanner reads everything between the opening label line and the closing label, strips any common leading indentation (matching the closing label’s indent), and returns a single StringLiteral token. String interpolation within heredoc strings is handled the same way as in double-quoted strings.
Note that <<< is distinguished from << (shift left) by checking whether the third character is also <.
Error handling
If the scanner encounters something it can’t tokenize — like an unterminated string or an invalid character — it returns a CompileError with the exact position (line and column). The error message is formatted by src/errors/report.rs.
How it connects
The lexer’s output — Vec<(Token, Span)> — is the input to the parser. Every token carries its position in the source, so later phases can point errors back to the exact line and column.
Source text → Lexer → [(Token, Span), (Token, Span), ...] → Parser