Can you compile PHP to a native binary?

Yes. elephc is a PHP compiler that takes a supported static subset of PHP and compiles it directly to native assembly, producing standalone binaries for macOS ARM64, Linux ARM64, and Linux x86_64. The compiled path uses no interpreter, VM, Zend Engine, or opcode fallback.

How is elephc different from PHP packagers or other PHP AOT systems?

elephc is not a PHP packager and does not bundle Zend/PHP. It is a from-scratch compiler: parse, type-check, lower to native assembly, assemble, link, and include a small runtime in the output. If elephc compiles it, it runs as native code.

Can elephc work with PHP frameworks?

Not a drop-in replacement for an entire dynamic framework today, but a path to accelerate static modules inside existing PHP applications. The roadmap is hybrid: keep frameworks such as WordPress, Laravel, or Symfony on PHP, then compile static and performance-sensitive modules into standalone binaries today and native libraries or PHP extensions in the future.

How does the elephc PHP compiler work?

elephc uses a six-phase compilation pipeline: Lexer (source to tokens), Parser (tokens to AST), Resolver (include/require), Type Checker (static type inference), Code Generator (AST to native assembly), and Linker (assembly to native executable). The compiler is written in Rust and targets macOS ARM64, Linux ARM64, and Linux x86_64.

What PHP features does elephc support?

elephc supports int, float, string, bool, null, and array types, along with control flow (if/else, while, for, foreach, switch, match), functions with recursion, 100+ built-in functions (string, array, math, I/O), type casting, string interpolation, include/require, and more.

Is elephc free and open source?

Yes, elephc is free and open-source software released under the MIT License. The source code is available on GitHub at github.com/illegalstudio/elephc.

The Lexer — elephc docs

Source: src/lexer/ — mod.rs, scan.rs, cursor.rs, token.rs, literals.rs

The lexer (also called tokenizer or scanner) is the first phase of compilation. It takes raw source text and breaks it into tokens — the smallest meaningful units of the language.

What is a token?

A token is a categorized piece of text. The lexer doesn’t care about structure (that’s the parser’s job) — it just identifies what each piece of text is:

<?php $x = 42 + $y;

OpenTag         "<?php"
Variable("x")   "$x"
Assign           "="
IntLiteral(42)   "42"
Plus             "+"
Variable("y")   "$y"
Semicolon        ";"
Eof

Some tokens carry data (the variable name, the number value). Others are just markers (semicolon, plus sign).

The Cursor

File: src/lexer/cursor.rs

The cursor is the lowest-level component. It walks through the source text one character at a time, tracking the current position:

pub struct Cursor<'a> {
    bytes: &'a [u8],    // the source text as bytes
    pos: usize,         // current byte offset
    line: usize,        // current line number (1-indexed)
    col: usize,         // current column (1-indexed)
}

It provides three essential operations:

Method	What it does
`peek()`	Look at the current character without moving
`advance()`	Move to the next character, return the one we just passed
`remaining()`	Get the rest of the source as a string slice

The cursor automatically tracks line and column — when it sees a \n, it increments line and resets col to 1. This information is stored in a Span and attached to every token, so error messages can say “error at line 5, column 12”.

The Scanner

File: src/lexer/scan.rs

The scanner is the main loop. It uses the cursor to read characters and decides what token each sequence represents:

pub fn scan_tokens(source: &str) -> Result<Vec<(Token, Span)>, CompileError> {
    // 1. Skip whitespace
    // 2. Must start with <?php
    // 3. Loop: skip whitespace, look at next char, produce a token
    // 4. When EOF reached, push Eof token and return
}

The public lexer entry point used by the rest of the compiler is tokenize() in src/lexer/mod.rs, which wraps scan_tokens() and returns the final Vec<(Token, Span)>.

The scanning algorithm

For each token, the scanner looks at the current character and decides:

" → Start of a double-quoted string. May contain interpolation ($var). Delegates to literals::scan_double_string_interpolated().
' → Start of a single-quoted string. No interpolation, only \\ and \' escapes.
Digit → Start of a number. 0x / 0X introduces a hexadecimal integer literal; 0o / 0O an explicit octal (PHP 8.1+, digits 0–7); 0b / 0B a binary (PHP 5.4+, digits 0–1). Otherwise read decimal digits, with . followed by a digit promoting to a float, plus e / E for scientific notation. Integer literals that start with 0 and do not become floats are PHP legacy octal literals, so 0777 and 0_777 both produce decimal 511, while 078 is rejected. A single _ between digits acts as a visual separator (PHP 7.4+) and is stripped before parsing — separators may not appear at the start, end, or doubled. After a literal, any trailing alphanumeric character or _ is rejected (catches 0o78, 078, 0xfg, 0b12, 1_, 1__0).
$ → Start of a variable. Read the name (letters, digits, underscores).
Letter or _ → Start of an identifier or keyword. Read the full word, then check if it’s a keyword (if, while, echo, function, etc.) or a plain identifier (function name).
Operator characters (+, -, *, /, =, <, >, !, ., %, &, |, ^, ~, ?, @) → Look ahead to handle multi-character operators (==, ===, !=, !==, <=, >=, <=>, <<, >>, &&, ||, **, ++, --, ?->, ??, ??=, |>, +=, -=, *=, **=, /=, .=, %=, &=, |=, ^=, <<=, >>=). Note that < may lead to <=, <=>, <<, <<=, or <<< (heredoc/nowdoc — see below).
Structural characters ((, ), {, }, [, ], ;, ,, ?, :, \) → Single-character tokens. Note that ? followed by another ? produces ??, and ?? followed by = produces ??=. \ is tokenized separately so the parser can build qualified and fully-qualified namespace names.

Whitespace and comments

Before each token, the scanner skips:

Whitespace: spaces, tabs, newlines
Line comments: // through end of line
Block comments: /* through */

These are discarded entirely — they don’t produce tokens.

Token types

File: src/lexer/token.rs

The full set of tokens elephc recognizes:

Complete Token enum variants:

OpenTag  Semicolon  LParen  RParen  LBrace  RBrace  StringLiteral  IntLiteral
FloatLiteral  Variable  Identifier  Echo  If  IfDef  Else  ElseIf
While  For  Break  Continue  Function  Return  True  False
Null  Do  Foreach  As  Try  Catch  Finally  Throw
Extends  Implements  Interface  Abstract  Final  Inf  Nan  PhpIntMax
PhpIntMin  PhpFloatMax  MPi  ME  MSqrt2  MPi2  MPi4  MLog2e
MLog10e  PhpFloatMin  PhpFloatEpsilon  Print  Switch  Case  Default  Match
Include  IncludeOnce  Require  RequireOnce  Stdin  Stdout  Stderr  Fn
Use  Namespace  Const  Global  Static  Self_  Trait  Parent
InsteadOf  InstanceOf  PhpEol  PhpOs  DirectorySeparator  DunderDir  DunderFile  DunderLine  DunderFunction
DunderClass  DunderMethod  DunderNamespace  DunderTrait  Class  Enum  New  Public
Protected  Private  ReadOnly  This  Extern  Packed  Yield  AttrOpen  Assign  DoubleArrow
Plus  Minus  Star  StarStar  Slash  Percent  Dot  Comma
Backslash  LBracket  RBracket  Question  Colon  PlusAssign  MinusAssign  StarAssign
StarStarAssign  SlashAssign  DotAssign  PercentAssign  AmpAssign  PipeAssign  CaretAssign  LessLessAssign
GreaterGreaterAssign  PlusPlus  MinusMinus  AndAnd  OrOr  And  Or  Xor
Bang  EqualEqual  EqualEqualEqual  NotEqual  NotEqualEqual  Less  Greater  LessEqual
GreaterEqual  Spaceship  Ampersand  Pipe  Caret  Tilde  At  LessLess  GreaterGreater
Arrow  QuestionArrow  DoubleColon  QuestionQuestion  QuestionQuestionAssign  PipeArrow  Ellipsis  Eof

Literals

Token	Example	Carries
`IntLiteral`	`42`, `0xFF`, `0755`, `0o755`, `0b1010`, `1_000_000`	`i64` value (decimal, hex `0x`/`0X`, legacy octal with leading `0`, explicit octal `0o`/`0O`, binary `0b`/`0B`; `_` allowed between digits)
`FloatLiteral`	`3.14`, `.5`, `1e3`, `1_000.5`, `1e1_0`	`f64` value (`_` allowed between digits in mantissa or exponent)
`StringLiteral`	`"hello"`, `'world'`	`String` content (escapes resolved)

Variables and identifiers

Token	Example	Carries
`Variable`	`$x`, `$name`, `$argc`	Name without the `$`
`Identifier`	`strlen`, `my_func`	The name

Token	Example	Carries
`This`	`$this`	Self-reference inside a class method

Keywords

echo  if  else  elseif  while  do  for  foreach  as
break  continue  function  return  include  require
include_once  require_once  true  false  null  print
switch  case  default  match  try  catch  finally  throw  yield  fn  use  namespace  ifdef  extern  const
global  static  self  class  abstract  final  interface  trait  extends  implements  new
public  protected  private  readonly  parent  insteadof  instanceof  enum  packed

Each keyword is a distinct token variant (e.g., Token::If, Token::While, Token::Switch). Multi-word keyword spellings use camel-cased variants such as Token::IncludeOnce and Token::RequireOnce; readonly is Token::ReadOnly.

Constants (keyword tokens)

INF  NAN  PHP_INT_MAX  PHP_INT_MIN  PHP_FLOAT_MAX  PHP_FLOAT_MIN  PHP_FLOAT_EPSILON
M_PI  M_E  M_SQRT2  M_PI_2  M_PI_4  M_LOG2E  M_LOG10E
PHP_EOL  PHP_OS  DIRECTORY_SEPARATOR
STDIN  STDOUT  STDERR

These are recognized as distinct tokens by the lexer, not as identifiers. Their variants include forms such as Token::Inf, Token::Nan, Token::PhpIntMax, Token::PhpFloatEpsilon, and Token::DirectorySeparator.

Magic constants

__DIR__  __FILE__  __LINE__  __FUNCTION__
__CLASS__  __METHOD__  __NAMESPACE__  __TRAIT__

PHP magic constants are tokenized as Token::DunderDir, Token::DunderFile, Token::DunderLine, Token::DunderFunction, Token::DunderClass, Token::DunderMethod, Token::DunderNamespace, and Token::DunderTrait. They are matched case-insensitively, so __dir__, __DIR__, and __DiR__ all produce the same token.

Operators

+  -  *  **  /  %  .
=  =>  +=  -=  *=  **=  /=  .=  %=  &=  |=  ^=  <<=  >>=  ??=
==  ===  !=  !==  <  >  <=  >=  <=>
&&  ||  and  or  xor  !
instanceof
&  |  ^  ~  @  <<  >>
??  ->  ::
|>
++  --
...

Structural

(  )  {  }  [  ]  ;  ,  ?  :  \

Special

Token	Meaning
`OpenTag`	`<?php` — required at the start of every file
`AttrOpen`	`#[` — starts a PHP 8 attribute group; bare `#` is a line comment
`Eof`	End of file — signals the parser to stop

String interpolation

Double-quoted strings can contain variables:

"Hello, $name!"

The lexer doesn’t produce a single string token for this. Instead, it emits a sequence that the parser can assemble:

StringLiteral("Hello, ")
Dot                        (implicit concatenation)
Variable("name")
Dot
StringLiteral("!")

This is handled by literals::scan_double_string_interpolated(), which walks through the string character by character, splitting it whenever it encounters a $ followed by a valid identifier.

Escape sequences

In double-quoted strings, the lexer resolves escape sequences during scanning:

Escape	Becomes
`\n`	Newline (0x0A)
`\t`	Tab (0x09)
`\r`	Carriage return (0x0D)
`\\`	Literal backslash
`\"`	Literal double quote
`\$`	Literal dollar sign
`\0`	Null byte (0x00)
`\e`	Escape (0x1B)

Single-quoted strings only support \\ and \' — everything else is literal.

Heredoc and nowdoc

File: src/lexer/literals.rs — scan_heredoc()

Heredoc and nowdoc are multi-line string syntaxes from PHP. The lexer recognizes <<< followed by a label:

$s = <<<EOT
Hello, $name!
Multiple lines here.
EOT;

The closing label must appear on its own line (optionally indented). The lexer handles three forms:

Heredoc (<<<LABEL): processes escape sequences (like double-quoted strings) and produces a StringLiteral token
Quoted heredoc (<<<"LABEL"): identical behavior to unquoted heredoc
Nowdoc (<<<'LABEL'): no escape processing (like single-quoted strings)

The scanner reads everything between the opening label line and the closing label, strips any common leading indentation (matching the closing label’s indent), and returns a single StringLiteral token. String interpolation within heredoc strings is handled the same way as in double-quoted strings.

Note that <<< is distinguished from << (shift left) by checking whether the third character is also <.

Error handling

If the scanner encounters something it can’t tokenize — like an unterminated string or an invalid character — it returns a CompileError with the exact position (line and column). The error message is formatted by src/errors/report.rs.

How it connects

The lexer’s output — Vec<(Token, Span)> — is the input to the parser. Every token carries its position in the source, so later phases can point errors back to the exact line and column.

Source text → Lexer → [(Token, Span), (Token, Span), ...] → Parser

The Lexer