The THP programming language specification

This page (and, in the future, the pages that follow it) defines the language.

THP is a strongly and statically typed programming language that is transpiled to PHP. It is designed to address PHP’s shortcomings, mainly by providing a better type system, better syntax and semantics, and better integration with tooling.

Compiler architecture

Warning

This is subject to change. We are still rewriting the compiler from Rust to Zig.

The compiler will have 5 phases:

Source code representation

Source code must be UTF-8 encoded. Any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string; the compiler does not modify these bytes.

Furthermore, THP only recognizes LF as a line terminator. Using CRLF will lead to a compiler error.
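The two requirements above can be checked before lexing starts. The following is a minimal Python sketch, not the compiler's actual implementation: the function name `check_source` and the use of `ValueError` are assumptions made for illustration.

```python
def check_source(raw: bytes) -> str:
    """Decode THP source bytes, rejecting invalid UTF-8 and CRLF terminators.

    Hypothetical helper: the real compiler may report these errors differently.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"source is not valid UTF-8: {err}") from err

    pos = text.find("\r\n")
    if pos != -1:
        # Count the LF characters before the offending CRLF to get a line number.
        line = text.count("\n", 0, pos) + 1
        raise ValueError(f"CRLF line terminator on line {line}; only LF is allowed")

    # Non-ASCII content (for example inside string literals) is returned untouched.
    return text
```

Calling `check_source(b"val x = 1\r\n")` raises an error that mentions line 1, while LF-terminated input is returned unchanged.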

Grammar syntax

This document uses a modified version of EBNF which allows the use of RegExp-like modifiers. An example is as follows:

; single line comments

literal       = "a"

; ranges iterate over ASCII codepoints
range         = "0".."9"

production_1  = character
concatenation = production_1, production_2

alternation   = "a" | "b"

alternation_2 = "abc"
              | "jkl"
              | "xyz"

grouping      = ("123", "456")

zero_or_one   = production?
zero_or_more  = production*
one_or_more   = production+

Whitespace & Automatic Semicolon Insertion

Although this is not yet implemented, THP will not use semicolons as statement delimiters. Instead, newlines will serve as statement delimiters.

THP is whitespace insensitive. However, THP has special rules when handling statement termination in order to not use semicolons.

Certain statements have clearly defined markers of termination. For example, an if statement always has braces {}, so the closing brace } is the terminator. The same applies to parentheses, square brackets, etc.

Other statements require an explicit terminator. For example, the assignment statement:

val computation = 123 + 456  // how to detect if the statement ends here
* 789                        // or extends up to here?

In other languages a semicolon would be used to signal the end of the statement:

int computation = 123 + 456
* 789;

THP does not use semicolons. Instead, THP has 1 strict rule and 1 exception to the rule:

All statements end with a newline

Regardless of indentation or other whitespace, every statement ends with a newline.

val compute = 1 + 2 * 3 / 4
// statement ends here     ↑

As mentioned before, this does not affect statements that have clear delimiters. For example, the following code will work as expected:

val compute = my_function(
  param1,
  param2,
) / 64
//    ↑ statement ends here

In a way, the parentheses “disable” the rule.

But how can a statement span multiple lines?

Exception: operator on the next line.

If the next line begins with any operator, the statement of the previous line continues.

For example:

val computation = 123 + 456
* 789
//   ↑ statement ends here, and there is a single statement

This is so no matter the indentation:

// weird indentation:

  val computation = 123 + 456

- 789
// ↑ statement still ends here

What is important is that an operator begins the new line. If the operator is left on the previous line, this will not work:

//       statement ends here ↓, and now there is a syntax error (dangling operator)
val computation = 123 + 456 *
789
// ↑ this is a different statement

To apply this exception, the parser must look ahead by 1 token after each newline. This is the only place where the parser does so.
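The rule and its exception can be sketched as a function that splits a flat token list into statements. This is an illustrative Python sketch under several assumptions: tokens are plain strings, a newline is the string "\n", and only parentheses and square brackets suspend the rule (braces are not tracked, since the statements inside a block still end with newlines). The names `is_operator` and `split_statements` are hypothetical.

```python
OPERATOR_CHARS = set("+-=*!/|@#$~%&?<>^.:")


def is_operator(tok: str) -> bool:
    """A token is an operator when it consists only of operator characters."""
    return len(tok) > 0 and all(c in OPERATOR_CHARS for c in tok)


def split_statements(tokens: list[str]) -> list[list[str]]:
    statements: list[list[str]] = []
    current: list[str] = []
    depth = 0  # nesting level of ( ) and [ ]

    for i, tok in enumerate(tokens):
        if tok == "\n":
            if depth > 0:
                continue  # newlines inside grouping signs never end a statement
            # Look ahead exactly 1 token: an operator continues the statement.
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and is_operator(nxt):
                continue
            if current:
                statements.append(current)
                current = []
            continue
        if tok in ("(", "["):
            depth += 1
        elif tok in (")", "]"):
            depth = max(depth - 1, 0)
        current.append(tok)

    if current:
        statements.append(current)
    return statements
```

With this sketch, an operator at the start of a line joins it to the previous statement, while an operator left at the end of a line produces two statements, the first of which ends in a dangling operator that a later parsing step would reject.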

Basic characters

newline = "\n"
character = '\0'..'\255' ; any ASCII character

lowercase_letter = "a".."z"
uppercase_letter = "A".."Z"
underscore       = "_"
dot              = "."
comma            = ","

decimal_digit    = "0".."9"
binary_digit     = "0" | "1"
octal_digit      = "0".."7"
hex_digit        = "0".."9" | "a".."f" | "A".."F"

operator_char    = "+" | "-" | "=" | "*" | "!" | "/" | "|"
                 | "@" | "#" | "$" | "~" | "%" | "&" | "?"
                 | "<" | ">" | "^" | "." | ":"

Tokens

This is a summary of all tokens:

pub const TokenType = enum {
    Int,
    Float,
    Identifier,
    Datatype,
    Operator,
    Comment,
    String,
    // grouping signs
    LeftParen,
    RightParen,
    LeftBracket,
    RightBracket,
    LeftBrace,
    RightBrace,
    // punctuation that carries special meaning
    Comma,
    Newline,
    // Each keyword will have its own token
};

Number

A decimal integer cannot have a leading zero. For example, 0644 is a lexical error. Floating point numbers, however, can have leading zeros: 0.6782e+2.

In PHP an integer with a leading zero is not a decimal number, it’s an octal number. So in PHP 0644 === 420. To avoid any confusion, decimal numbers cannot have a leading zero. Instead, all octal numbers must begin with either 0o or 0O.

Number = Int | Float
Int = hexadecimal_number
    | octal_number
    | binary_number
    | decimal_number

hexadecimal_number = "0", ("x" | "X"), hex_digit+
octal_number       = "0", ("o" | "O"), octal_digit+
binary_number      = "0", ("b" | "B"), binary_digit+
decimal_number     = "1".."9", decimal_digit*
Float = decimal_digit+, ".", decimal_digit+, scientific_notation?
      | decimal_digit+, scientific_notation

scientific_notation = "e", ("+" | "-"), decimal_digit+
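As an illustration, the productions above translate almost directly into regular expressions. The Python sketch below follows the grammar literally and is not the compiler's lexer; the name `classify_number` is an assumption. Note that, read literally, the productions require a sign in scientific notation and do not match a lone 0 as a decimal number.

```python
import re

# One pattern per production, following the grammar as written.
HEX = r"0[xX][0-9a-fA-F]+"
OCT = r"0[oO][0-7]+"
BIN = r"0[bB][01]+"
DEC = r"[1-9][0-9]*"
SCI = r"e[+-][0-9]+"  # the sign is mandatory in the grammar
FLOAT = rf"[0-9]+\.[0-9]+(?:{SCI})?|[0-9]+{SCI}"

INT_RE = re.compile(rf"(?:{HEX}|{OCT}|{BIN}|{DEC})")
FLOAT_RE = re.compile(rf"(?:{FLOAT})")


def classify_number(text: str):
    """Return "Int", "Float", or None when the text is not a valid number."""
    if FLOAT_RE.fullmatch(text):
        return "Float"
    if INT_RE.fullmatch(text):
        return "Int"
    return None
```

For example, `classify_number("0o644")` returns "Int" and `classify_number("0.6782e+2")` returns "Float", while `classify_number("0644")` returns None, which corresponds to the lexical error described above.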

Identifier & Datatypes

Identifier        = (underscore | lowercase_letter), identifier_letter*

identifier_letter = underscore | lowercase_letter | uppercase_letter | decimal_digit
Datatype = uppercase_letter, identifier_letter*
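The only difference between the two productions is the first character, which the following Python sketch makes explicit. The name `classify_word` is hypothetical, and the sketch ignores keywords: since each keyword is meant to have its own token, a real lexer would check for keywords before falling back to Identifier.

```python
import re

# Identifier: starts with an underscore or a lowercase letter.
IDENTIFIER_RE = re.compile(r"[_a-z][_a-zA-Z0-9]*")
# Datatype: starts with an uppercase letter.
DATATYPE_RE = re.compile(r"[A-Z][_a-zA-Z0-9]*")


def classify_word(text: str):
    """Return "Identifier", "Datatype", or None if the text matches neither."""
    if IDENTIFIER_RE.fullmatch(text):
        return "Identifier"
    if DATATYPE_RE.fullmatch(text):
        return "Datatype"
    return None
```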

Operator

If 2 or more operator chars appear together, they count as a single operator. That is, +- always becomes a single token, not two separate + and - tokens. The lexer has no knowledge of specific operators; it only groups operator chars.

Operator = operator_char+
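This grouping amounts to consuming the longest possible run of operator chars. A minimal Python sketch, assuming the lexer works on a string with an index (the name `lex_operator` is hypothetical, and the interaction with comments, which also begin with operator chars, is not handled here):

```python
OPERATOR_CHARS = frozenset("+-=*!/|@#$~%&?<>^.:")


def lex_operator(source: str, start: int):
    """Consume the longest run of operator chars beginning at `start`.

    Returns (operator_text, next_index), or None if no operator char is found.
    """
    end = start
    while end < len(source) and source[end] in OPERATOR_CHARS:
        end += 1
    if end == start:
        return None
    return source[start:end], end
```

Because the lexer does not know which operators exist, a sequence such as +- is returned as one token, and deciding whether that operator is valid is left to a later phase.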

Comments

At this time, only single line comments are allowed.

Strings

Strings in THP only use double quotes.

As of the writing of this page, an escape sequence is a backslash followed by any byte except a newline.

String = double_quote, (escape_seq | string_char)*, double_quote
escape_seq   = backslash, any_except_newline

double_quote = '"'
string_char  = any_except_newline_and_double_quote
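The productions above can be read as a small scanning loop. The Python sketch below is an assumption-laden illustration: it works on characters of a decoded string rather than on bytes, the name `lex_string` is hypothetical, and it returns the raw contents without interpreting escape sequences.

```python
def lex_string(source: str, start: int):
    """Scan a double-quoted string beginning at `start`.

    Returns (raw_contents, next_index). Escape sequences are kept as written.
    """
    if start >= len(source) or source[start] != '"':
        raise ValueError("a string must start with a double quote")

    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == '"':
            return source[start + 1:i], i + 1
        if ch == "\n":
            raise ValueError("newline inside string literal")
        if ch == "\\":
            # escape_seq: a backslash followed by anything except a newline
            if i + 1 >= len(source) or source[i + 1] == "\n":
                raise ValueError("incomplete escape sequence")
            i += 2
            continue
        i += 1
    raise ValueError("unterminated string literal")
```

An escaped double quote does not close the string, because the backslash and the character after it are consumed together.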

Grouping signs

Each grouping sign has its own token.

LeftParen = "("
RightParen = ")"
LeftBracket = "["
RightBracket = "]"
LeftBrace = "{"
RightBrace = "}"

Syntax & AST

In this section of the grammar, plain strings are used instead of keyword productions.

Source file & modules

Each THP source file is a module.

Module = Statement*

Statement

For now there is only 1 type of statement.

Statement = VariableBinding

Variable binding

Variable bindings have 2 forms: immutable and mutable. Immutable bindings use the val keyword; mutable bindings use var.

Bindings can have type annotations, placed between the keyword and the identifier.

If the binding is immutable and has a datatype, the val keyword can be dropped. Mutable bindings cannot drop the var keyword.

VariableBinding = ImmutableBinding | MutableBinding

ImmutableBinding = "val", Datatype?, Identifier, "=", Expression
                 |        Datatype,  Identifier, "=", Expression

MutableBinding   = "var", Datatype?, Identifier, "=", Expression
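To show how the three alternatives fit together, here is a minimal Python sketch of a parser for a single binding. It is not the compiler's parser. The token shape (a pair of kind and text), the `Binding` record, and the function names are assumptions made for illustration; keywords are given their own token kinds, and the expression is limited to a number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (kind, text); a hypothetical token shape


@dataclass
class Binding:
    mutable: bool
    datatype: Optional[str]
    identifier: str
    value: str  # raw text of the number expression


def _expect(tokens: List[Token], pos: int, kinds: Tuple[str, ...],
            text: Optional[str] = None) -> str:
    """Return the text of the token at `pos` if it has one of the given kinds."""
    if pos >= len(tokens):
        raise SyntaxError(f"expected {' or '.join(kinds)}, found end of input")
    kind, tok_text = tokens[pos]
    if kind not in kinds or (text is not None and tok_text != text):
        raise SyntaxError(f"expected {' or '.join(kinds)}, found {tok_text!r}")
    return tok_text


def parse_binding(tokens: List[Token]) -> Binding:
    pos = 0
    keyword = None
    if pos < len(tokens) and tokens[pos][0] in ("val", "var"):
        keyword = tokens[pos][0]
        pos += 1
    datatype = None
    if pos < len(tokens) and tokens[pos][0] == "Datatype":
        datatype = tokens[pos][1]
        pos += 1
    # Without a keyword, only the "Datatype, Identifier" immutable form is valid.
    if keyword is None and datatype is None:
        raise SyntaxError("a binding starts with 'val', 'var' or a datatype")
    identifier = _expect(tokens, pos, ("Identifier",))
    _expect(tokens, pos + 1, ("Operator",), "=")
    value = _expect(tokens, pos + 2, ("Int", "Float"))
    if pos + 3 != len(tokens):
        raise SyntaxError("unexpected tokens after the expression")
    return Binding(keyword == "var", datatype, identifier, value)
```

A binding without a keyword is accepted only when it starts with a datatype and is then treated as immutable, mirroring the second alternative of ImmutableBinding.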

Expression

For now, the only expression recognized is a number.

Expression = Number