The THP programming language specification
This page (and, in the future, the pages that follow) defines the language.
THP is a strongly, statically typed programming language that is transpiled to PHP. It is designed to address PHP’s shortcomings, mainly by providing a better type system, better syntax and semantics, and better integration with tooling.
Compiler architecture
This is subject to change: the compiler is still being rewritten from Rust to Zig.
The compiler will have 5 phases:
- Lexical analysis: converts source code into a stream of tokens.
- Syntax analysis: converts a stream of tokens into an AST.
- Semantic analysis: performs type-checking and validations on the AST.
- IR transform: transforms the high-level THP AST into a lower-level representation and unfolds syntax sugar.
- Code generation: transforms the IR into PHP code.
Source code representation
Source code must be UTF-8 encoded. Any non-ASCII bytes appearing within a string literal in source code carry their UTF-8 meaning into the content of the string; the compiler does not modify the bytes.
Furthermore, THP only recognizes LF as a line terminator. Using CRLF leads to a compiler error.
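The two source rules above can be checked before lexing. The following is a minimal sketch in Python, not the compiler's actual code: the function name `check_source` and the error messages are hypothetical, and the handling of a lone CR (which this page does not specify) is left out.

```python
def check_source(data: bytes) -> str:
    """Decode THP source bytes, enforcing UTF-8 and rejecting CRLF.

    Hypothetical helper: the name and messages are illustrative only.
    """
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"source is not valid UTF-8 (byte offset {err.start})"
        ) from err

    # Only LF is a line terminator, so any CRLF pair is an error.
    crlf = text.find("\r\n")
    if crlf != -1:
        line = text.count("\n", 0, crlf) + 1
        raise ValueError(f"CRLF line terminator on line {line}")

    # The text is returned unchanged: string contents are not modified.
    return text
```

Since decoding and re-encoding valid UTF-8 is lossless, non-ASCII bytes inside string literals survive this check unchanged.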
Grammar syntax
This document uses a modified version of EBNF which allows the use of RegExp-like modifiers. An example is as follows:
; single line comments
literal = "a"
; ranges iterate over ASCII codepoints
range = "0".."9"
production_1 = character
concatenation = production_1, production_2
alternation = "a" | "b"
alternation_2 = "abc"
              | "jkl"
              | "xyz"
grouping = ("123", "456")
zero_or_one = production?
zero_or_more = production*
one_or_more = production+
Whitespace & Automatic Semicolon Insertion
Although not yet implemented, THP will not use semicolons as statement delimiters. Instead, newlines will serve as statement delimiters.
THP is whitespace insensitive. However, THP has special rules when handling statement termination in order to not use semicolons.
Certain statements have clearly defined markers of termination.
For example, an if statement always has braces {}, so the closing brace } is the terminator. The same applies to parentheses, square brackets, etc.
Other statements require an explicit terminator. For example, the assignment statement:
val computation = 123 + 456 // how to detect if the statement ends here
* 789 // or extends up to here?
In other languages a semicolon would be used to signal the end of the statement:
int computation = 123 + 456
* 789;
THP does not use semicolons. Instead, THP has 1 strict rule and 1 exception to the rule:
All statements end with a newline
Regardless of indentation or other whitespace, every statement ends with a newline.
val compute = 1 + 2 * 3 / 4
// statement ends here ↑
As mentioned before, this does not affect statements that have clear delimiters. For example, the following code will work as expected:
val compute = my_function(
param1,
param2,
) / 64
// ↑ statement ends here
In a way, the parentheses “disable” the rule.
But how can a statement span multiple lines?
Exception: operator on the next line.
If the next line begins with any operator, the statement of the previous line continues.
For example:
val computation = 123 + 456
* 789
// ↑ statement ends here, and there is a single statement
This holds regardless of indentation:
// weird indentation:
val computation = 123 + 456
- 789
// ↑ statement still ends here
What is important is that an operator begins the new line. If the operator is left on the previous line, this will not work:
// statement ends here ↓, and now there is a syntax error (dangling operator)
val computation = 123 + 456 *
789
// ↑ this is a different statement
For this, the parser must look ahead 1 token after each newline. This is the only place where the parser does so.
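The rule and its exception can be sketched over a token stream. This is a rough illustration in Python, not the parser's actual code. It assumes tokens are `(kind, text)` pairs whose kinds mirror the token names on this page, that comments have already been dropped, and that only parentheses and square brackets suspend the rule (brace-delimited blocks are out of scope here). `split_statements` and the `"Keyword"` kind are hypothetical names.

```python
OPENERS = {"LeftParen", "LeftBracket"}
CLOSERS = {"RightParen", "RightBracket"}


def split_statements(tokens):
    """Group token texts into statements using the newline rule.

    A newline ends the statement unless it sits inside a grouping,
    or the single next token is an operator.
    """
    statements, current, depth = [], [], 0
    for i, (kind, text) in enumerate(tokens):
        if kind in OPENERS:
            depth += 1
        elif kind in CLOSERS:
            depth -= 1

        if kind != "Newline":
            current.append(text)
            continue

        # Inside a grouping, the newline rule is "disabled".
        if depth > 0:
            continue

        # Look-ahead of exactly 1 token: an operator continues the statement.
        next_kind = tokens[i + 1][0] if i + 1 < len(tokens) else None
        if next_kind == "Operator":
            continue

        if current:
            statements.append(current)
            current = []

    if current:
        statements.append(current)
    return statements
```

With this sketch, a line that ends in a dangling operator still closes its statement, so the following line becomes a separate statement, matching the error case described above.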
Basic characters
newline = "\n"
character = '\0'..'\255' ; any single byte, not only ASCII
lowercase_letter = "a".."z"
uppercase_letter = "A".."Z"
underscore = "_"
dot = "."
comma = ","
decimal_digit = "0".."9"
binary_digit = "0" | "1"
octal_digit = "0".."7"
hex_digit = "0".."9" | "a".."f" | "A".."F"
operator_char = "+" | "-" | "=" | "*" | "!" | "/" | "|"
              | "@" | "#" | "$" | "~" | "%" | "&" | "?"
              | "<" | ">" | "^" | "." | ":"
Tokens
This is a summary of all tokens:
pub const TokenType = enum {
    Int,
    Float,
    Identifier,
    Datatype,
    Operator,
    Comment,
    String,
    // grouping signs
    LeftParen,
    RightParen,
    LeftBracket,
    RightBracket,
    LeftBrace,
    RightBrace,
    // punctuation that carries special meaning
    Comma,
    Newline,
    // Each keyword will have its own token
};
Number
A decimal integer cannot have a leading zero. For example, 0644 is a lexical error. Floating point numbers, however, can have leading zeros: 0.6782e+2.
In PHP, an integer with a leading zero is not a decimal number but an octal number, so in PHP 0644 === 420. To avoid any confusion, decimal numbers cannot have a leading zero. Instead, all octal numbers must begin with either 0o or 0O.
Number = Int | Float
Int = hexadecimal_number
    | octal_number
    | binary_number
    | decimal_number
hexadecimal_number = "0", ("x" | "X"), hex_digit+
octal_number = "0", ("o" | "O"), octal_digit+
binary_number = "0", ("b" | "B"), binary_digit+
decimal_number = "0" | ("1".."9", decimal_digit*)
Float = decimal_digit+, ".", decimal_digit+, scientific_notation?
      | decimal_digit+, scientific_notation
scientific_notation = "e", ("+" | "-"), decimal_digit+
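The number grammar can be approximated with regular expressions. The following Python sketch is an illustration, not the lexer's implementation. It assumes that a lone `0` is a valid decimal integer, and `classify_number` is a hypothetical helper name.

```python
import re

# Int: hexadecimal, octal, binary, or decimal without a leading zero.
# Assumption: a lone "0" is accepted as a decimal integer.
INT_RE = re.compile(r"0[xX][0-9a-fA-F]+|0[oO][0-7]+|0[bB][01]+|0|[1-9][0-9]*")

# Float: digits "." digits with optional exponent, or digits with exponent.
# As in the grammar, the exponent uses a lowercase "e" and a mandatory sign.
FLOAT_RE = re.compile(r"[0-9]+\.[0-9]+(?:e[+-][0-9]+)?|[0-9]+e[+-][0-9]+")


def classify_number(lexeme: str):
    """Return "Float", "Int", or None if the lexeme is not a valid number."""
    if FLOAT_RE.fullmatch(lexeme):
        return "Float"
    if INT_RE.fullmatch(lexeme):
        return "Int"
    return None
```

Under these patterns, `0644` is rejected, `0o644` is an Int, and `0.6782e+2` is a Float. A form such as `1e5` is also rejected, because the grammar requires a sign in the exponent.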
Identifier & Datatypes
Identifier = (underscore | lowercase_letter), identifier_letter*
identifier_letter = underscore | lowercase_letter | uppercase_letter | decimal_digit
Datatype = uppercase_letter, identifier_letter*
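The only difference between the two productions is the first character. A small Python sketch of that distinction follows. `classify_name` is a hypothetical helper, and keyword recognition (for example `val` and `var`) is deliberately left out.

```python
import re

# Identifier: starts with an underscore or a lowercase ASCII letter.
IDENTIFIER_RE = re.compile(r"[_a-z][_a-zA-Z0-9]*")
# Datatype: starts with an uppercase ASCII letter.
DATATYPE_RE = re.compile(r"[A-Z][_a-zA-Z0-9]*")


def classify_name(lexeme: str):
    """Return "Identifier", "Datatype", or None for anything else."""
    if IDENTIFIER_RE.fullmatch(lexeme):
        return "Identifier"
    if DATATYPE_RE.fullmatch(lexeme):
        return "Datatype"
    return None
```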
Operator
If 2 or more operator characters appear together, they count as a single operator. That is, +- always becomes a single token, not two separate + and - tokens. The lexer has no knowledge of any specific operator.
Operator = operator_char+
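This "longest run" behaviour can be sketched in a few lines of Python. The function `lex_operator` is a hypothetical name. The sketch also ignores comments, which a real lexer would have to tell apart from operators, since `/` is an operator character.

```python
# The characters listed in the operator_char production.
OPERATOR_CHARS = set("+-=*!/|@#$~%&?<>^.:")


def lex_operator(src: str, pos: int):
    """Consume the longest run of operator characters starting at pos.

    Returns (operator_text, next_position), or None if src[pos] is not
    an operator character. No table of known operators is consulted.
    """
    end = pos
    while end < len(src) and src[end] in OPERATOR_CHARS:
        end += 1
    if end == pos:
        return None
    return src[pos:end], end
```

Deciding whether a run such as `+-` is a meaningful operator is left to a later phase, since the lexer only groups characters.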
Comments
At this time, only single line comments are allowed.
Strings
Strings in THP only use double quotes.
As of the writing of this page, an escape sequence is a backslash followed by any byte except a newline.
String = double_quote, (escape_seq | string_char)*, double_quote
escape_seq = backslash, any_except_newline
double_quote = '"'
string_char = any_except_newline_and_double_quote
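A scanner for this production could look like the following Python sketch. It is an illustration under assumptions: it works on decoded characters rather than raw bytes, and `scan_string` and its error messages are hypothetical.

```python
def scan_string(src: str, pos: int) -> int:
    """Scan a string literal that starts at src[pos].

    Returns the index just past the closing double quote, or raises
    ValueError if the literal is malformed or unterminated.
    """
    if pos >= len(src) or src[pos] != '"':
        raise ValueError("string literal must start with a double quote")

    i = pos + 1
    while i < len(src):
        ch = src[i]
        if ch == '"':
            return i + 1  # closing quote found
        if ch == "\n":
            break  # a raw newline cannot appear inside a string
        if ch == "\\":
            # Escape sequence: backslash followed by anything but a newline.
            if i + 1 >= len(src) or src[i + 1] == "\n":
                break
            i += 2
            continue
        i += 1

    raise ValueError("unterminated string literal")
```

Because the escape is consumed as a pair, an escaped double quote does not close the string.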
Grouping signs
Each grouping sign has its own token.
LeftParen = "("
RightParen = ")"
LeftBracket = "["
RightBracket = "]"
LeftBrace = "{"
RightBrace = "}"
Syntax & AST
In this section of the grammar, plain strings are used instead of keyword productions.
Source file & modules
Each THP source file is a module.
Module = Statement*
Statement
For now there is only 1 type of statement.
Statement = VariableBinding
Variable binding
Variable bindings have 2 forms: immutable & mutable.
Immutable bindings use the val keyword; mutable bindings use var.
Bindings can have type annotations, placed between the keyword and the identifier.
If the binding is immutable and has a datatype, the val keyword can be dropped. Mutable bindings cannot drop the var keyword.
VariableBinding = ImmutableBinding | MutableBinding
ImmutableBinding = "val", Datatype?, Identifier, "=", Expression
                 | Datatype, Identifier, "=", Expression
MutableBinding = "var", Datatype?, Identifier, "=", Expression
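The three binding forms can be recognized with a very small parser. The Python sketch below is only an illustration of the productions above, not the compiler's parser. It takes an already split list of lexemes, treats the expression as a single number (with a lone `0` assumed valid), and uses the hypothetical names `Binding` and `parse_binding`.

```python
import re
from dataclasses import dataclass
from typing import Optional

KEYWORDS = ("val", "var")
DATATYPE = re.compile(r"[A-Z][_a-zA-Z0-9]*")
IDENTIFIER = re.compile(r"[_a-z][_a-zA-Z0-9]*")
NUMBER = re.compile(
    r"0[xX][0-9a-fA-F]+|0[oO][0-7]+|0[bB][01]+|0|[1-9][0-9]*"
    r"|[0-9]+\.[0-9]+(?:e[+-][0-9]+)?|[0-9]+e[+-][0-9]+"
)


@dataclass
class Binding:
    mutable: bool
    datatype: Optional[str]
    name: str
    value: str


def parse_binding(tokens: list) -> Binding:
    """Parse one variable binding from a list of lexemes."""
    pos = 0
    keyword = tokens[0] if tokens and tokens[0] in KEYWORDS else None
    if keyword is not None:
        pos += 1

    datatype = None
    if pos < len(tokens) and DATATYPE.fullmatch(tokens[pos]):
        datatype = tokens[pos]
        pos += 1

    # Only an immutable binding with a datatype may drop its keyword.
    if keyword is None and datatype is None:
        raise SyntaxError("a binding must start with val, var, or a datatype")

    rest = tokens[pos:]
    if (
        len(rest) != 3
        or rest[0] in KEYWORDS
        or not IDENTIFIER.fullmatch(rest[0])
        or rest[1] != "="
        or not NUMBER.fullmatch(rest[2])
    ):
        raise SyntaxError("expected: Identifier = Number")

    return Binding(keyword == "var", datatype, rest[0], rest[2])
```

For example, `parse_binding(["Int", "x", "=", "1"])` yields an immutable binding with datatype `Int`, while `["x", "=", "1"]` is rejected because neither a keyword nor a datatype is present.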
Expression
For now, the only expression recognized is a number.
Expression = Number