The THP Language Specification
This series of pages defines the THP Programming Language.
THP’s grammar is context-dependent.
The syntax is specified using a mix of Extended Backus-Naur Form and regular expression notation:
; comments
syntax = concatenation
concatenation = production1, production2
alternation = "a" | "b"
| "c"
grouping = ("a", "b")
optional = "a"?
one_or_more = "a"+
zero_or_more = "a"*
range = "1".."9"
literal = "a"
Compiler architecture
The compiler consists of 5 common phases:
- Lexical Analysis: Transforms the source code into tokens
- Syntactic Analysis: Parses the tokens and generates an AST
- Semantic Analysis: Checks the AST structure and performs type checking
- IR: Transforms the THP AST into a PHP AST
- Codegen: Generates PHP source code from the PHP AST
Source Code representation
Source code is encoded in UTF-8, and each Unicode code point counts as a single character.
Basic characters
Although the source code must be encoded in UTF-8, most of the actual source code will use only the basic 128 ASCII characters. String contents may contain any Unicode code point.
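As an illustration of the difference between bytes and characters, here is a small Python sketch. The sample line and the variable names are made up for this example; it only shows that a code point outside ASCII takes several bytes in UTF-8 but still counts as one character.

```python
# Hypothetical THP-like line with a non-ASCII character inside a string.
source_bytes = 'val city = "Perú"'.encode("utf-8")

# Decode once; afterwards each element of the str is one Unicode code point.
source = source_bytes.decode("utf-8")

byte_count = len(source_bytes)  # "ú" occupies two bytes in UTF-8
char_count = len(source)        # but it is a single code point

# Everything outside the basic 128 ASCII characters.
non_ascii = [ch for ch in source if ord(ch) > 127]

print(byte_count, char_count, non_ascii)
```

A lexer that indexes the decoded text by code point, rather than by byte, matches the definition of "character" given above.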
underscore = "_"
decimal_digit = "0".."9"
binary_digit = "0" | "1"
octal_digit = "0".."7"
hex_digit = decimal_digit | "a".."f" | "A".."F"
lowercase_letter = "a".."z"
uppercase_letter = "A".."Z"
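The productions above translate directly into character predicates. The following Python sketch is one possible rendering, not the compiler's code; the function names simply mirror the production names, and `in_range` is a hypothetical helper standing in for the `..` range notation.

```python
def in_range(ch: str, low: str, high: str) -> bool:
    """Mirror of the grammar's `low..high` range over single characters."""
    return len(ch) == 1 and low <= ch <= high


def is_underscore(ch: str) -> bool:
    return ch == "_"


def is_decimal_digit(ch: str) -> bool:
    return in_range(ch, "0", "9")


def is_binary_digit(ch: str) -> bool:
    return ch in ("0", "1")


def is_octal_digit(ch: str) -> bool:
    return in_range(ch, "0", "7")


def is_hex_digit(ch: str) -> bool:
    return is_decimal_digit(ch) or in_range(ch, "a", "f") or in_range(ch, "A", "F")


def is_lowercase_letter(ch: str) -> bool:
    return in_range(ch, "a", "z")


def is_uppercase_letter(ch: str) -> bool:
    return in_range(ch, "A", "Z")
```

Explicit ranges are used instead of `str.isdigit` or `str.isalpha` because those built-ins also accept non-ASCII characters, which the productions above do not.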
Whitespace & Automatic semicolon insertion
This section is being reworked as part of the Zig rewrite of the compiler.
THP is whitespace insensitive. However, it has special rules for statement termination so that semicolons are not needed.
Certain statements have clearly defined markers of termination.
For example, an if statement always has braces {}, so the closing brace } is the terminator. The same applies to parentheses, square brackets, etc.
Other statements require an explicit terminator. For example, the assignment statement:
val computation = 123 + 456 // how to detect if the statement ends here
* 789 // or extends up to here?
In other languages a semicolon would be used to signal the end of the statement:
int computation = 123 + 456
* 789;
THP does not use semicolons. Instead, THP has one strict rule and one exception to that rule:
All statements end with a newline
Regardless of indentation or any other whitespace, every statement ends with a newline.
val compute = 1 + 2 * 3 / 4
// statement ends here ↑
As mentioned before, this does not affect statements that have clear delimiters. For example, the following code will work as expected:
val compute = my_function(
param1,
param2,
) / 64
// ↑ statement ends here
In a way, an open parenthesis “disables” the rule until it is closed.
But how can a statement span multiple lines?
Exception: operator on the next line.
If the next line begins with any operator, the statement of the previous line continues.
For example:
val computation = 123 + 456
* 789
// ↑ statement ends here, and there is a single statement
This holds regardless of the indentation:
// weird indentation:
val computation = 123 + 456
- 789
// ↑ statement still ends here
What is important is that an operator begins the new line. If the operator is left on the previous line, this will not work:
// statement ends here ↓, and now there is a syntax error (dangling operator)
val computation = 123 + 456 *
789
// ↑ this is a different statement
To support this, the parser must look ahead one token past the newline. This is the only place where the parser does so.
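The newline rule and its operator exception can be sketched in a few lines of Python. This is a simplified, line-based model, not the compiler's parser: `split_statements`, `strip_comment` and `OPERATOR_STARTS` are hypothetical names, the operator list is an assumption, only parentheses and square brackets are treated as delimiters, and string literals containing `//` or delimiters are not handled.

```python
# Assumed set of characters that can start an operator; the real list
# is defined by the THP grammar, not by this sketch.
OPERATOR_STARTS = ("+", "-", "*", "/", "%", ".", "<", ">", "=", "!", "&", "|")


def strip_comment(line: str) -> str:
    """Drop a trailing // comment (naive: ignores // inside string literals)."""
    return line.split("//", 1)[0].strip()


def split_statements(source: str) -> list[str]:
    """Group source lines into statements using the newline rule."""
    statements = []
    current = []  # lines of the statement being built
    depth = 0     # number of unclosed ( and [ delimiters

    for raw in source.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        # Look at the first token of the new line (the one-token look-ahead):
        # the statement continues inside open delimiters, or when the new
        # line begins with an operator.
        continues = depth > 0 or (bool(current) and line.startswith(OPERATOR_STARTS))
        if current and not continues:
            statements.append(" ".join(current))
            current = []
        current.append(line)
        depth += sum(line.count(c) for c in "([") - sum(line.count(c) for c in ")]")

    if current:
        statements.append(" ".join(current))
    return statements


print(split_statements("val computation = 123 + 456\n    * 789"))
print(split_statements("val computation = 123 + 456 *\n789"))
```

In this model a leading operator joins the line to the previous statement, while a trailing operator does not: the second call still yields two statements, the first of which ends in a dangling `*`.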
Old Whitespace rules
Under the old rules, THP is partially whitespace sensitive. It uses three tokens, Indent, Dedent and NewLine, to determine when an expression spans multiple lines.
The lexer stores the indentation level of every line and, when scanning the next line, compares the previous indentation to the new one. If the amount of whitespace is greater than before, it emits an Indent token. If it is lower, it emits a Dedent token, and if it is the same it emits nothing.
1 + 2
    + 3
    + 4
The previous code would emit the following tokens: 1, +, 2, NewLine, Indent, +, 3, NewLine, +, 4, Dedent.
Additionally, inconsistent indentation is a lexical error. The lexer stores all previous indentation levels in a stack, and reports an error if a decrease in indentation doesn’t match a previous level.
if true { // 0 indentation
    print() // 4 indentation
  print() // 2 indentation. Error. There is no 2-indentation level
}
All productions of the grammar ignore whitespace/indentation, except those involved in semicolon inference.
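The indentation tracking described in this section can be sketched as follows. This is an illustrative Python model, not the actual lexer: `lex_indentation` and `LexError` are hypothetical names, tokens are obtained by splitting on whitespace, only spaces count as indentation, and emitting one Dedent per closed level (including at end of input) is an assumption chosen to match the token listing above.

```python
class LexError(Exception):
    """Raised when a line's indentation matches no enclosing level."""


def lex_indentation(source: str) -> list[str]:
    tokens = []
    levels = [0]  # stack of enclosing indentation widths
    lines = [
        (number, text)
        for number, text in enumerate(source.splitlines(), start=1)
        if text.strip()  # blank lines carry no indentation information
    ]

    for position, (number, text) in enumerate(lines):
        width = len(text) - len(text.lstrip(" "))
        if width > levels[-1]:
            # Deeper than before: open a new level.
            levels.append(width)
            tokens.append("Indent")
        else:
            # Shallower than before: close levels until one matches.
            while width < levels[-1]:
                levels.pop()
                tokens.append("Dedent")
            if width != levels[-1]:
                raise LexError(f"line {number}: no {width}-indentation level")
        tokens.extend(text.split())
        if position < len(lines) - 1:
            tokens.append("NewLine")

    # Close any levels still open at the end of the input.
    tokens.extend("Dedent" for _ in levels[1:])
    return tokens


print(lex_indentation("1 + 2\n    + 3\n    + 4"))
```

Keeping the levels in a stack is what makes the error check possible: a decrease in indentation is only valid if it lands exactly on a width that is still on the stack.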
Statement termination / Semicolon inference
Whitespace is used to determine where a statement ends and a new one begins only inside a block of code. Everywhere else, whitespace is ignored.
Statements in THP end when a new line is encountered:
// The statement ends | here, on the newline
val value = (123 + 456) * 0.75
// Each line contains a different statement. They all end on their new lines
var a = 1 + 2 // a = 3
- 3 // this is not part of `a`, this is a different statement
This is true even if the line ends with an operator:
// These are still different statements
var a = 1 + 2 + // This is now a compile error, there is a hanging operator
3 // This is still a different statement
Parentheses
Exception 1: When a parenthesis is open, all following whitespace is ignored until the closing parenthesis.
// open parenthesis found, all whitespace is ignored until the closing
name.contains(
"weird"
)
However, for a parenthesis to take effect, it must be opened on the same line as the expression it belongs to.
// Still 2 statements, because the parenthesis is in a new line
print
(
"hello"
)
// Now it's one single statement
print(
"hello"
)
Indented binary operator
Exception 2:
- When a binary operator is followed by indentation:
val sum = 1 + 2 + // The line ends with a binary operator
    3 // There is indentation
- Or when indentation is followed by a binary operator:
val sum = 1 + 2
    + 3 // Indentation and a binary operator
In these cases, all whitespace is ignored until the indentation returns to the initial level.
// This method chain is a single statement because of the indentation
val person = PersonBuilder()
    .set_name("john")
    .set_lastname("doe")
    .set_age(32)
    .set_children(2)
    .build()
// Here indentation returns, and a new statement begins
print(person)