The THP Language Specification
This series of pages defines the THP programming language.
THP's grammar is context-dependent.
The syntax is specified using a mix of Extended Backus-Naur Form (EBNF) and regular-expression notation:
; comments
syntax = concatenation
concatenation = alternation grouping
alternation = "a" | "b"
| "c"
grouping = ("a", "b")
optional = "a"?
one_or_more = "a"+
zero_or_more = "a"*
range = "1".."9"
literal = "a"
Compiler architecture
The compiler consists of 5 common phases:
- Lexical Analysis: Transforms the source code into tokens
- Syntactic Analysis: Parses the tokens and generates an AST
- Semantic Analysis: Checks the AST structure and performs type checking
- IR: Transforms the THP AST into a PHP AST
- Codegen: Generates PHP source code from the PHP AST
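The data flow between the five phases can be sketched as a chain of functions. This is only an illustration of the phase boundaries, not the actual compiler: every type and function name below (`Lexeme`, `ThpAst`, `PhpAst`, `lex`, `parse`, `check`, `lower`, `codegen`, `compile`) is hypothetical, the only supported "program" is a single integer literal, and the shape of the generated PHP is an assumption.

```rust
// Hypothetical placeholder types, one per phase boundary.
#[derive(Debug, PartialEq)]
enum Lexeme {
    Int(i64),
}

#[derive(Debug, PartialEq)]
enum ThpAst {
    Int(i64),
}

#[derive(Debug, PartialEq)]
enum PhpAst {
    Int(i64),
}

// Lexical analysis: source text -> tokens.
fn lex(source: &str) -> Result<Vec<Lexeme>, String> {
    source
        .split_whitespace()
        .map(|word| {
            word.parse::<i64>()
                .map(Lexeme::Int)
                .map_err(|e| format!("lexical error: {}", e))
        })
        .collect()
}

// Syntactic analysis: tokens -> THP AST.
fn parse(tokens: &[Lexeme]) -> Result<ThpAst, String> {
    match tokens {
        [Lexeme::Int(n)] => Ok(ThpAst::Int(*n)),
        _ => Err("syntax error: expected a single integer".to_string()),
    }
}

// Semantic analysis: an integer literal is always well typed in this sketch.
fn check(ast: ThpAst) -> Result<ThpAst, String> {
    Ok(ast)
}

// IR: THP AST -> PHP AST.
fn lower(ast: ThpAst) -> PhpAst {
    match ast {
        ThpAst::Int(n) => PhpAst::Int(n),
    }
}

// Codegen: PHP AST -> PHP source text (output shape is an assumption).
fn codegen(ast: &PhpAst) -> String {
    match ast {
        PhpAst::Int(n) => format!("<?php\n{};\n", n),
    }
}

// Runs the five phases in order, stopping at the first error.
fn compile(source: &str) -> Result<String, String> {
    let tokens = lex(source)?;
    let ast = parse(&tokens)?;
    let checked = check(ast)?;
    let php = lower(checked);
    Ok(codegen(&php))
}
```

Each phase only sees the output of the previous one, which is why a failure in an early phase prevents the later phases from running.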
Source Code representation
Source code is encoded in UTF-8, and a single Unicode code point counts as a single character. As THP is implemented in the Rust programming language, it follows Rust's rules for handling UTF-8.
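The difference between characters and bytes matters whenever positions in the source are counted. The following minimal sketch uses Rust's standard `str::chars` and `str::len`; the helper names `char_count` and `byte_count` are hypothetical and only serve the illustration.

```rust
/// Counts characters as described above: one per code point.
/// `str::chars` iterates over Unicode scalar values, not bytes.
fn char_count(source: &str) -> usize {
    source.chars().count()
}

/// Number of bytes the same text occupies when encoded as UTF-8.
fn byte_count(source: &str) -> usize {
    source.len()
}
```

For ASCII-only text both functions return the same number. They differ as soon as a string contains a code point outside ASCII, since UTF-8 encodes those with two to four bytes.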
Basic characters
Although the source code must be encoded in UTF-8, most of the actual source code will use only the basic 128 ASCII characters. String contents may contain any Unicode code point.
underscore = "_"
decimal_digit = "0".."9"
binary_digit = "0" | "1"
octal_digit = "0".."7"
hex_digit = decimal_digit | "a".."f" | "A".."F"
lowercase_letter = "a".."z"
uppercase_letter = "A".."Z"
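As an illustration, the character classes above translate almost one to one into Rust predicates. This is a sketch, not the compiler's actual code; the function names are hypothetical and simply mirror the production names.

```rust
fn is_underscore(c: char) -> bool {
    c == '_'
}

fn is_decimal_digit(c: char) -> bool {
    matches!(c, '0'..='9')
}

fn is_binary_digit(c: char) -> bool {
    matches!(c, '0' | '1')
}

fn is_octal_digit(c: char) -> bool {
    matches!(c, '0'..='7')
}

// hex_digit = decimal_digit | "a".."f" | "A".."F"
fn is_hex_digit(c: char) -> bool {
    is_decimal_digit(c) || matches!(c, 'a'..='f' | 'A'..='F')
}

fn is_lowercase_letter(c: char) -> bool {
    matches!(c, 'a'..='z')
}

fn is_uppercase_letter(c: char) -> bool {
    matches!(c, 'A'..='Z')
}
```

The ranges are deliberately ASCII-only, matching the productions: a letter such as `ñ` belongs to none of these classes.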
Whitespace
THP is partially whitespace-sensitive. The lexer emits three layout tokens, Indent, Dedent and NewLine, which are used to determine when an expression spans multiple lines.
The lexer stores the indentation level of every line and, when scanning the next line, compares the previous indentation to the new one. If the amount of whitespace is greater than before, it emits an Indent token. If it is lower, it emits a Dedent token. If it is the same, it emits nothing.
1 + 2
    + 3
    + 4
The previous code would emit the following tokens:
1
+
2
NewLine
Indent
+
3
NewLine
+
4
Dedent
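The token sequence above can be reproduced with a small sketch of the layout logic. It is not the real lexer: the `Token` enum and `lex_layout` function are hypothetical, ordinary tokens are simply whitespace-separated words, comments are not handled, and the indentation is assumed to be well formed. Emitting the remaining Dedent tokens at the end of the input is also an assumption, made to match the listing.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Word(String), // any non-layout token, kept as raw text in this sketch
    NewLine,
    Indent,
    Dedent,
}

fn lex_layout(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut levels = vec![0usize]; // open indentation levels, innermost last
    let mut first_line = true;

    for line in source.lines() {
        if line.trim().is_empty() {
            continue; // blank lines carry no layout information
        }
        // Indentation is measured as the number of leading whitespace bytes.
        let indent = line.len() - line.trim_start().len();

        if !first_line {
            tokens.push(Token::NewLine);
        }
        first_line = false;

        let current = *levels.last().unwrap();
        if indent > current {
            levels.push(indent);
            tokens.push(Token::Indent);
        } else {
            // One Dedent per level that this line closes.
            while indent < *levels.last().unwrap() {
                levels.pop();
                tokens.push(Token::Dedent);
            }
        }

        for word in line.split_whitespace() {
            tokens.push(Token::Word(word.to_string()));
        }
    }

    // Close every level that is still open at the end of the input.
    while levels.len() > 1 {
        levels.pop();
        tokens.push(Token::Dedent);
    }
    tokens
}
```

Calling `lex_layout("1 + 2\n    + 3\n    + 4")` yields the same eleven tokens as the listing, with the numbers and operators wrapped in `Token::Word`.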
Additionally, inconsistent indentation is a lexical error. The lexer stores all previous indentation levels in a stack, and reports an error if a decrease in indentation does not match a previous level.
if true { // 0 indentation
    print() // 4 indentation
  print() // 2 indentation. Error. There is no 2-indentation level
}
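The stack check can be sketched as follows. This is a minimal illustration under stated assumptions, not the lexer's real code: `IndentError` and `check_indentation` are hypothetical names, and indentation is measured as the number of leading whitespace bytes of each non-blank line.

```rust
/// Reported when a line's indentation matches no level on the stack.
#[derive(Debug, PartialEq)]
struct IndentError {
    line: usize,  // 1-based line number
    found: usize, // indentation width that matched no open level
}

fn check_indentation(source: &str) -> Result<(), IndentError> {
    let mut levels = vec![0usize]; // stack of open indentation levels

    for (index, line) in source.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let indent = line.len() - line.trim_start().len();
        let top = *levels.last().unwrap();

        if indent > top {
            // Deeper than before: open a new level.
            levels.push(indent);
        } else if indent < top {
            // Shallower: close levels until one is not deeper than this line.
            while *levels.last().unwrap() > indent {
                levels.pop();
            }
            // The level we land on must be exactly this indentation.
            if *levels.last().unwrap() != indent {
                return Err(IndentError { line: index + 1, found: indent });
            }
        }
    }
    Ok(())
}
```

For the example above, the stack is `[0, 4]` when the third line arrives with an indentation of 2. Popping 4 leaves 0 on top, which is not 2, so the sketch reports an error on line 3.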
All productions of the grammar ignore whitespace/indentation, except those involved in semicolon inference.
Statement termination / Semicolon inference
Whitespace is used to determine where a statement ends and a new one begins only inside a block of code. Everywhere else, whitespace is ignored.
Statements in THP end when a new line is encountered:
// The statement ends | here, on the newline
val value = (123 + 456) * 0.75
// Each line contains a different statement. They all end on their new lines
var a = 1 + 2 // a = 3
+ 3 // this is not part of `a`, this is a different statement
This is true even if the line ends with an operator:
// These are still different statements
var a = 1 + 2 + // This is now a compile error: there is a hanging `+`
3 // This is still a different statement
Parentheses
Exception 1: When a parenthesis is opened, all following whitespace is ignored until the matching closing parenthesis.
// open parenthesis found, all whitespace is ignored until the closing
name.contains(
"weird"
)
However, a parenthesis only takes effect if it is opened on the same line as the statement it belongs to.
// Still 2 statements, because the parenthesis is in a new line
print
(
"hello"
)
// Now it's one single statement
print(
"hello"
)
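One way to picture this rule is a parenthesis depth counter: a newline ends a statement only while the depth is zero. The sketch below makes several assumptions: `split_statements` is a hypothetical helper, it works on raw characters rather than tokens, and it does not account for parentheses inside string literals or comments.

```rust
fn split_statements(source: &str) -> Vec<String> {
    let mut statements = Vec::new();
    let mut current = String::new();
    let mut depth = 0usize; // number of currently open parentheses

    for c in source.chars() {
        match c {
            '(' => {
                depth += 1;
                current.push(c);
            }
            ')' => {
                depth = depth.saturating_sub(1);
                current.push(c);
            }
            // A newline outside of any parenthesis ends the statement.
            '\n' if depth == 0 => {
                let statement = current.trim();
                if !statement.is_empty() {
                    statements.push(statement.to_string());
                }
                current.clear();
            }
            // Everything else, including newlines inside parentheses,
            // stays part of the current statement.
            _ => current.push(c),
        }
    }

    let last = current.trim();
    if !last.is_empty() {
        statements.push(last.to_string());
    }
    statements
}
```

Because the counter is still zero when the newline after `print` is read, the first example above splits into two statements, while `print(` keeps the statement open until the closing parenthesis.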
Indented binary operator
Exception 2:
- When a binary operator is followed by indentation:
val sum = 1 + 2 + // The line ends with a binary operator
    3 // There is indentation
- Or when indentation is followed by a binary operator:
val sum = 1 + 2
    + 3 // Indentation and a binary operator
In these cases, all whitespace is ignored until the indentation returns to the initial level.
// This method chain is a single statement because of the indentation
val person = PersonBuilder()
    .set_name("john")
    .set_lastname("doe")
    .set_age(32)
    .set_children(2)
    .build()
// Here indentation returns, and a new statement begins
print(person)
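The second exception can be sketched line by line: an indented line joins the previous statement if the previous text ends with a binary operator, if the line itself starts with one, or if a continuation is already in progress. This is an illustration under assumptions rather than the real parser: `join_continuations` and its helpers are hypothetical, `BINARY_OPERATORS` is a small illustrative subset that treats `.` as a binary operator, and comments are not handled.

```rust
// Illustrative subset of binary operators; `.` is assumed to count as one.
const BINARY_OPERATORS: [&str; 4] = ["+", "-", "*", "."];

fn indent_of(line: &str) -> usize {
    line.len() - line.trim_start().len()
}

fn starts_with_operator(line: &str) -> bool {
    let text = line.trim_start();
    BINARY_OPERATORS.iter().any(|op| text.starts_with(*op))
}

fn ends_with_operator(line: &str) -> bool {
    let text = line.trim_end();
    BINARY_OPERATORS.iter().any(|op| text.ends_with(*op))
}

fn join_continuations(source: &str) -> Vec<String> {
    let mut statements: Vec<String> = Vec::new();
    let mut base_indent = 0; // indentation of the current statement's first line
    let mut continuing = false; // true once a continuation has started

    for line in source.lines().filter(|l| !l.trim().is_empty()) {
        let indented = indent_of(line) > base_indent;
        let joins = match statements.last() {
            Some(last) => {
                indented
                    && (continuing
                        || ends_with_operator(last)
                        || starts_with_operator(line))
            }
            None => false,
        };

        if joins {
            // Same statement: append the line, ignoring the line break.
            let last = statements.last_mut().unwrap();
            last.push(' ');
            last.push_str(line.trim());
            continuing = true;
        } else {
            // Indentation returned to the initial level, or no operator
            // was involved: a new statement begins.
            statements.push(line.trim().to_string());
            base_indent = indent_of(line);
            continuing = false;
        }
    }
    statements
}
```

With this sketch, the builder chain above collapses into one statement and `print(person)` becomes the second, while an unindented `+ 3` still starts a statement of its own.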