Aether Lexer Documentation

Overview
Architecture
Components
Tokenization Process
Examples
Error Handling
Testing
Performance Considerations
Usage
Implementation Notes
Integration with Parser
Common Tasks
References

Overview

The lexer (lexical analyzer) is the first phase of the Aether interpreter. It converts raw source code into a stream of tokens that can be parsed.

Location: src/lexer/

Status: ✅ Complete (14 unit tests passing)

Architecture

Source Code (.ae file)
        ↓
    Scanner
        ↓
   Token Stream
        ↓
    To Parser

Components

1. Token (`token.rs`)

Represents a single lexical unit in the source code.

TokenKind Enum

All token types supported by Aether:

Literals:

Integer(i64) - Integer literal (e.g., 42)
Float(f64) - Float literal (e.g., 3.14)
String(String) - String literal (e.g., "hello")
True - Boolean true
False - Boolean false
Null - Null value

Keywords:

Let - Variable declaration
Fn - Function declaration
Return - Return statement
If, Else - Conditionals
While, For, In - Loops
Break, Continue - Loop control
Import, From, As - Module system
Struct - Struct declaration
Async, Await - Async/await
Try, Catch, Throw, Finally - Error handling

Operators:

Arithmetic: Plus, Minus, Star, Slash, Percent
Assignment: Equal, PlusEqual, MinusEqual, StarEqual, SlashEqual
Comparison: EqualEqual, NotEqual, Less, Greater, LessEqual, GreaterEqual
Logical: And, Or, Not
Null coalescing: QuestionQuestion (??)
Optional chaining: QuestionDot (?.)
Spread: Spread (...)

Delimiters:

LeftParen, RightParen - (, )
LeftBrace, RightBrace - {, }
LeftBracket, RightBracket - [, ]
Comma, Dot, Colon - ,, ., :

Special:

Identifier(String) - Identifier (e.g., x)
Newline - Line break
Eof - End of file

Token Struct

pub struct Token {
    pub kind: TokenKind,      // Type of token
    pub lexeme: String,       // Original text
    pub line: usize,          // Line number (1-indexed)
    pub column: usize,        // Column number (1-indexed)
}

Position tracking enables helpful error messages with exact locations.

2. Scanner (`scanner.rs`)

The main lexer implementation that tokenizes source code.

Scanner Struct

pub struct Scanner {
    source: Vec<char>,        // Source code as characters
    tokens: Vec<Token>,       // Accumulated tokens
    start: usize,             // Start of current token
    current: usize,           // Current position
    line: usize,              // Current line
    column: usize,            // Current column
}

Main Method

pub fn scan_tokens(&mut self) -> Result<Vec<Token>, LexerError>

Scans the entire source and returns either all tokens, ending with Eof, or a LexerError.

Error Types

pub enum LexerError {
    UnexpectedCharacter(char, usize, usize),
    UnterminatedString(usize, usize),
    InvalidNumber(String, usize, usize),
}

Each error includes position information for debugging.
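The messages listed under Error Handling suggest that each variant carries its payload followed by line and column. A minimal sketch of a Display implementation that would produce those messages is shown here; the derives and the exact wording are assumptions taken from the examples in this document, not from scanner.rs.

```rust
use std::fmt;

// Mirrors the enum shown above: (payload, line, column).
// The derives are an assumption added so the sketch is easy to test.
#[derive(Debug, Clone, PartialEq)]
pub enum LexerError {
    UnexpectedCharacter(char, usize, usize),
    UnterminatedString(usize, usize),
    InvalidNumber(String, usize, usize),
}

impl fmt::Display for LexerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LexerError::UnexpectedCharacter(c, line, column) => {
                write!(f, "Unexpected character '{}' at line {}, column {}", c, line, column)
            }
            LexerError::UnterminatedString(line, column) => {
                write!(f, "Unterminated string at line {}, column {}", line, column)
            }
            LexerError::InvalidNumber(text, line, column) => {
                write!(f, "Invalid number '{}' at line {}, column {}", text, line, column)
            }
        }
    }
}

// Lets the error be used with `?` in functions returning Box<dyn Error>.
impl std::error::Error for LexerError {}
```

With an implementation along these lines, eprintln!("Lexer error: {}", error) prints the human-readable message rather than the Debug form.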

Tokenization Process

1. Character-by-Character Scanning

The scanner reads one character at a time and determines what token to create:

fn scan_token(&mut self) -> Result<(), LexerError> {
    let c = self.advance();
    match c {
        ' ' | '\r' | '\t' => {} // Skip whitespace
        '\n' => { /* Track newlines */ }
        '(' => self.add_token(TokenKind::LeftParen),
        // ... more cases
    }
    Ok(()) // Token handled without error
}

2. Number Tokenization

Integer: Sequence of digits

42 → Integer(42)

Float: Digits with decimal point

3.14 → Float(3.14)

Process:

Consume all digits
Check for decimal point followed by digits
Parse as i64 or f64
Return error if parsing fails
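The four steps above can be sketched as a standalone function. This is an illustration only: scan_number, the Number enum and the String error type are hypothetical names, and the real number() method works on the Scanner's own fields.

```rust
// Hypothetical stand-in for the Integer/Float variants of TokenKind.
#[derive(Debug, PartialEq)]
enum Number {
    Integer(i64),
    Float(f64),
}

/// Scans a number starting at `start`. Returns the value and the index just
/// past it, or the offending text if parsing fails.
fn scan_number(source: &[char], start: usize) -> Result<(Number, usize), String> {
    let mut current = start;

    // Step 1: consume all digits.
    while current < source.len() && source[current].is_ascii_digit() {
        current += 1;
    }

    // Step 2: a '.' belongs to the number only when a digit follows it.
    let mut is_float = false;
    if current + 1 < source.len()
        && source[current] == '.'
        && source[current + 1].is_ascii_digit()
    {
        is_float = true;
        current += 1; // consume '.'
        while current < source.len() && source[current].is_ascii_digit() {
            current += 1;
        }
    }

    // Steps 3 and 4: parse as i64 or f64 and report the text on failure.
    let text: String = source[start..current].iter().collect();
    let number = if is_float {
        text.parse::<f64>().map(Number::Float).map_err(|_| text.clone())?
    } else {
        text.parse::<i64>().map(Number::Integer).map_err(|_| text.clone())?
    };
    Ok((number, current))
}
```

In this sketch, parsing fails when an integer literal does not fit in an i64, which is one way the InvalidNumber error could arise.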

3. String Tokenization

Syntax: Text enclosed in double quotes "text"

Features:

Escape sequences: \n, \t, \\, \"
Multi-line strings supported
UTF-8 encoding

Process:

Consume characters until closing "
Process escape sequences
Return error if unterminated

Example:

"hello\nworld" → String("hello\nworld")
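The string process can be sketched in the same standalone style. scan_string is a hypothetical helper, and passing unknown escapes through unchanged is an assumption; the document does not say how the real string() method treats them.

```rust
/// Scans a string literal whose opening quote is at `start`.
/// Returns the decoded value and the index just past the closing quote,
/// or `None` if the input ends before the string is closed.
fn scan_string(source: &[char], start: usize) -> Option<(String, usize)> {
    let mut value = String::new();
    let mut current = start + 1; // skip the opening quote

    while current < source.len() {
        match source[current] {
            '"' => return Some((value, current + 1)),
            '\\' => {
                // Look at the character after the backslash.
                let escaped = *source.get(current + 1)?;
                value.push(match escaped {
                    'n' => '\n',
                    't' => '\t',
                    '\\' => '\\',
                    '"' => '"',
                    other => other, // assumption: unknown escapes pass through
                });
                current += 2;
            }
            c => {
                value.push(c); // raw newlines are kept, so strings may span lines
                current += 1;
            }
        }
    }
    None // reached the end of input without a closing quote
}
```

A caller would map the None case to LexerError::UnterminatedString with the position of the opening quote.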

4. Identifier and Keyword Tokenization

Identifiers: Start with letter or underscore, followed by alphanumeric or underscore

Process:

Consume all alphanumeric/underscore characters
Check if it’s a keyword
Return keyword token or identifier token

Keywords Map (complete list):

match text.as_str() {
    "let" => TokenKind::Let,
    "fn" => TokenKind::Fn,
    "if" => TokenKind::If,
    "else" => TokenKind::Else,
    "while" => TokenKind::While,
    "for" => TokenKind::For,
    "in" => TokenKind::In,
    "return" => TokenKind::Return,
    "break" => TokenKind::Break,
    "continue" => TokenKind::Continue,
    "import" => TokenKind::Import,
    "from" => TokenKind::From,
    "as" => TokenKind::As,
    "struct" => TokenKind::Struct,
    "async" => TokenKind::Async,
    "await" => TokenKind::Await,
    "try" => TokenKind::Try,
    "catch" => TokenKind::Catch,
    "throw" => TokenKind::Throw,
    "finally" => TokenKind::Finally,
    "true" => TokenKind::True,
    "false" => TokenKind::False,
    "null" => TokenKind::Null,
    _ => TokenKind::Identifier(text)
}
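A self-contained sketch of the identifier process follows, using a reduced TokenKind with three keywords so that it compiles on its own. The helper names are hypothetical, and restricting identifiers to ASCII letters and digits is an assumption; the real scanner may accept a wider set.

```rust
// Reduced, hypothetical subset of TokenKind for illustration.
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Let,
    Fn,
    Return,
    Identifier(String),
}

fn is_identifier_start(c: char) -> bool {
    c.is_ascii_alphabetic() || c == '_'
}

fn is_identifier_part(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

/// Keyword lookup; returns None for ordinary identifiers.
fn keyword(text: &str) -> Option<TokenKind> {
    match text {
        "let" => Some(TokenKind::Let),
        "fn" => Some(TokenKind::Fn),
        "return" => Some(TokenKind::Return),
        _ => None,
    }
}

/// Scans an identifier or keyword starting at `start`. Returns the token kind
/// and the index just past it, or None if no identifier starts there.
fn scan_identifier(source: &[char], start: usize) -> Option<(TokenKind, usize)> {
    if !is_identifier_start(*source.get(start)?) {
        return None;
    }
    // Consume all alphanumeric/underscore characters.
    let mut current = start + 1;
    while current < source.len() && is_identifier_part(source[current]) {
        current += 1;
    }
    // Keywords are matched on the whole text, so "letter" stays an identifier.
    let text: String = source[start..current].iter().collect();
    let kind = keyword(&text).unwrap_or_else(|| TokenKind::Identifier(text));
    Some((kind, current))
}
```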

5. Operator Tokenization

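Single-character operators: Operators with no longer form, such as %, become a token immediately.

Multi-character operators: The scanner looks ahead one character to choose between the short and the long form

= → Equal or == → EqualEqual
! → Not or != → NotEqual
+ → Plus or += → PlusEqual (likewise -, * and / with -=, *= and /=)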

Example:

if self.match_char('=') {
    TokenKind::PlusEqual  // +=
} else {
    TokenKind::Plus       // +
}
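The snippet above relies on match_char, which consumes the next character only when it equals the expected one. A minimal sketch of how such a helper might look, assuming a Vec<char> source and an index; Cursor, Op and scan_operator are hypothetical names for this illustration.

```rust
// Hypothetical, reduced stand-in for the Scanner's source and position.
struct Cursor {
    source: Vec<char>,
    current: usize,
}

#[derive(Debug, PartialEq)]
enum Op {
    Plus,
    PlusEqual,
    Equal,
    EqualEqual,
    Not,
    NotEqual,
}

impl Cursor {
    fn new(text: &str) -> Self {
        Cursor { source: text.chars().collect(), current: 0 }
    }

    /// Consumes and returns the next character, if any.
    fn advance(&mut self) -> Option<char> {
        let c = self.source.get(self.current).copied()?;
        self.current += 1;
        Some(c)
    }

    /// Consumes the next character only if it equals `expected`.
    fn match_char(&mut self, expected: char) -> bool {
        if self.source.get(self.current) == Some(&expected) {
            self.current += 1;
            true
        } else {
            false
        }
    }
}

/// Reads one operator, using one character of lookahead for the long forms.
fn scan_operator(cursor: &mut Cursor) -> Option<Op> {
    match cursor.advance()? {
        '+' => Some(if cursor.match_char('=') { Op::PlusEqual } else { Op::Plus }),
        '=' => Some(if cursor.match_char('=') { Op::EqualEqual } else { Op::Equal }),
        '!' => Some(if cursor.match_char('=') { Op::NotEqual } else { Op::Not }),
        _ => None,
    }
}
```

Because match_char advances only on a match, the character after a short operator is left in place for the next call.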

6. Comment Handling

Single-line: // until end of line

if self.match_char('/') {
    // Skip until newline
    while self.peek() != '\n' && !self.is_at_end() {
        self.advance();
    }
}

Multi-line: /* ... */

if self.match_char('*') {
    while !self.is_at_end() {
        if self.peek() == '*' && self.peek_next() == '/' {
            self.advance(); // *
            self.advance(); // /
            return Ok(());
        }
        self.advance();
    }
}
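In the multi-line snippet above, the loop simply ends if the input runs out before */ is found. The sketch below makes that case visible to the caller by returning None. skip_block_comment is a hypothetical helper; the LexerError enum has no variant for an unterminated comment, so how to report it is left open here.

```rust
/// Skips a block comment. `start` is the index just past the opening "/*".
/// Returns the index just past the closing "*/", or None if the comment
/// is never closed. Comments do not nest in this sketch.
fn skip_block_comment(source: &[char], start: usize) -> Option<usize> {
    let mut current = start;
    // Stop one short of the end so that source[current + 1] is always valid.
    while current + 1 < source.len() {
        if source[current] == '*' && source[current + 1] == '/' {
            return Some(current + 2);
        }
        current += 1;
    }
    None
}
```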

Examples

Example 1: Simple Expression

Input:

let x = 10 + 20

Output Tokens:

Token { kind: Let, lexeme: "let", line: 1, column: 1 }
Token { kind: Identifier("x"), lexeme: "x", line: 1, column: 5 }
Token { kind: Equal, lexeme: "=", line: 1, column: 7 }
Token { kind: Integer(10), lexeme: "10", line: 1, column: 9 }
Token { kind: Plus, lexeme: "+", line: 1, column: 12 }
Token { kind: Integer(20), lexeme: "20", line: 1, column: 14 }
Token { kind: Eof, lexeme: "", line: 1, column: 16 }

Example 2: Function Definition

Input:

fn add(a, b) {
    return a + b
}

Output Tokens:

Fn → "fn"
Identifier("add") → "add"
LeftParen → "("
Identifier("a") → "a"
Comma → ","
Identifier("b") → "b"
RightParen → ")"
LeftBrace → "{"
Return → "return"
Identifier("a") → "a"
Plus → "+"
Identifier("b") → "b"
RightBrace → "}"
Eof

Example 3: String with Escapes

Input:

"hello\nworld"

Output:

Token {
    kind: String("hello\nworld"),  // Actual newline character
    lexeme: "\"hello\\nworld\"",   // Original text
    line: 1,
    column: 1
}

Error Handling

Unexpected Character

Input: @#$

Error:

LexerError::UnexpectedCharacter('@', 1, 1)
→ "Unexpected character '@' at line 1, column 1"

Unterminated String

Input: "hello

Error:

LexerError::UnterminatedString(1, 1)
→ "Unterminated string at line 1, column 1"

Invalid Number

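Input: 123abc (only if the scanner treats the whole run as a single number; following the process under Number Tokenization, this input could instead lex as Integer(123) followed by Identifier("abc"))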

Error:

LexerError::InvalidNumber("123abc", 1, 1)
→ "Invalid number '123abc' at line 1, column 1"

Testing

Test File: src/lexer/lexer_tests.rs

Coverage: 14 tests

Test Categories:

Token creation - Basic token structure
Literals - Integers, floats, strings, booleans
Escape sequences - \n, \t, \\, \"
Keywords - All language keywords
Operators - Arithmetic, comparison, logical
Identifiers - Variable names
Comments - Single-line and multi-line
Complete expressions - Real code snippets

Example Test:

#[test]
fn test_tokenize_integer() {
    let mut scanner = Scanner::new("42");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens.len(), 2); // integer + EOF
    assert_eq!(tokens[0].kind, TokenKind::Integer(42));
}

Performance Considerations

Current Implementation

Single pass: Reads source once
Character-by-character: Simple and correct
String allocation: Each token stores its lexeme
Vec growth: Tokens accumulated in vector

Optimization Opportunities (Future)

String interning: Reuse common strings
Arena allocation: Reduce allocations
Lazy tokenization: On-demand token generation
Parallel lexing: For large files

Usage

use aether::lexer::Scanner;

fn main() {
    let source = "let x = 42";
    let mut scanner = Scanner::new(source);

    match scanner.scan_tokens() {
        Ok(tokens) => {
            for token in tokens {
                println!("{:?}", token);
            }
        }
        Err(error) => {
            eprintln!("Lexer error: {}", error);
        }
    }
}

Implementation Notes

Why Vec<char> Instead of &str?

source: Vec<char>  // Current
// vs
source: &str       // Alternative

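Reason: Using Vec<char> allows:

Easy indexing: self.source[i] is a constant-time lookup
Unicode handling: Each char is a complete Unicode scalar value, so an index never lands inside a multi-byte UTF-8 sequence
Lookahead: peek() and peek_next() are simple index reads

Trade-off: More memory (a Rust char takes 4 bytes, versus 1 to 4 bytes per character in UTF-8) but simpler code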
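To illustrate why lookahead is simple over a Vec<char>, here is a minimal sketch of peek(), peek_next() and advance(). MiniScanner is a hypothetical reduced struct, and returning '\0' past the end of input is an assumption about the sentinel, not a documented behavior of the real Scanner.

```rust
// Hypothetical reduced scanner: only the source and the current index.
struct MiniScanner {
    source: Vec<char>,
    current: usize,
}

impl MiniScanner {
    fn new(text: &str) -> Self {
        MiniScanner { source: text.chars().collect(), current: 0 }
    }

    fn is_at_end(&self) -> bool {
        self.current >= self.source.len()
    }

    /// Current character without consuming it ('\0' at end of input).
    fn peek(&self) -> char {
        self.source.get(self.current).copied().unwrap_or('\0')
    }

    /// Character after the current one ('\0' if there is none).
    fn peek_next(&self) -> char {
        self.source.get(self.current + 1).copied().unwrap_or('\0')
    }

    /// Consumes and returns the current character ('\0' at end of input).
    fn advance(&mut self) -> char {
        let c = self.peek();
        if !self.is_at_end() {
            self.current += 1;
        }
        c
    }
}
```

With &str, the same lookups would need char_indices() or byte-offset bookkeeping, since a character may occupy more than one byte.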

Position Tracking

Both line and column are tracked for error messages:

'\n' => {
    self.line += 1;
    self.column = 1;
}
_ => {
    self.column += 1;
}

This enables precise error reporting.
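The same rule can be checked in isolation. The sketch below replays it over a prefix of the source to compute the 1-indexed position of a given offset; position_at is a hypothetical helper for illustration, whereas the real scanner updates line and column incrementally as it advances.

```rust
/// Returns the 1-indexed (line, column) of the character at `offset`,
/// applying the rule above to every character before it.
fn position_at(source: &[char], offset: usize) -> (usize, usize) {
    let (mut line, mut column) = (1, 1);
    for &c in source.iter().take(offset) {
        if c == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    (line, column)
}
```

For the input of Example 1, let x = 10 + 20, offset 4 (the x) maps to line 1, column 5, and the end of input maps to column 16, matching the Eof token shown there.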

Integration with Parser

The lexer output flows directly into the parser:

// Lexer → Parser
let mut scanner = Scanner::new(source);
let tokens = scanner.scan_tokens()?;
let mut parser = Parser::new(tokens);
let ast = parser.parse()?;

Common Tasks

How to Add a New Operator

Example: Adding the ** (exponentiation) operator

Step 1: Add token to token.rs

pub enum TokenKind {
    // ... existing tokens
    StarStar,  // ** (exponentiation)
}

Step 2: Update scanner.rs in scan_token()

'*' => {
    if self.match_char('*') {
        self.add_token(TokenKind::StarStar);  // **
    } else {
        self.add_token(TokenKind::Star);       // *
    }
}

Step 3: Write test in lexer_tests.rs

#[test]
fn test_exponentiation_operator() {
    let mut scanner = Scanner::new("2 ** 3");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens[1].kind, TokenKind::StarStar);
}

Step 4: Run test and verify

cargo test test_exponentiation_operator

How to Add a New Keyword

Example: Adding the class keyword

Step 1: Add token to token.rs

pub enum TokenKind {
    // ... existing keywords
    Class,  // class keyword
}

Step 2: Update keyword matching in scanner.rs

fn identifier(&mut self) {
    // ... consume identifier
    let text = self.source[self.start..self.current].iter().collect::<String>();

    let kind = match text.as_str() {
        "class" => TokenKind::Class,  // Add new keyword
        "let" => TokenKind::Let,
        // ... other keywords
        _ => TokenKind::Identifier(text.clone()),
    };
    self.add_token(kind);  // Emit the keyword or identifier token
}

Step 3: Write test

#[test]
fn test_class_keyword() {
    let mut scanner = Scanner::new("class MyClass");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens[0].kind, TokenKind::Class);
}

How to Add a New Literal Type

Example: Adding hexadecimal integers (0xFF)

Step 1: Update TokenKind if needed (or reuse Integer)

Step 2: Add recognition in scanner.rs

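'0' => {
    if self.match_char('x') || self.match_char('X') {
        self.hex_number()?;  // New method for hex
    } else {
        self.number();       // Regular decimal
    }
}

fn hex_number(&mut self) -> Result<(), LexerError> {
    while self.peek().is_ascii_hexdigit() {
        self.advance();
    }

    // Digits after the "0x" prefix; empty for a bare "0x"
    let hex_str: String = self.source[self.start + 2..self.current].iter().collect();
    // Report empty or out-of-range literals instead of panicking
    let value = i64::from_str_radix(&hex_str, 16).map_err(|_| {
        let lexeme: String = self.source[self.start..self.current].iter().collect();
        LexerError::InvalidNumber(lexeme, self.line, self.column)
    })?;
    self.add_token(TokenKind::Integer(value));
    Ok(())
}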

Step 3: Write comprehensive tests

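#[test]
fn test_hex_literals() {
    // scan_tokens() also appends Eof, so check the first token only
    let tokens = Scanner::new("0xFF").scan_tokens().unwrap();
    assert_eq!(tokens[0].kind, TokenKind::Integer(255));
    let tokens = Scanner::new("0x10").scan_tokens().unwrap();
    assert_eq!(tokens[0].kind, TokenKind::Integer(16));
}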

Debugging Tips

Problem: Token not recognized

Check: Is the character handled in scan_token()?
Check: Are you calling advance() before checking the character?

Problem: Position tracking incorrect

Check: Are you incrementing line and resetting column on \n?
Check: Is column incremented for every non-newline character?

Problem: Unexpected string/number parsing

Check: Escape sequences handled in string() method?
Check: Decimal point logic correct in number() method?

References

Source: src/lexer/
Tests: src/lexer/lexer_tests.rs
Design: docs/DESIGN.md - Token types and syntax
Development: docs/DEVELOPMENT.md - Testing guidelines

Last Updated: April 17, 2026
Status: 14 unit tests passing, no changes since initial implementation
