
Aether Lexer Documentation

Overview

The lexer (lexical analyzer) is the first phase of the Aether interpreter. It converts raw source code into a stream of tokens that can be parsed.

Location: src/lexer/

Status: ✅ Complete (14 unit tests passing)

Architecture

Source Code (.ae file)
        ↓
     Scanner
        ↓
   Token Stream
        ↓
     To Parser

Components

1. Token (token.rs)

Represents a single lexical unit in the source code.

TokenKind Enum

All token types supported by Aether:

Literals: Integer(i64), Float(f64), String(String)

Keywords: Let, Fn, If, Else, While, For, In, Return, Break, Continue, Import, From, As, Struct, Async, Await, Try, Catch, Throw, Finally, True, False, Null

Operators: arithmetic, comparison, and logical operators such as Plus, PlusEqual, Star, Equal

Delimiters: LeftParen, RightParen, LeftBrace, RightBrace, Comma

Special: Identifier(String), Eof

Token Struct

pub struct Token {
    pub kind: TokenKind,      // Type of token
    pub lexeme: String,       // Original text
    pub line: usize,          // Line number (1-indexed)
    pub column: usize,        // Column number (1-indexed)
}

Position tracking enables helpful error messages with exact locations.

2. Scanner (scanner.rs)

The main lexer implementation that tokenizes source code.

Scanner Struct

pub struct Scanner {
    source: Vec<char>,        // Source code as characters
    tokens: Vec<Token>,       // Accumulated tokens
    start: usize,             // Start of current token
    current: usize,           // Current position
    line: usize,              // Current line
    column: usize,            // Current column
}

Main Method

pub fn scan_tokens(&mut self) -> Result<Vec<Token>, LexerError>

Scans the entire source and returns all tokens or an error.

Error Types

pub enum LexerError {
    UnexpectedCharacter(char, usize, usize),
    UnterminatedString(usize, usize),
    InvalidNumber(String, usize, usize),
}

Each error includes position information for debugging.
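
The error strings shown later on this page suggest a Display implementation along these lines (a sketch, not necessarily the real one; the enum is repeated here so the example stands alone):

```rust
use std::fmt;

pub enum LexerError {
    UnexpectedCharacter(char, usize, usize),
    UnterminatedString(usize, usize),
    InvalidNumber(String, usize, usize),
}

impl fmt::Display for LexerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LexerError::UnexpectedCharacter(c, line, col) => {
                write!(f, "Unexpected character '{}' at line {}, column {}", c, line, col)
            }
            LexerError::UnterminatedString(line, col) => {
                write!(f, "Unterminated string at line {}, column {}", line, col)
            }
            LexerError::InvalidNumber(text, line, col) => {
                write!(f, "Invalid number '{}' at line {}, column {}", text, line, col)
            }
        }
    }
}
```

This is what lets `eprintln!("Lexer error: {}", error)` in the Usage section below print a human-readable message.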

Tokenization Process

1. Character-by-Character Scanning

The scanner reads one character at a time and determines what token to create:

fn scan_token(&mut self) -> Result<(), LexerError> {
    let c = self.advance();
    match c {
        ' ' | '\r' | '\t' => {} // Skip whitespace
        '\n' => { /* Track newlines */ }
        '(' => self.add_token(TokenKind::LeftParen),
        // ... more cases
        _ => return Err(LexerError::UnexpectedCharacter(c, self.line, self.column)),
    }
    Ok(())
}
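
The `advance`, `peek`, `peek_next`, and `match_char` helpers used throughout this page are not shown in full; a minimal sketch of how such a cursor might work (the struct name and bodies here are illustrative, not the real Scanner):

```rust
// Minimal cursor over a Vec<char> source. The real Scanner also
// updates line/column on every advance, omitted here for brevity.
struct Cursor {
    source: Vec<char>,
    current: usize,
}

impl Cursor {
    fn is_at_end(&self) -> bool {
        self.current >= self.source.len()
    }

    // Consume and return the current character.
    fn advance(&mut self) -> char {
        let c = self.source[self.current];
        self.current += 1;
        c
    }

    // Look at the current character without consuming it.
    fn peek(&self) -> char {
        if self.is_at_end() { '\0' } else { self.source[self.current] }
    }

    // Look one character past the current one.
    fn peek_next(&self) -> char {
        if self.current + 1 >= self.source.len() { '\0' } else { self.source[self.current + 1] }
    }

    // Consume the current character only if it equals `expected`.
    fn match_char(&mut self, expected: char) -> bool {
        if self.is_at_end() || self.source[self.current] != expected {
            false
        } else {
            self.current += 1;
            true
        }
    }
}
```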

2. Number Tokenization

Integer: Sequence of digits

42 → Integer(42)

Float: Digits with decimal point

3.14 → Float(3.14)

Process:

  1. Consume all digits
  2. Check for decimal point followed by digits
  3. Parse as i64 or f64
  4. Return error if parsing fails
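
The four steps above can be sketched as a self-contained function (a free function over a char slice rather than the real Scanner method, so the names here are illustrative):

```rust
// Hypothetical subset of TokenKind for this sketch.
#[derive(Debug, PartialEq)]
enum NumKind {
    Integer(i64),
    Float(f64),
}

// Scan a number starting at `start`; returns the token kind and the
// index one past the last consumed character.
fn scan_number(src: &[char], start: usize) -> Result<(NumKind, usize), String> {
    let mut i = start;
    // 1. Consume all digits.
    while i < src.len() && src[i].is_ascii_digit() {
        i += 1;
    }
    // 2. Check for a decimal point followed by digits.
    let is_float = i + 1 < src.len() && src[i] == '.' && src[i + 1].is_ascii_digit();
    if is_float {
        i += 1; // consume '.'
        while i < src.len() && src[i].is_ascii_digit() {
            i += 1;
        }
    }
    // 3. Parse as i64 or f64; 4. report failure as an error.
    let text: String = src[start..i].iter().collect();
    let kind = if is_float {
        NumKind::Float(text.parse::<f64>().map_err(|e| e.to_string())?)
    } else {
        NumKind::Integer(text.parse::<i64>().map_err(|e| e.to_string())?)
    };
    Ok((kind, i))
}
```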

3. String Tokenization

Syntax: Text enclosed in double quotes "text"

Features: escape sequences (\n, \t, \\, \") are processed inside the literal

Process:

  1. Consume characters until closing "
  2. Process escape sequences
  3. Return error if unterminated

Example:

"hello\nworld" → String("hello\nworld")
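
Putting the steps together, a standalone sketch (a free function over a char slice, starting just past the opening quote; the real method also records positions):

```rust
// Scan a string literal whose opening '"' has already been consumed.
// Returns the unescaped contents and the index one past the closing quote,
// or an error if the string is unterminated.
fn scan_string(src: &[char], mut i: usize) -> Result<(String, usize), &'static str> {
    let mut value = String::new();
    while i < src.len() && src[i] != '"' {
        if src[i] == '\\' && i + 1 < src.len() {
            // Process a small set of escape sequences.
            i += 1;
            value.push(match src[i] {
                'n' => '\n',
                't' => '\t',
                '\\' => '\\',
                '"' => '"',
                other => other, // unknown escapes pass through in this sketch
            });
        } else {
            value.push(src[i]);
        }
        i += 1;
    }
    if i >= src.len() {
        return Err("unterminated string");
    }
    Ok((value, i + 1)) // skip the closing '"'
}
```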

4. Identifier and Keyword Tokenization

Identifiers: Start with letter or underscore, followed by alphanumeric or underscore

Process:

  1. Consume all alphanumeric/underscore characters
  2. Check if it’s a keyword
  3. Return keyword token or identifier token

Keywords Map (complete list):

match text.as_str() {
    "let" => TokenKind::Let,
    "fn" => TokenKind::Fn,
    "if" => TokenKind::If,
    "else" => TokenKind::Else,
    "while" => TokenKind::While,
    "for" => TokenKind::For,
    "in" => TokenKind::In,
    "return" => TokenKind::Return,
    "break" => TokenKind::Break,
    "continue" => TokenKind::Continue,
    "import" => TokenKind::Import,
    "from" => TokenKind::From,
    "as" => TokenKind::As,
    "struct" => TokenKind::Struct,
    "async" => TokenKind::Async,
    "await" => TokenKind::Await,
    "try" => TokenKind::Try,
    "catch" => TokenKind::Catch,
    "throw" => TokenKind::Throw,
    "finally" => TokenKind::Finally,
    "true" => TokenKind::True,
    "false" => TokenKind::False,
    "null" => TokenKind::Null,
    _ => TokenKind::Identifier(text)
}

5. Operator Tokenization

Single-character operators: +, -, *, %

Multi-character operators: Lookahead for second character

Example:

if self.match_char('=') {
    TokenKind::PlusEqual  // +=
} else {
    TokenKind::Plus       // +
}

6. Comment Handling

Single-line: // until end of line

if self.match_char('/') {
    // Skip until newline
    while self.peek() != '\n' && !self.is_at_end() {
        self.advance();
    }
}

Multi-line: /* ... */

if self.match_char('*') {
    while !self.is_at_end() {
        if self.peek() == '*' && self.peek_next() == '/' {
            self.advance(); // *
            self.advance(); // /
            return Ok(());
        }
        self.advance();
    }
    // Reaching EOF here means the comment was never closed.
}
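
The multi-line case can be checked in isolation with a standalone sketch (a free function mirroring the loop above):

```rust
// Given the index just past "/*", return the index just past the
// matching "*/", or None if the comment is unterminated.
fn skip_block_comment(src: &[char], mut i: usize) -> Option<usize> {
    while i + 1 < src.len() {
        if src[i] == '*' && src[i + 1] == '/' {
            return Some(i + 2);
        }
        i += 1;
    }
    None
}
```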

Examples

Example 1: Simple Expression

Input:

let x = 10 + 20

Output Tokens:

1. Token { kind: Let, lexeme: "let", line: 1, column: 1 }
2. Token { kind: Identifier("x"), lexeme: "x", line: 1, column: 5 }
3. Token { kind: Equal, lexeme: "=", line: 1, column: 7 }
4. Token { kind: Integer(10), lexeme: "10", line: 1, column: 9 }
5. Token { kind: Plus, lexeme: "+", line: 1, column: 12 }
6. Token { kind: Integer(20), lexeme: "20", line: 1, column: 14 }
7. Token { kind: Eof, lexeme: "", line: 1, column: 16 }

Example 2: Function Definition

Input:

fn add(a, b) {
    return a + b
}

Output Tokens:

1. Fn → "fn"
2. Identifier("add") → "add"
3. LeftParen → "("
4. Identifier("a") → "a"
5. Comma → ","
6. Identifier("b") → "b"
7. RightParen → ")"
8. LeftBrace → "{"
9. Return → "return"
10. Identifier("a") → "a"
11. Plus → "+"
12. Identifier("b") → "b"
13. RightBrace → "}"
14. Eof

Example 3: String with Escapes

Input:

"hello\nworld"

Output:

Token {
    kind: String("hello\nworld"),  // Actual newline character
    lexeme: "\"hello\\nworld\"",   // Original text
    line: 1,
    column: 1
}

Error Handling

Unexpected Character

Input: @#$

Error:

LexerError::UnexpectedCharacter('@', 1, 1)
→ "Unexpected character '@' at line 1, column 1"

Unterminated String

Input: "hello

Error:

LexerError::UnterminatedString(1, 1)
→ "Unterminated string at line 1, column 1"

Invalid Number

Input: 123abc (if lexer tries to parse as number)

Error:

LexerError::InvalidNumber("123abc", 1, 1)
→ "Invalid number '123abc' at line 1, column 1"

Testing

Test File: src/lexer/lexer_tests.rs

Coverage: 14 tests

Test Categories:

  1. Token creation - Basic token structure
  2. Literals - Integers, floats, strings, booleans
  3. Escape sequences - \n, \t, \\, \"
  4. Keywords - All language keywords
  5. Operators - Arithmetic, comparison, logical
  6. Identifiers - Variable names
  7. Comments - Single-line and multi-line
  8. Complete expressions - Real code snippets

Example Test:

#[test]
fn test_tokenize_integer() {
    let mut scanner = Scanner::new("42");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens.len(), 2); // integer + EOF
    assert_eq!(tokens[0].kind, TokenKind::Integer(42));
}

Performance Considerations

Current Implementation

The scanner makes a single pass over the source, so tokenization is O(n) in input length. Collecting the source into a Vec<char> up front copies it once, but makes indexed lookahead (peek, peek_next) cheap and simple.

Optimization Opportunities (Future)

Possible improvements include scanning over &str with byte offsets, which would avoid the up-front Vec<char> copy at the cost of more careful index handling.

Usage

use aether::lexer::Scanner;

fn main() {
    let source = "let x = 42";
    let mut scanner = Scanner::new(source);

    match scanner.scan_tokens() {
        Ok(tokens) => {
            for token in tokens {
                println!("{:?}", token);
            }
        }
        Err(error) => {
            eprintln!("Lexer error: {}", error);
        }
    }
}

Implementation Notes

Why Vec<char> Instead of &str?

source: Vec<char>  // Current
// vs
source: &str       // Alternative

Reason: Using Vec<char> allows:

  1. Simple O(1) indexing by character position (self.source[self.current])
  2. Easy lookahead via peek() and peek_next()
  3. Handling multi-byte UTF-8 characters without byte-offset arithmetic

Trade-off: More memory but simpler code

Position Tracking

Both line and column are tracked for error messages:

'\n' => {
    self.line += 1;
    self.column = 1;
}
_ => {
    self.column += 1;
}

This enables precise error reporting.
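
The tracking rule above can be exercised in isolation (a standalone sketch, not part of the Scanner):

```rust
// Compute the (line, column) of each character using the same rule as
// the snippet above: 1-indexed, and a newline resets the column to 1.
fn positions(src: &str) -> Vec<(usize, usize)> {
    let (mut line, mut column) = (1, 1);
    let mut out = Vec::new();
    for c in src.chars() {
        out.push((line, column));
        if c == '\n' {
            line += 1;
            column = 1;
        } else {
            column += 1;
        }
    }
    out
}
```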

Integration with Parser

The lexer output flows directly into the parser:

// Lexer → Parser
let mut scanner = Scanner::new(source);
let tokens = scanner.scan_tokens()?;
let mut parser = Parser::new(tokens);
let ast = parser.parse()?;

Common Tasks

How to Add a New Operator

Example: Adding the ** (exponentiation) operator

Step 1: Add token to token.rs

pub enum TokenKind {
    // ... existing tokens
    StarStar,  // ** (exponentiation)
}

Step 2: Update scanner.rs in scan_token()

'*' => {
    if self.match_char('*') {
        self.add_token(TokenKind::StarStar);  // **
    } else {
        self.add_token(TokenKind::Star);       // *
    }
}

Step 3: Write test in lexer_tests.rs

#[test]
fn test_exponentiation_operator() {
    let mut scanner = Scanner::new("2 ** 3");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens[1].kind, TokenKind::StarStar);
}

Step 4: Run test and verify

cargo test test_exponentiation_operator

How to Add a New Keyword

Example: Adding the class keyword

Step 1: Add token to token.rs

pub enum TokenKind {
    // ... existing keywords
    Class,  // class keyword
}

Step 2: Update keyword matching in scanner.rs

fn identifier(&mut self) {
    // ... consume identifier
    let text = self.source[self.start..self.current].iter().collect::<String>();

    let kind = match text.as_str() {
        "class" => TokenKind::Class,  // Add new keyword
        "let" => TokenKind::Let,
        // ... other keywords
        _ => TokenKind::Identifier(text.clone()),
    };
}

Step 3: Write test

#[test]
fn test_class_keyword() {
    let mut scanner = Scanner::new("class MyClass");
    let tokens = scanner.scan_tokens().unwrap();
    assert_eq!(tokens[0].kind, TokenKind::Class);
}

How to Add a New Literal Type

Example: Adding hexadecimal integers (0xFF)

Step 1: Update TokenKind if needed (or reuse Integer)

Step 2: Add recognition in scanner.rs

'0' => {
    if self.match_char('x') || self.match_char('X') {
        self.hex_number()?;  // New method for hex
    } else {
        self.number()?;      // Regular decimal
    }
}

fn hex_number(&mut self) -> Result<(), LexerError> {
    while self.peek().is_ascii_hexdigit() {
        self.advance();
    }

    // Collect the digits after the "0x" prefix.
    let hex_str: String = self.source[self.start + 2..self.current].iter().collect();
    let value = i64::from_str_radix(&hex_str, 16)
        .map_err(|_| LexerError::InvalidNumber(hex_str.clone(), self.line, self.column))?;
    self.add_token(TokenKind::Integer(value));
    Ok(())
}

Step 3: Write comprehensive tests

#[test]
fn test_hex_literals() {
    assert_eq!(scan("0xFF"), vec![TokenKind::Integer(255)]);
    assert_eq!(scan("0x10"), vec![TokenKind::Integer(16)]);
}
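
The `scan` helper used in the test above is assumed to live in the test module. Separately, the radix conversion at the core of `hex_number` can be verified in isolation (hypothetical free function, mirroring the method above):

```rust
// Parse a hex literal lexeme that includes its "0x" prefix, e.g. "0xFF".
fn parse_hex(lexeme: &str) -> Result<i64, String> {
    let digits = &lexeme[2..]; // skip the "0x"
    i64::from_str_radix(digits, 16)
        .map_err(|_| format!("Invalid number '{}'", lexeme))
}
```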

Debugging Tips

Problem: Token not recognized

Check that scan_token() has a match arm for the character, and that multi-character operators are tried (via match_char) before falling back to the single-character token.

Problem: Position tracking incorrect

Verify that every path that consumes a character updates column, and that '\n' both increments line and resets column to 1.

Problem: Unexpected string/number parsing

Print the accumulated lexeme (self.source[self.start..self.current]) before parsing to see exactly which characters were consumed.

Last Updated: April 17, 2026
Status: 14 unit tests passing — no changes since initial implementation

