Oblivion Mod:Oblivion Tokenizer
The first stage in compiling an Oblivion script (since there is no preprocessor) will be the tokenizing of the script text into a stream of tokens ready to be passed on to the parsing stage. The tokenizer will be driven entirely by data supplied by an external text file.
The tokenizer can be split up into several sections:
- Character Classes -- A class for matching a custom set of characters.
- Token Types -- A class for handling a custom set of token definitions.
- Token Tables -- Table classes used for tokenizing a character stream using custom rules.
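For illustration, the external definition file might pair character-class definitions with table rows along these lines. This is a purely hypothetical syntax sketched for this overview; the actual format is defined in the Character Classes, Token Types, and Token Tables sections.

charclass Whitespace  " \t\r\n"
charclass Letter      "a-zA-Z_"
charclass Digit       "0-9"

table Start
    Whitespace   ignore
    Letter       moveto    Identifier
    Digit        moveto    Number
    default      error     "Unexpected character"

table Identifier
    Letter       continue
    Digit        continue
    default      jmpreturn TOKEN_IDENTIFIER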
Token Parsing Algorithm
The token parsing uses a relatively simple table based ruleset which is essentially a finite state machine. The Token Table definitions use custom Character Classes to return custom Token Types, all as defined in an external text file.
The basic parsing algorithm is expressed below. Note that some error handling code has been removed to keep it simpler, and that the RunTokenTable() function is called recursively (an iterative version is preferred but more complex).
bool        m_GetTokenChar;     // true when the next character should be read
string      m_CurrentToken;     // characters accumulated for the current token
char        m_CurrentChar;
CTokenTable m_TokenTables[];

function Tokenize (ScanSource) {
    ScanSource->MoveToStart();
    m_GetTokenChar = true;

    while (!ScanSource->IsEOF()) {
        pToken = RunTokenTable(m_TokenTables.GetStartTable(), ScanSource);
        if (pToken->IsEndToken()) break;
    }
}

function RunTokenTable (TokenTable, ScanSource) {
    if (m_GetTokenChar) {
        if (ScanSource->IsEOF()) return (NULL);
        m_GetTokenChar = false;
        m_CurrentChar = ScanSource->GetChar();
    }

        // Find the first row whose character class matches the current character
    for (Index = 0; Index < TokenTable->GetNumRows(); ++Index) {
        TokenRow = TokenTable->GetRow(Index);
        if (TokenRow->DoesMatch(m_CurrentToken, m_CurrentChar)) {
            return DoRowAction(TokenRow, TokenTable, ScanSource);
        }
    }

        // No row matched; fall back to the table's default row
    return DoRowAction(TokenTable->GetDefaultRow(), TokenTable, ScanSource);
}

function DoRowAction (TokenRow, TokenTable, ScanSource) {
    switch (TokenRow->GetAction()) {
        case TOKEN_RESULT_IGNORE:
            m_GetTokenChar = true;
            return RunTokenTable(TokenTable, ScanSource);
        case TOKEN_RESULT_ERROR:
            ReportError(TokenRow->GetMessage());
            m_GetTokenChar = true;
            m_CurrentToken.Empty();
            return (NULL);
        case TOKEN_RESULT_CONTINUE:
            m_CurrentToken += m_CurrentChar;
            m_GetTokenChar = true;
            return RunTokenTable(TokenTable, ScanSource);
        case TOKEN_RESULT_MOVETO:
            m_CurrentToken += m_CurrentChar;
            m_GetTokenChar = true;
            return RunTokenTable(TokenRow->GetActionTable(), ScanSource);
        case TOKEN_RESULT_JUMPTO:
            return RunTokenTable(TokenRow->GetActionTable(), ScanSource);
        case TOKEN_RESULT_JMPRETURN:
                // Emit the token without consuming the current character
            SaveToken(m_CurrentToken, TokenRow->GetActionToken());
            m_CurrentToken.Empty();
            return (TokenRow->GetActionToken());
        case TOKEN_RESULT_RETURN:
            m_CurrentToken += m_CurrentChar;
            SaveToken(m_CurrentToken, TokenRow->GetActionToken());
            m_CurrentToken.Empty();
            m_GetTokenChar = true;
            return (TokenRow->GetActionToken());
    }
}
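As a concrete illustration, here is a minimal, self-contained C++ version of the same table-driven approach, written iteratively rather than recursively. The Row/Table/Action names and the identifier and number tables are illustrative assumptions for this sketch, not the actual implementation described above.

// Minimal iterative sketch of a table-driven tokenizer (illustrative only).
// Build: g++ -std=c++17 tokenizer_sketch.cpp
#include <cctype>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

enum class Action { Ignore, Continue, MoveTo, JmpReturn, Error };

struct Row {
    std::function<bool(char)> matches;  // character class for this row
    Action      action;
    std::size_t nextTable;              // target table for MoveTo
    std::string token;                  // token type emitted by JmpReturn
};

struct Table {
    std::vector<Row> rows;
    Row defaultRow;                     // used when no row matches
};

struct Token { std::string type, value; };

std::vector<Token> Tokenize(const std::string& source) {
    auto ws    = [](char c) { return std::isspace((unsigned char)c) != 0; };
    auto alpha = [](char c) { return std::isalpha((unsigned char)c) != 0; };
    auto alnum = [](char c) { return std::isalnum((unsigned char)c) != 0; };
    auto digit = [](char c) { return std::isdigit((unsigned char)c) != 0; };

    // Table 0: start state, Table 1: identifier, Table 2: number.
    std::vector<Table> tables = {
        Table{ { Row{ ws,    Action::Ignore, 0, "" },
                 Row{ alpha, Action::MoveTo, 1, "" },
                 Row{ digit, Action::MoveTo, 2, "" } },
               Row{ nullptr, Action::Error, 0, "" } },
        Table{ { Row{ alnum, Action::Continue, 0, "" } },
               Row{ nullptr, Action::JmpReturn, 0, "IDENT" } },
        Table{ { Row{ digit, Action::Continue, 0, "" } },
               Row{ nullptr, Action::JmpReturn, 0, "NUMBER" } },
    };

    std::vector<Token> result;
    std::string current;
    std::size_t table = 0, pos = 0;

    for (;;) {
        if (pos >= source.size() && table == 0) break;  // done, nothing pending
        char c = (pos < source.size()) ? source[pos] : '\0';

        // Find the first matching row, falling back to the default row.
        const Row* row = &tables[table].defaultRow;
        for (const Row& r : tables[table].rows)
            if (pos < source.size() && r.matches(c)) { row = &r; break; }

        switch (row->action) {
            case Action::Ignore:   ++pos; break;                // drop the character
            case Action::Continue: current += c; ++pos; break;  // accumulate it
            case Action::MoveTo:   current += c; ++pos; table = row->nextTable; break;
            case Action::JmpReturn:                             // emit, then rescan char
                result.push_back({ row->token, current });
                current.clear(); table = 0; break;
            case Action::Error:
                std::cerr << "unexpected character '" << c << "'\n";
                current.clear(); table = 0; ++pos; break;
        }
    }
    return result;
}

int main() {
    for (const Token& t : Tokenize("set count 42"))
        std::cout << t.type << ": " << t.value << "\n";  // IDENT: set, IDENT: count, NUMBER: 42
}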
The basic steps in the algorithm are:
- Get the first character from the scan source.
- Keep parsing until the scan source is empty or an end token is reached:
  - Set the initial table to the start token table (Step A).
  - Begin parsing the table (Step B):
    - Find the row in the current table that matches the current character. Use the default row if no match is found.
    - Perform the action in the matching row:
      - Ignore -- Ignore the current character and get the next character. Restart parsing the current table (goto Step B).
      - MoveTo -- Add the current character to the token value and get the next character. Start parsing the new table (goto Step B).
      - JumpTo -- Start parsing the new table (goto Step B).
      - Return -- Add the current character to the token value and get the next character. Create a new token with the current value. Clear the current token value. End parsing the table (goto Step A).
      - JmpReturn -- Create a new token with the current value without consuming the current character. Clear the current token value. End parsing the table (goto Step A).
      - Continue -- Add the current character to the token value and get the next character. Restart parsing the current table (goto Step B).
      - Error -- Report the error and clear the token value, then get the next character. End parsing the table (goto Step A).
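As a worked example, assume the hypothetical start, identifier, and number tables from the sketch above. Tokenizing the input "ab 12" then proceeds as follows:

'a'  : Start table       -> MoveTo(Identifier)   token value = "a"
'b'  : Identifier table  -> Continue             token value = "ab"
' '  : Identifier table  -> JmpReturn            emit IDENT "ab"; ' ' is re-examined
' '  : Start table       -> Ignore
'1'  : Start table       -> MoveTo(Number)       token value = "1"
'2'  : Number table      -> Continue             token value = "12"
EOF  : Number table      -> JmpReturn            emit NUMBER "12"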
Performance
The performance of this method of tokenizing is generally good. Even with a very simple, unoptimized implementation, all 1687 scripts in Oblivion (1.2 MB, 60,000 lines) can be tokenized in roughly 0.5 seconds on a typical mid-range computer.