The HashTable Tokenizer is a hash table implemented in C++ and designed for efficient tokenization. It converts text into numeric tokens and back again, an intermediate processing step used by large language models. The implementation uses a custom hash function with separate chaining to resolve collisions, and is optimized for fast insertion and retrieval of token-word pairs.
- Custom Hashing: Implements a unique hashing function for distributing words efficiently across the hash table.
- Separate Chaining: Utilizes separate chaining to handle collisions, ensuring fast access even with hash conflicts.
- No STL Libraries: Developed without the Standard Template Library (STL) to meet project constraints.
- Efficient Tokenization: Supports operations to tokenize text to numeric tokens and retrieve text from tokens.
- Memory Management: Careful handling of memory allocation and deallocation to prevent leaks and ensure optimal performance.
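
The features above can be illustrated with a minimal sketch. It is not the project's actual code: the names `Node`, `Tokenizer`, `insert`, `tokenize`, and `retrieve` are hypothetical, the polynomial hash is a stand-in for the project's custom hash function (which is not described here), and the conventions that tokens start at 1 and that an unknown word yields -1 are assumptions. It uses only raw arrays and `<cstring>`, with no STL containers.

```cpp
#include <cstddef>
#include <cstring>

// One entry in a bucket's chain (hypothetical layout).
struct Node {
    char* word;   // owned copy of the word
    int token;    // numeric token assigned at insertion
    Node* next;   // next entry in the same bucket
};

class Tokenizer {
public:
    explicit Tokenizer(std::size_t m)
        : size_(m ? m : 1), count_(0), cap_(8),
          buckets_(new Node*[size_]()),      // () zero-initializes every chain head
          words_(new const char*[cap_]) {}

    ~Tokenizer() {
        // Free every chain node and its word, then the two arrays.
        for (std::size_t i = 0; i < size_; ++i) {
            Node* n = buckets_[i];
            while (n) {
                Node* next = n->next;
                delete[] n->word;
                delete n;
                n = next;
            }
        }
        delete[] buckets_;
        delete[] words_;
    }

    Tokenizer(const Tokenizer&) = delete;
    Tokenizer& operator=(const Tokenizer&) = delete;

    // Returns true if the word was added, false if it was already present.
    bool insert(const char* w) {
        if (tokenize(w) != -1) return false;
        if (count_ == cap_) grow();
        std::size_t len = std::strlen(w);
        char* copy = new char[len + 1];
        std::memcpy(copy, w, len + 1);
        std::size_t b = hash(w);
        buckets_[b] = new Node{copy, count_ + 1, buckets_[b]};  // push at chain head
        words_[count_++] = copy;  // token t is stored at index t - 1
        return true;
    }

    // Word -> token; -1 if the word is unknown (assumed convention).
    int tokenize(const char* w) const {
        for (Node* n = buckets_[hash(w)]; n; n = n->next)
            if (std::strcmp(n->word, w) == 0) return n->token;
        return -1;
    }

    // Token -> word; nullptr if the token is out of range.
    const char* retrieve(int t) const {
        return (t >= 1 && t <= count_) ? words_[t - 1] : nullptr;
    }

private:
    // Stand-in polynomial hash; the project's real function may differ.
    std::size_t hash(const char* w) const {
        std::size_t h = 0;
        for (; *w; ++w) h = h * 31 + static_cast<unsigned char>(*w);
        return h % size_;
    }

    // Double the token -> word array when it is full.
    void grow() {
        int newCap = cap_ * 2;
        const char** bigger = new const char*[newCap];
        for (int i = 0; i < count_; ++i) bigger[i] = words_[i];
        delete[] words_;
        words_ = bigger;
        cap_ = newCap;
    }

    std::size_t size_;
    int count_;
    int cap_;
    Node** buckets_;
    const char** words_;
};
```

In this sketch, the chained buckets serve word-to-token lookups while a separate array indexed by token serves token-to-word lookups, so both directions avoid scanning the whole table. Each word is allocated once and released once in the destructor.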
Ensure you have a C++ compiler installed on your system (GCC recommended). This project does not require any external libraries.
To compile the project, navigate to the project directory and use the provided Makefile:

    make

After compilation, you can run the program using:

    ./a.out < input_file

Here, input_file is a file containing a sequence of commands as described in the "Commands" section.
This project accepts the following commands:
- M m: Initializes a new hash table with size m.
- INSERT word: Inserts a word into the tokenizer.
- READ filename: Reads words from a specified file.
- TOKENIZE word: Returns the token associated with the word.
- RETRIEVE t: Retrieves the word associated with token t.
- STOK string_of_words: Tokenizes a string of words.
- TOKS string_of_tokens: Turns a string of tokens back into words.
- PRINT k: Prints the keys stored at position k in the hash table.
Each input file should end with the string "EXIT", which terminates the program. Depending on the command, the output is a success or failure message, a token, a word, or a series of tokens or words.
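
As a rough illustration of how one line of such an input file might be split into a command and its argument, here is a small sketch. The `Command` enum, the `Parsed` struct, and `parseLine` are hypothetical names, and the project's actual parsing may work differently. Only the command names listed above are taken from this document.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical identifiers for the commands listed above.
enum Command {
    CMD_M, CMD_INSERT, CMD_READ, CMD_TOKENIZE, CMD_RETRIEVE,
    CMD_STOK, CMD_TOKS, CMD_PRINT, CMD_EXIT, CMD_UNKNOWN
};

struct Parsed {
    Command cmd;
    const char* arg;  // points into the original line; "" if there is no argument
};

// Splits a line into its leading command word and the rest of the line.
// The argument is not copied or trimmed on the right, so a trailing
// newline in the input stays part of it.
Parsed parseLine(const char* line) {
    static const struct { const char* name; Command cmd; } table[] = {
        {"M", CMD_M},               {"INSERT", CMD_INSERT},
        {"READ", CMD_READ},         {"TOKENIZE", CMD_TOKENIZE},
        {"RETRIEVE", CMD_RETRIEVE}, {"STOK", CMD_STOK},
        {"TOKS", CMD_TOKS},         {"PRINT", CMD_PRINT},
        {"EXIT", CMD_EXIT},
    };
    // Length of the first whitespace-delimited word.
    std::size_t n = std::strcspn(line, " \t\r\n");
    for (const auto& e : table) {
        if (std::strlen(e.name) == n && std::strncmp(line, e.name, n) == 0) {
            const char* arg = line + n;
            arg += std::strspn(arg, " \t");  // skip the separator
            return {e.cmd, arg};
        }
    }
    return {CMD_UNKNOWN, line};
}
```

A driver loop could read lines from standard input, call `parseLine` on each, dispatch on `cmd`, and stop when it sees `CMD_EXIT`.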
For a detailed explanation of the program's design and implementation, refer to the design document included in the repository.
To test the project with Valgrind for memory leaks:

    valgrind ./a.out < test01.in

Replace test01.in with your test input file.
Contributions to the HashTable Tokenizer are welcome. Please feel free to fork the repository, make changes, and submit pull requests.