Introduction to Tries

Estimated Time

1.5 hours

A Short Introduction to the Overall Experiment

Introduction video for Tries and Suffix Tries.

Prerequisites of the Experiment

This experiment requires you to have basic knowledge about :

And above all, a curiosity to learn and explore!

Overview of the Experiment

  • This experiment aims to help understand what a Trie data structure is and how it can be used to store words and find words in that list in a manner that is both space and time efficient.
  • The experiment features a series of modules with video lectures, interactive demonstrations, simulations, hands-on practice exercises and quizzes for self analysis.

Learning Objectives of the Experiment

In this experiment, you will be able to do the following:

  • Given a pre-build Trie, search for a given word in it.
  • Given a list of words, build a Trie.
  • Understand the use of Tries in a practical situation of String Pattern matching.

Experiment Modules and their Weightage

Module Weightage Expectation
Pre-Test 10% Solve All questionss
Searching in Tries 30% Understand how to find words in a Trie
Inserting into a Trie 30% Understand how to add words to a Trie
Suffix Tries 10% Understand the real world use of Suffix Tries
Post-test 20% Solve all Questions

Pre-Test of the Experiment

Estimated Time

10 minutes

Instructions for Pre-Test

  • Pretest includes questions on what prefix and suffix are and parent-child relationships in trees.
  • If you want to revise these topics before taking the quiz, go through the Recap module first.
  • Read the questions in the quiz section and select the correct option from the ones provided. Please note that some questions may have more than one correct response.

Quick look at Sorting, Time and Space Complexity

What is a Tree?

A tree is a Data Structure which is arranged in the form of multiple nodes holding some data, connected to each other as in the figure below. The node at the top is called the root. All the nodes reachable from it, below it, are it's children, and the node connected to it, directly above it, i.e. on the path to the root is it's parent.

A Tree

Picture showing parent-child relationships in a Tree.

Time and Space Complexity

  • Time complexity of an algorithm gives the measure of time taken by it to run as a function of the length of the input. Similarly, Space complexity of an algorithm quantifies the amount of space or memory taken by an algorithm to run as a function of the length of the input.
  • Recall that if our word list (the one we are searching through) is an array of length N, and our algorithm iterates through the array once, then time complexity will be O(N). If for each element we have to perform m operations, where m is the length of the word, then the algorithm is O(NM).

A Quiz with Multiple Choice Questions

Q1. Which of the following is a prefix of the word 'ABRACADABRA'?

(A) CADABRA

(B) ABRA

(C) ABARD

(D) ABRACAD

A prefix of a word is a collection of letters that appears at the start of a word.

Q2. How many parents can each node (not the Root) of a Tree of Size N can have?

(A) N-1 Parents

(B) Upto N^2 parents

(C) 1 Parent

(D) 0 Parents

A tree is a graph where every node has one and only one parent, though one node may have multiple children.

Q3. How many parents does the root of a tree of size N can have?

(A) N-1 Parents

(B) Upto N^2 parents

(C) 1 Parent

(D) 0 Parents

A tree is a graph where every node has one and only one parent, though one node may have multiple children.

Q4. int a = 0, b = 0;
for (i = 0; i < N; i++) {
a = a + rand();
}
for (j = 0; j < M; j++) {
b = b + rand();
}
What is the time and space complexity for the above code?

(A) O(N * M) time, O(1) space

(B) O(N + M) time, O(N + M) space

(C) O(N + M) time, O(1) space

(D) O(N * M) time, O(N + M) space

The first loop runs for O(n) time, second loop runs for O(m). Constant extra space is used for 2 variables a and b, hence O(1) space.

What are Tries?

Estimated Time

30 minutes

Searching in Tries: Concept

Video for Searching in Tries

Learning Objectives of this Module:

In this module, we will:

  • See an example of a trie.
  • Observe how any word can be inserted into and searched from a trie.
  • Learn the algorithm to search for words in the trie.
  • Learn when to report failure in searching for a word.
  • Practice the algorithm.
  • Test your conceptual understanding with a short quiz.

What are Tries and how do they help in searching

Search a word from a list - Naive Approach

Let's say we have a huge list of words, for example a Dictionary (Here in this example our list is [CAT, CAP, CAN, CADET]). You want to find out whether or not a word, let's say CABLE, is in this list. The naive approach to solving this problem is as follows.

  • Store all the words in the list into separate arrays / linked lists / any data structure, as done in the figure on the left below.
  • When the new word comes, go through each word in the list and try to match it. Report a match if there is one, else a failure.

We notice here that it is takes a lot of steps to search a word, and that we waste a lot of memory. Notice all the words that start with a 'C' followed by an 'A', for which we sort those letters over and over again for each word. Let's try to improve our approach.

Why Tries should be used

Image showing how Tries can outperform linked list in searching words.

So what is a TRIE?

The way we, as humans, approach looking at words in a dictionary, is that we try to find the first letter 'C', then in the pages starting with 'C', we try and find 'A', then 'B', then 'L', and 'E'. We want to take the same approach in our algorithm. We can optimize our search in the following way:

  • We can merge all the common letters, that is if two words have a common prefix, then the prefix would be stored once and the words will branch out of it. (saves memory).
  • We can now search in one go. We see that all words start with a 'C', followed by an 'A', hence we can try finding 'B', and report failure upon it not being there. This is the approach taken by a common man to solve this problem.
You can see the illustration of this tree like method (A TRIE), in the image on the right. A trie is nothing but a Tree of letters, where each path from the root to the leaves represents a word that is made up of all the letters that were found on it's path. It helps us solve the problem of searching words in a list quickly.

How to Search in a Trie

Search Algorithm

Video for searching in a trie

Algorithm

  • We start at the Root Node.
  • We move to each of it's children and match the first letter of our word with them.
  • If none of the children match, then we report that the word is not in the list.
  • If one of the children match, then we take that as our new node, and compare it's children with the second letter.
  • We keep doing this until we are out of letters in the word we are searching for, and finally when the last letter is matched, we try to match an end-point '$'. If it's there, then the word exists, otherwise it does not.
This is the algorithm for searching. Let's now try to understand why we need endpoints.

Why $ is needed



Image showing difficulty of searching words without end character.

Why $ is needed

We have added a '^' to the start of each word and a '$' to the end. This is to solve the problem which is shown in the image above. Say there are two words HELL and HELLO in our list. One word (Hell) is the prefix of another (Hello). Now if we try to seach for HELL, we will match each letter, but we will never be able to tell if HELL was in the Trie, or it was there only because HELLO was there. We solve this by adding a $ to the end of each word, so that we can then check if there was a word HELL that ended there by also matching an end character '$'. We add the start char '^' so that all words can have a common root, thus making this a convienient tree to code up, with one root.

Demo: Insertions and Search in tries

Demo : Working with Tries

Interactive artefact where user can type in a word and ask for it to be inserted into the trie or searched for in the trie.

Step by Step Practice for Searching in Tries

Practice : Searching in Tries

Interactive artefact where user can click through the letters in a try trying to find a word and see the letters get highlighted in green if correct and red if wrong.

Hands on Exercise for Searching in Tries

Exercise : Seaching in Tries

Interactive artefact where user can try to find words generated by a computer

Revision Quiz for Bubble Sort

Q1. Upto how many children can each node in the trie have?

(A) 26 (As many letters as in the Alphabet)

(B) 27 (As many letters as in the Alphabet + 1)

(C) Depends on the size of the longest word

(D) Depends on the size of the shortest word

The number of children every node has can be any of the unique letters in the alphabet, which is A-Z and the termination character '$'. It has nothing to do with the length of the words. The length decides the depth, not the breadth.

Q2. The number of '$' characters in a Trie equals:

(A) The number of letters in the alphabet

(B) The number of insert operations performed.

(C) The number of search operations performed.

(D) The number of unique words entered into the Trie

Each unique word (only the unique inserts) added to the trie end in a '$', therefore the number of words and the number of '$' characters are the same.

Q3. If the letters '^', 'A', 'N', 'T', are encountered consecutively when searching but there is no '$' attached directly to the letter 'T', then which of the following are true?

(A) The Word "ANT" is present in the Trie and has been found.

(B) The Word "ANT" might be present somewhere else in the Trie but has not been found.

(C) The Word "ANT" is not present, but other words starting with the letters "ANT..." might be.

(D) The word ANT is not present, and is not the prefix or suffix of any other string.

If there is no '$' at the end of the search for 'ANT', then the word does not exists in the trie, but others with that prefix certainly do. This is the whole use of using the terminal character '$'.

Add words to a Trie

Estimated Time

20 minutes

Introduction

Introduction video for Tries and Suffix Tries.

Learning Objectives of this Module:

In this module, we will:

  • Learn how to insert a new word into a Trie
  • Optimize the trie to occupy less space
  • Practice adding in new words (Point of Insertion)
  • Test your conceptual understanding with a short quiz

How to Insert words into a Trie

Insertion Algorithm


We have understood what a trie is and how we can use one to search for words. Now let's take a look into how we can contruct a trie out of a collection of words by inserting each word into the trie one by one:

  • Convert any given string [WORD] into [^WORD$], where the ^ character denotes start of the word and $ denotes the end of the word. This is important as we have seen in the tries explanation, where one word is the prefix of another, eg. ANT and ANTLER, and we need to find whether ANT is present in the Trie or not.
  • We search for the letters one by one, and keep matching till they (the current prefix) are present in the trie. We stop as soon as we see the first letter that cannot be matched in the trie.
  • Now we branch out from the letter till where we matched, and add the new letter that could not be matched.
  • Now we keep adding all the remaining letters as children of the current letter which we are on, until the whole word is done and we then add a "$' to it.

Step by Step process

Image of the 4-step algorithm of inserting a word in a trie

Demonstration for Insertion in Tries

Demo for Insertion in Tries

Interactive artefact where user is tasked with finding the insertion points of words in a trie and can click the node of insertion and then node that will be inserted.

Optimizing the Trie with Path Compression

Can we store more Data in each node?


Again in our Trie, we see that there are multiple nodes each storing one letter, and only having one child. This is a waste of memory, since we can lower the number of nodes by a lot, if every single unbranched path could be compressed in a single node as below.. This idea would be extremely useful when we are working with text containing many words, and we need to look for patterns (In this way, we will be able to store only the start and end positions of these words making our nodes smaller - Don't worry if you don't understand - The idea will bear an elaborated explanation in Suffix Tries.)

A Compressed Trie



Image showing a sample compressed Trie.

Bounds on Complexity

Now we can prove stronger upper bounds on the memory complexity of the Trie.

  • All internal nodes of such a trie are branching (has atleast 2 children)
  • Each new unique word adds only one Leaf Node (a '$' sign at the end)
  • If there are N words, then the number of leaves is O(N) and number of internal nodes is also O(N), using the properties of a Tree. Therefore, total number of nodes is O(N).

This is a very crucial observation, and helps incredibly in many search operations.

A Quiz with Multiple Choice Questions

Q1. There is a Trie with the words [ CAT, CAGGLE, COG, CANOPY, CATIS ]. How many nodes does this Trie have including the start and end nodes?

(A) 5 Nodes

(B) 16 Nodes

(C) 21 Nodes

(D) 22 Nodes

The words are [ CAT, CAGGLE, COG, CANOPY, CATIS ]. When we insert CAT, we get 5 letters, ^CAT$. Then when we add CAGGLE, we add 5 letters again, GGLE$. we add COG with 3 letters OG$, then CANOPY with 5 letters NOPY$ and finally CATIS with 3 letters IS$. So the total is 5 + 5 + 3 + 5 + 3 = 21 nodes.

Q2. What is the memory complexity of a Trie with no path compression?

(A) O(Number of Words)

(B) O(Length of the Longest Word)

(C) O(Sum of Lengths of Words)

(D) O(Total Number of Unique Characters)

The worst case scenario is that none of the nodes get merged where we have each node containing a letter, so the number of letters is at max the sum of lengths of all words.

Q3. In the Trie with words [ HELLO, HELSINKI, HILBERT ] we are adding in the word HILLS. How many nodes will be created and added in this process?

(A) 0 Nodes

(B) 5 Nodes

(C) 2 Nodes

(D) 3 Nodes

(E) 1 Node

^HIL is already present, so we need to add in LS$, which is 3 new letters, 3 new nodes.

Solving String Search

Estimated Time

20 minutes

Introduction

Introduction video for Suffix Tries.

Suffix Trie for ABRACADABRA.

Image showing an example of a Suffix Tree.

Learning Objectives of this Module:

In this module, we will learn:

  • What a Suffix Tree is
  • A naive method of creating a suffix tree
  • Idea of how to solve string searching using a suffix tree
  • Use of Suffix Tries in the real world
  • Understand the maximum complexity of the Suffix Tree

Demo for how Suffix Tries are created for a String

Finding Patterns in a String

Now that we know how to search for a string in a list of strings, let's turn to finding a pattern (a string or a regular expression) in a huge bulk of text. This is what you do when you press CTRL+F to search in Microsoft Word. Lets try to see how we would achieve this:

  • We already know how to find a word in a huge list.
  • If we do not match till '$', we can also find any prefix in a list of words.
  • If our list is the set of all suffixes of a string, then we can match any prefix of suffix.
  • Any prefix of suffix is a substring, so we want to take the list of all suffixes of our text and match any prefix of it.
  • We can use a compressed trie to store this in O(n) space, since there are O(n) unique suffixes in a text of length n.

Search in a Text editor can be implemented this way.

Image of Example search in a text editor showing multiple results.

Demo for how Suffix Tries are created for a String

Exercise: Suffix Tries

Interactive artefact that take a word and generates it's Suffix Tree.

Post Test of the Experiment

Estimated Time

10 minutes

Instructions for Quiz

Post Test includes questions on entire experiment concepts. Read the questions in the quiz section and select the correct option from the ones provided. Please note that some questions may have more than one correct response.

ADVICE: Do sit with a Pen and a Paper, you may need to draw out some Tries while answering these questions.

A Quiz with Multiple Choice Questions

Q1. How many children would the root (^) have in the suffix tree of the word "SHESELLSSEASHELLS"?

(A) 3

(B) 5

(C) 13

(D) 17

The number of unique children of the root is the number of unique characters in the string, that is 5 (S, H, E, A, L).

Q2. In the suffix tree of "SHESELLSSEASHELLS", S is a child of the root. How many direct children does this node containing S have?

(A) 3 (H, E, S)

(B) 3 (H, E, A)

(C) 4 (S, H, E, A)

(D) Only 1 child

The letter S appears in the string followed by H, E and S only, therefore those are it's 3 children.

Q3. If you have a text file of n characters containing words all of which are under k characters in length, and you want to search for a word of length less than m in it, what can be the estimated complexity of this search? (Answers are in the form <PreprocessingTime, SearchTime>) Note: We are only trying to search for complete words, not patterns.

(A) <O(N), O(K)>

(B) <O(N), O(M)>

(C) <O(NK), O(M)>

(D) <O(K), O(M)>

For text of length N, it takes O(N) steps to insert all the words in a trie. Then to search in it for a word of length M, we need O(26 M) = O(M) steps.

Q4. Which of the following tasks can a Trie or a Suffix Tree be used for? (Select ALL that apply)

(A) In the Search feature of Microsoft Word.

(B) In sorting emails based on date.

(C) In Matching >Regular Expressions. (Patters of Text)

(D) Finnding synonyms from a Dictionary

(E) Suggesting autocomplete as you are typing.

We can use Suffix Tries for searching regular strings in a text editor, or in matching regular expressions (patterns - strings that have *, ., etc to represent any character or a group of characters). We can also use them for predictive text - all the words you ususally type form a Trie, and when you search for a part of a word that you have already typed out, all the members of the subtree of the result of the search are possible words you may want to type.

Further Readings and References of the Experiment

Explore More on Tries and Suffix Tries

You can explore more about Tries through following resources:


Useful Links :

Programming Challenges: If you have mastered Tries, write a program that Checks the Spellings of words in a file, and quickly reports all the misspelled words that exist. (Remember that the text is huge, so implement an efficient algorithm). You can try submitting this code here.

You can also start reading up on other varients/improvements over Suffix Tries:

  • Suffix Tries
  • Suffix Arrays
Also, for searching in string do look up the module on:
  • Knuth Morris Pratt (KMP) Algorithm