1.5 hours
Introduction video for Tries and Suffix Tries.
This experiment requires you to have basic knowledge about :
And above all, a curiosity to learn and explore!
In this experiment, you will be able to do the following:
Module | Weightage | Expectation |
---|---|---|
Pre-Test | 10% | Solve All questionss |
Searching in Tries | 30% | Understand how to find words in a Trie |
Inserting into a Trie | 30% | Understand how to add words to a Trie |
Suffix Tries | 10% | Understand the real world use of Suffix Tries |
Post-test | 20% | Solve all Questions |
10 minutes
A tree is a Data Structure which is arranged in the form of multiple nodes holding some data, connected to each other as in the figure below. The node at the top is called the root. All the nodes reachable from it, below it, are it's children, and the node connected to it, directly above it, i.e. on the path to the root is it's parent.
Q1. Which of the following is a prefix of the word 'ABRACADABRA'?
(A) CADABRA
(B) ABRA
(C) ABARD
(D) ABRACAD
A prefix of a word is a collection of letters that appears at the start of a word.
Q2. How many parents can each node (not the Root) of a Tree of Size N can have?
(A) N-1 Parents
(B) Upto N^2 parents
(C) 1 Parent
(D) 0 Parents
A tree is a graph where every node has one and only one parent, though one node may have multiple children.
Q3. How many parents does the root of a tree of size N can have?
(A) N-1 Parents
(B) Upto N^2 parents
(C) 1 Parent
(D) 0 Parents
A tree is a graph where every node has one and only one parent, though one node may have multiple children.
Q4. int a = 0, b = 0;
for (i = 0; i < N; i++) {
a = a + rand();
}
for (j = 0; j < M; j++) {
b = b + rand();
}
What is the time and space
complexity for the above code?
(A) O(N * M) time, O(1) space
(B) O(N + M) time, O(N + M) space
(C) O(N + M) time, O(1) space
(D) O(N * M) time, O(N + M) space
The first loop runs for O(n) time, second loop runs for O(m). Constant extra space is used for 2 variables a and b, hence O(1) space.
30 minutes
Video for Searching in Tries
In this module, we will:
Let's say we have a huge list of words, for example a Dictionary (Here in this example our list is [CAT, CAP, CAN, CADET]). You want to find out whether or not a word, let's say CABLE, is in this list. The naive approach to solving this problem is as follows.
We notice here that it is takes a lot of steps to search a word, and that we waste a lot of memory. Notice all the words that start with a 'C' followed by an 'A', for which we sort those letters over and over again for each word. Let's try to improve our approach.
The way we, as humans, approach looking at words in a dictionary, is that we try to find the first letter 'C', then in the pages starting with 'C', we try and find 'A', then 'B', then 'L', and 'E'. We want to take the same approach in our algorithm. We can optimize our search in the following way:
Video for searching in a trie
We have added a '^' to the start of each word and a '$' to the end. This is to solve the problem which is shown in the image above. Say there are two words HELL and HELLO in our list. One word (Hell) is the prefix of another (Hello). Now if we try to seach for HELL, we will match each letter, but we will never be able to tell if HELL was in the Trie, or it was there only because HELLO was there. We solve this by adding a $ to the end of each word, so that we can then check if there was a word HELL that ended there by also matching an end character '$'. We add the start char '^' so that all words can have a common root, thus making this a convienient tree to code up, with one root.
Interactive artefact where user can type in a word and ask for it to be inserted into the trie or searched for in the trie.
Interactive artefact where user can click through the letters in a try trying to find a word and see the letters get highlighted in green if correct and red if wrong.
Interactive artefact where user can try to find words generated by a computer
Q1. Upto how many children can each node in the trie have?
(A) 26 (As many letters as in the Alphabet)
(B) 27 (As many letters as in the Alphabet + 1)
(C) Depends on the size of the longest word
(D) Depends on the size of the shortest word
The number of children every node has can be any of the unique letters in the alphabet, which is A-Z and the termination character '$'. It has nothing to do with the length of the words. The length decides the depth, not the breadth.
Q2. The number of '$' characters in a Trie equals:
(A) The number of letters in the alphabet
(B) The number of insert operations performed.
(C) The number of search operations performed.
(D) The number of unique words entered into the Trie
Each unique word (only the unique inserts) added to the trie end in a '$', therefore the number of words and the number of '$' characters are the same.
Q3. If the letters '^', 'A', 'N', 'T', are encountered consecutively when searching but there is no '$' attached directly to the letter 'T', then which of the following are true?
(A) The Word "ANT" is present in the Trie and has been found.
(B) The Word "ANT" might be present somewhere else in the Trie but has not been found.
(C) The Word "ANT" is not present, but other words starting with the letters "ANT..." might be.
(D) The word ANT is not present, and is not the prefix or suffix of any other string.
If there is no '$' at the end of the search for 'ANT', then the word does not exists in the trie, but others with that prefix certainly do. This is the whole use of using the terminal character '$'.
20 minutes
Introduction video for Tries and Suffix Tries.
In this module, we will:
We have understood what a trie is and how we can use one to search for words. Now let's take a look into how we can contruct a trie out of a collection of words by inserting each word into the trie one by one:
Image of the 4-step algorithm of inserting a word in a trie
Interactive artefact where user is tasked with finding the insertion points of words in a trie and can click the node of insertion and then node that will be inserted.
Again in our Trie, we see that there are multiple nodes each storing one letter, and only having one child. This is a waste of memory, since we can lower the number of nodes by a lot, if every single unbranched path could be compressed in a single node as below.. This idea would be extremely useful when we are working with text containing many words, and we need to look for patterns (In this way, we will be able to store only the start and end positions of these words making our nodes smaller - Don't worry if you don't understand - The idea will bear an elaborated explanation in Suffix Tries.)
Now we can prove stronger upper bounds on the memory complexity of the Trie.
Q1. There is a Trie with the words [ CAT, CAGGLE, COG, CANOPY, CATIS ]. How many nodes does this Trie have including the start and end nodes?
(A) 5 Nodes
(B) 16 Nodes
(C) 21 Nodes
(D) 22 Nodes
The words are [ CAT, CAGGLE, COG, CANOPY, CATIS ]. When we insert CAT, we get 5 letters, ^CAT$. Then when we add CAGGLE, we add 5 letters again, GGLE$. we add COG with 3 letters OG$, then CANOPY with 5 letters NOPY$ and finally CATIS with 3 letters IS$. So the total is 5 + 5 + 3 + 5 + 3 = 21 nodes.
Q2. What is the memory complexity of a Trie with no path compression?
(A) O(Number of Words)
(B) O(Length of the Longest Word)
(C) O(Sum of Lengths of Words)
(D) O(Total Number of Unique Characters)
The worst case scenario is that none of the nodes get merged where we have each node containing a letter, so the number of letters is at max the sum of lengths of all words.
Q3. In the Trie with words [ HELLO, HELSINKI, HILBERT ] we are adding in the word HILLS. How many nodes will be created and added in this process?
(A) 0 Nodes
(B) 5 Nodes
(C) 2 Nodes
(D) 3 Nodes
(E) 1 Node
^HIL is already present, so we need to add in LS$, which is 3 new letters, 3 new nodes.
20 minutes
Introduction video for Suffix Tries.
Image showing an example of a Suffix Tree.
In this module, we will learn:
Now that we know how to search for a string in a list of strings, let's turn to finding a pattern (a string or a regular expression) in a huge bulk of text. This is what you do when you press CTRL+F to search in Microsoft Word. Lets try to see how we would achieve this:
Image of Example search in a text editor showing multiple results.
Interactive artefact that take a word and generates it's Suffix Tree.
10 minutes
Post Test includes questions on entire experiment concepts. Read the questions in the quiz section and select the correct option from the ones provided. Please note that some questions may have more than one correct response.
ADVICE: Do sit with a Pen and a Paper, you may need to draw out some Tries while answering these questions.
Q1. How many children would the root (^) have in the suffix tree of the word "SHESELLSSEASHELLS"?
(A) 3
(B) 5
(C) 13
(D) 17
The number of unique children of the root is the number of unique characters in the string, that is 5 (S, H, E, A, L).
Q2. In the suffix tree of "SHESELLSSEASHELLS", S is a child of the root. How many direct children does this node containing S have?
(A) 3 (H, E, S)
(B) 3 (H, E, A)
(C) 4 (S, H, E, A)
(D) Only 1 child
The letter S appears in the string followed by H, E and S only, therefore those are it's 3 children.
Q3. If you have a text file of n characters containing words all of which are under k characters in length, and you want to search for a word of length less than m in it, what can be the estimated complexity of this search? (Answers are in the form <PreprocessingTime, SearchTime>) Note: We are only trying to search for complete words, not patterns.
(A) <O(N), O(K)>
(B) <O(N), O(M)>
(C) <O(NK), O(M)>
(D) <O(K), O(M)>
For text of length N, it takes O(N) steps to insert all the words in a trie. Then to search in it for a word of length M, we need O(26 M) = O(M) steps.
Q4. Which of the following tasks can a Trie or a Suffix Tree be used for? (Select ALL that apply)
(A) In the Search feature of Microsoft Word.
(B) In sorting emails based on date.
(C) In Matching >Regular Expressions. (Patters of Text)
(D) Finnding synonyms from a Dictionary
(E) Suggesting autocomplete as you are typing.
We can use Suffix Tries for searching regular strings in a text editor, or in matching regular expressions (patterns - strings that have *, ., etc to represent any character or a group of characters). We can also use them for predictive text - all the words you ususally type form a Trie, and when you search for a part of a word that you have already typed out, all the members of the subtree of the result of the search are possible words you may want to type.
You can explore more about Tries through following resources:
Useful Links :
Programming Challenges: If you have mastered Tries, write a program that Checks the Spellings of words in a file, and quickly reports all the misspelled words that exist. (Remember that the text is huge, so implement an efficient algorithm). You can try submitting this code here.
You can also start reading up on other varients/improvements over Suffix Tries: