Introduction to KMP Algorithm

Estimated Time

2.5 hour

Prerequisites of the Experiment

This experiment requires you to have basic knowledge about :

And above all, a curiosity to learn and explore!

Overview of the Experiment

  • The aim of this experiment is to understand the intricacies of string searching and then delve further into the Knuth-Morris-Pratt Algorithm, its Time and Space Complexity, and how it compares against naive searching algorithm.
  • The experiment features a series of modules with video lectures, interactive demonstrations, simulations, hands-on practice exercises and quizzes for self analysis.

Learning Objectives of the Experiment

In this experiment, you will be able to do the following:

  • Search for a pattern in the string using naive string search method.
  • Find the Lowest-Prefix-Suffix array for a search pattern by preprocessing
  • Given a search string and a pattern to search in that string, you will be able to search for that pattern in the search string using KMP algorithm
  • Understand the intricacies of KMP algorithm and Naive string searching algorithm.

Experiment Modules and their Weightage

Module Weightage Expectation
Pre-Test 10% Solve All Questions
Naive String Searching 25% Understand basic algorithm of Naive String Searching
Preprocessing for LPS 25% Understand the Lowest-Prefix-Suffix array creation process
KMP Algorithm 25% Understand the Knuth-Morris-Pratt Algorithm
Post-test 15% Solve all Questions

Pre-Test of the Experiment

Estimated Time

10 minutes

Instructions for Pre-Test

  • Pretest includes questions on prefix and suffix
  • Read the questions in the quiz section and select the correct option from the ones provided. Please note that some questions may have more than one correct response.

A Quiz with Multiple Choice Questions about Substrings and Time-Complexity notations

Q1. Which of these best describes a substring?

(A) A contiguous sequence of characters within a string in any order

(B) contiguous sequence of characters within a string in the same order

(C) Any combination of a subset of characters from a string in any order.

(D) Any combination of a subset of characters from a string in the same order as the string.

A substring is a contiguous sequence of characters within a string. For instance, "the best of" is a substring of "It was the best of times". This is not to be confused with subsequence, which is a generalization of substring. For example, "Itwastimes" is a subsequence of "It was the best of times", but not a substring.

Q2. Which one of the following is not a prefix of the word "TIKI-TAKA-TIK-TI"?

(A) TAKA

(B) TIKI

(C) TI

(D) TIK

TAKA is not a prefix of the word.

Q3. What does it mean when we say that an algorithm X is asymptotically more efficient than Y?

(A) X will always be a better choice for small inputs

(B) X will always be a better choice for large inputs

(C) Y will always be a better choice for small inputs

(D) X will always be a better choice for all inputs

Performance could be similar for small inputs, but definitely better for larger inputs.

Q4. int a = 0, b = 0;
for (i = 0; i < N; i++) {
a = a + rand();
}
for (j = 0; j < M; j++) {
b = b + rand();
}
What is the time and space complexity for the above code?

(A) O(N * M) time, O(1) space

(B) O(N + M) time, O(N + M) space

(C) O(N + M) time, O(1) space

(D) O(N * M) time, O(N + M) space

The first loop runs for O(n) time, second loop runs for O(m). Constant extra space is used for 2 variables a and b, hence O(1) space.

Naive String Searching

Estimated Time

30 minutes

Learning Objectives of this Module

In this module, we will :

  • Get introduced to string searching
  • Learn the algorithm of naive string searching
  • Learn the time and space complexity of naive string searching
  • Practice the algorithm
  • Test your conceptual understanding with a short quiz

Introduction : Why?

Why we need String Searching?

  • Surprisingly, strings are more often used in computer science than numbers. May it be editing text documents in word processors, reading our favourite blogs or even a basic search on google.
  • Strings are everywhere. Hence it is important to study and understand the methods to manipulate, traverse and search among them.

Real Life Applications

Efficient string searching is crucial to the way the internet functions today. They have a plethora of uses and applications:

  • Spell Checkers
  • Intrusion Detection systems
  • Predictive Seach Engines
  • Plagiarism detection
  • Information Retrieval System etc.

Basic Approach

If we were to approach this problem of searching for a pattern inside a string, the first trivial solution would be a brute-force method. We will be examining this method in the next task of this learning unit.

How Naive String Searching Works?

Explanation

The brute-force method to find a pattern in a given string text is:

  • Have two pointers strInd and patInd pointing at string beginning and pattern beginning.
  • Compare the characters present at the pointer.
  • If they match, increase both the pointers by one.
  • Repeat unless all the characters of pattern match or there is a mismatch
  • If all the letters of the pattern have been matched then we have found the pattern in the string.
  • If there is a mismatch:
    • Bring the patInd pointer back to the beginning of the pattern
    • Bring the strInd pointer to the second position of the string now and Repeat the process.
    • If another mismatch is encountered , bring the strInd to the 3rd position of the string and so on.
  • If strInd ever increases beyond the length of the string then return 0 since it means that the pattern doesn't exist in the string.

Iteration by Iteration Visualization of Naive String Searching Algorithm

Image shows Iteration by Iteration Visualization of Naive basiyan

Time Complexity

Note: It is recommended to see the interactive demo task before reading this.

Deciding the big O notation for any algorithm is a function of the worst-case scenario of the algorithm. Let us consider a case - the pattern of length M to be searched in a string of length N always fails on the pattern's last character and succeeds only at the end of the string.

Example :
String : AAAAAAAAAAAAAAAAAAAAAAX
Pattern : AAAAAX

Assuming that the time required to compare two characters is O(1) then the above case will take Time = O(M \times N)
Hence the naive string searching algorithm is of the complexity \(O(M \times N)\)

Demonstration of the Algorithm

Demo : Simple String Searching Demo

Interactive artefact where user can click on Next Step to understand how string searching works.

Revision Quiz for Naive String Searching

Q1. What do we do when matching of a character fails while comparing the pattern to the string?

(A) Shift the pattern by one unit

(B) Set the value of strInd to strInd - patInd + 1

(C) Set the value of patInd to 0

(D) All of the above

The pattern is shifted to the right by one by setting the patInd to 0 after setting strInd to strInd - patInd + 1

Q2. What is theoretrically the best time complexity for any string searching algorithm for searching a pattern of length \(M\) in a text of length \(N\)?

(A) \( O(N+M) \)

(B) \( O(M \times N) \)

(C) \( O(N^M) \)

(D) None of these

It is necessary to read all the characters of the text atleast once inorder to deterministically say that the pattern doesn't exist in the text.

Q3. What is the time complexity of Naive-String-Searching Algorithm?

(A) \( O(N+M) \)

(B) \( O(N \times M) \)

(C) \( O(N^M) \)

(D) None of these

Refer to "Concept and Strategy : Naive" for the explanation.

Knuth-Morris-Pratt Algorithm

Estimated Time

40 minutes

Learning Objectives of this Module

In this module, we will:

  • Learn KMP string searching algorithm.
  • Observe some characteristics of the algorithm
  • Understand how we can use them optimise the algorithm
  • Practice the algorithm
  • Test your conceptual understanding with a short quiz

Intro to Lowest-Prefix-Suffix Array

Building Intuition

  • Let us think of a method by which we can make our bruteforce method more efficient. We notice that we have been doing redundant matching when the matching of a pattern in a string fails since we are moving the string Index back. Can we possibly track the information we get from previous comparisons?
  • We have to exploit the degenerating property of the pattern we are searching. i.e pattern having same sub-patterns appearing more than once in the pattern.

Finding Repeating Substrings

image for showing the overlap goes here

The Longest-Prefix-Suffix Array

The name LPS indicates the longest proper prefix which is also a suffix of the matched pattern.

Example:
AA in
AABAA is a prefix as well as a suffix.
In case AABAAC is the pattern which we are searching in some unknown string. And we happen to know that after successfully matching AABAA, C is a mismatch.
Then we can start matching the 3rd letter of pattern(B) to the letter at strInd instead of starting from the beginning.

Definition :
Let us define the LPS array for a pattern such that lps[i] stores the index of the pattern from where we must begin comparing if the match fails after rightly matching the pattern until that point.

Concept And Strategy : KMP

Explanation

Unlike the bruteforce approach where we just shift the pattern by and start comparing again, we must use a value from the LPS Array to decide which character to compare next. The core idea is to avoid comparing the characters which we know will match correctly anyway.

Algorithm

  • We start comparing pattern (pat[patInd]) with patInd = 0
  • We keep comparing letters pat[patInd] and string[strInd] and increment patInd and strInd as long as they match
  • When we see a mismatch :
    • We are aware that pat[0 to patInd-1] match correctly with string[strInd-patInd to strInd-1].
    • From the way we had defined LPS Array, we know that LPS[patInd-1] is a count of characters of pat[0 to patInd-1] which are both prefix and suffix.
    • So, we don’t need to compare these characters as they are going to match anyway.

Step by Step iteration example of KMP Algorithm

Image for kmp algo made by friendly neighbourhood spiderman

Pseudocode

                            strInd = 0
                            patInd = 0
                            
                            while(strInd < string.length()) {
                                if(pat[patInd] == string[strInd]){
                                    if(patInd == pat.length()-1){
                                        print("Match Found at" + strInd-pat.length() )
                                        return 
                                    }
                                    
                                    strInd++;
                                    patInd++;
                                }
                                else {
                                    if(patInd != 0){
                                        patInd = LPS[patInd-1]
                                    }
                                    else {
                                        strInd++;
                                    }
                                }
                            }
                            

Demonstration for KMP String Matching

KMP String Searching Demo

Interactive artefact where user can click on Next Step to understand how the kmp string searching algorithm works.

Demonstration for KMP String Matching

KMP String Searching Practice

Interactive artefact where user can click on Next Step to understand how the kmp string searching algorithm works.

Demonstration for KMP String Matching

KMP String Searching Exercise

Interactive artefact where user can click on Next Step to understand how the kmp string searching algorithm works.

A Quiz with Multiple Choice Questions

Q1. Which one of the following is true in KMP String searching algorithm?

(A) The value of strInd always increases in all iterations.

(B) The value of strInd remains same in all the iterations.

(C) The value of strInd may or may not increase in all the iterations but it never decreases.

(D) None of the above

Refer "Concept and Strategy : KMP " for explanation.

Q2. How does KMP improve the brute-force-method?

(A) By comparing the characters right to left instead of left to right.

(B) By searching for text in pattern instead of pattern in string

(C) By not doing reduntant comparisons and using the information gained from previous comparisons

(D) None of the above

Refer to "Concept and Strategy : KMP"

Q3. What is the time complexity of KMP Algorithm?

(A) \( O(M \times N) \)

(B) \( O(M ^ N) \)

(C) \( O(M + N) \)

(D) All of the above

The complexity of this algorithm will be discussed in detail in the next learning unit.

Preprocessing of the KMP Algorithm

Estimated Time

35 minutes

Learning Objectives of this Module

In this module, we will be learning about :

  • Time and Space Complexity: We will learn about the running time of the preprocessing and KMP algorithm.
  • The preprocessing algorithm required for the KMP algorithm.
  • Comparison with the Naive string searching algorithm.

Concept and Strategy : Preprocessing

Introduction

  • The preprocessing for the KMP algorithm involves using the pattern string to create an auxiliary array called LPS (of size same as that of the pattern) which will be used to skip character comparisons while matching to save time.
  • The name LPS indicates longest proper prefix which is also a proper suffix. LPS[i] essentially stores the length of maximum proper prefix which is also a suffix of the string LPS[0 to i] .

Explanation and Algorithm

How to fill LPS[i]?
  • We need to know the longest suffix until i that matches with prefix. The longest suffix till i-1 that matches with some prefix (say Pat[0 to j-1]) is Pat[i-j to i-1] .
  • If the character Pat[j] matches with Pat[i] then the value of LPS[i] = LPS[i-1] + 1
  • And if Pat[i] != Pat[j] , then the longest suffix till i might be the second longest suffix till i-1 along with the character Pat[i] and the character next to the second longest prefix of i-1 and so on.
  • How do we find the second longest suffix until i-1 and so on?
    • We know that Pat[0 to j-1] is same as Pat[i-j to i-1] and LPS[i-1] = j .
    • Therefore, finding the second longest prefix suffix is same as finding the longest prefix suffix of Pat[0 to j-1] since this suffix of Pat[0 to i-1] would also be the suffix of Pat[0 to j-1].
    • LPS[i] would be equal to LPS[LPS[i-1]] + 1 if Pat[i] = Pat[LPS[i-1]].
    • This same logic can be used recursively to find LPS[i].

Pictorial Representation of Preprocessing

Image here

Pseudocode

                                function LPS_ArrayGenerator(string Pat){
                                    LPS = Array[Pat.length] 
                                    
                                    j = 0
                                    LPS[0] = 0
                                    
                                    for(int i=1 ; i < Pat.length() ; i++){
                                        while( j > 0 && Pat[i] != Pat[j] ) {
                                            j = LPS[j-1]
                                        }
                                        if(Pat[i] == Pat[j]){ j++ }
                                        LPS[i] = j 
                                    }
                                    
                                    return LPS
                                }
                            

Demonstration of the Preprocessing Algorithm

Preprocessing Demonstration

Interactive artefact where user can click on Next Step to understand how the kmp string searching algorithm works.

Time and space complexity

Time complexity of the preprocessing algorithm

  • There are two nested loops in the algorithm.
  • The outer loop increases the value of j by 1 in each iteration and hence it can increase the value of j to the maximum of \(m\) (where \(m\) = length of the pattern) throughout the runtime of the algorithm.
  • The inner while loop decreases the value of j by 1 in each iteration hence it can run for the maximum of m times throughout the runtime of the algorithm.
  • There the upper bound to the runtime of the entire preprocessing algorithm is \(2m\) which is equal to \(O(m)\) .

Time complexity of the KMP search algorithm

  • In the entire run of the algorithm we observe that strInd(variable that tracks the index of the string) never decreases and increases by atleast 1 in each iteration of the code.
  • This shows that the upper bound of the runtime of the algorithm is the length of the text string in which the pattern has to be searched.
  • Hence the time complexity = \(O(n)\) where n = length of the text string.

Comparison with the Naive string searching algorithm

  • The time complexity of the naive string searching algorithm was found to be \(O(m \times n)\) in the worst case.
  • Whereas we have found a method to reduce the runtime complexity of the algorithm to \(O(m + n)\) using the information we gain from all our previous comparisons.

A Quiz with Multiple Choice Questions

Q1. What is the time complexity of the prepropressing function for a pattern of length \(M\) ?

(A) \(O(M^2)\)

(B) \(O(M^3)\)

(C) \(O(\sqrt{M})\)

(D) \(O(M)\)

Refer to "Concept and Strategy : Preprocessing"

Q2. What is the longest-prefix-suffix for: "ANIDAPOPOANIDA"?

(A) 5

(B) 6

(C) 2

(D) 3

ANIDA is the longest prefix which is also a suffix.

Q3. What is the longest-prefix-suffix for TOOTOOTOOT?

(A) 1

(B) 4

(C) 5

(D) 2

TOOT is the longest prefix suffix.

Post Test of the Experiment

Estimated Time

10 minutes

Instructions for Quiz

Post Test includes questions on entire experiment concepts. Read the questions in the quiz section and select the correct option from the ones provided. Please note that some questions may have more than one correct response.

A Quiz with Multiple Choice Questions

Q1. What is the overall complexity of the KMP-String-Searching algorithm?

(A) \(O(N \times M)\)

(B) \(O(N^2)\)

(C) \(O(N+M)\)

(D) None of the above

Refer to the Analysis task in preprocessing learning Unit.

Q2. What is the space complexity for the KMP String Searching algorithm?

(A) \(O(M)\)

(B)\(O(N)\)

(C) \(O(N+M)\)

(D) None of the above

\(O(M)\) to store the LPS Array.

Q3. What is the LPS array for the pattern: "ababbabbabbababbabb" ?

(A) 001201201001232101

(B) 0012012012012345678

(C) 012344567891230120

(D) 012344567891230123

Enter the above pattern in the preprocessing demonstration pane to get a thorough understanding.

Q4. Find the LPS array for "AABAAABAA"?

(A) 010122345

(B) 010000123

(C) 012345678

(D) 010010010

Enter the above pattern in the preprocessing demonstration pane to get a thorough understanding.

Q5. Using the LPS array from the previous question, which character would you start comparing from if you know that the matching failed at AABAAABAA?

(A) 0

(B) 1

(C) 6

(D) 3

A look at the LPS array(010122345) would reveal to us that AABAAAB has longest-suffix-prefix of length 3.

Further Readings and References of the Experiment

Explore More About KMP algorithm

You can explore more about the KMP string searching algorithm through the following resources :

Useful Links : Reference Texts :
  • Introduction to Algorithms : Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
  • Data Structures and Algorithms : Mark A. Weiss