Suffix Arrays
Created: 2019-11-06 Wed 15:37
Objectives
Your Objectives:
- Explain how to find an occurrence of a substring of length \(m\) in a text of length \(n\) in \(O(m \log n)\) time.
- Find the longest repeated substring in a given text.
Example
- Suppose you have a string of 100,000,000 characters. You want to know
if a certain string is in there. How can we do this quickly?
- E.g. Genetic codes
GATTACAGATTACAGATTACA
The idea
- There is a data structure called a "suffix tree".
- It uses a lot of memory, so we are not going to use it.
- We will use a suffix array instead.
Example
- Suppose we have this string:
this is his
- List every suffix next to its starting index (suffixes 4 and 7 begin with a space):
| 0 | "this is his" |
| 1 | "his is his" |
| 2 | "is is his" |
| 3 | "s is his" |
| 4 | " is his" |
| 5 | "is his" |
| 6 | "s his" |
| 7 | " his" |
| 8 | "his" |
| 9 | "is" |
| 10 | "s" |
Example, ctd.
- Next, sort the suffixes lexicographically (a space sorts before any letter)…
| 7 | " his" |
| 4 | " is his" |
| 8 | "his" |
| 1 | "his is his" |
| 9 | "is" |
| 5 | "is his" |
| 2 | "is is his" |
| 10 | "s" |
| 6 | "s his" |
| 3 | "s is his" |
| 0 | "this is his" |
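The two steps above (list the suffixes, then sort them) can be sketched in a few lines of Python. This is only the naive construction, and the function name `build_suffix_array` is an illustrative choice, not something from the slides.

```python
def build_suffix_array(text):
    """Naive construction: sort the starting indices by the suffix each one begins.

    Every comparison may scan up to len(text) characters, so this is the
    slow, direct method rather than one of the faster algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    print(build_suffix_array("this is his"))
    # prints [7, 4, 8, 1, 9, 5, 2, 10, 6, 3, 0]
```

Only the integer indices are kept; the suffixes themselves are never stored, since each one can be read back out of the original text.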
Details
+--------------------------------------------+
| 7 | 4 | 8 | 1 | 9 | 5 | 2 | 10 | 6 | 3 | 0 |
+--------------------------------------------+
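- We store only the starting indices, in sorted order. That array is the suffix array.
- Building it by sorting the suffixes directly takes \(O(n^2 \log n)\) time: \(O(n \log n)\) comparisons, each costing up to \(O(n)\).
- There are \(O(n \log n)\) and \(O(n)\) construction algorithms too!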
- How to use it…
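- To search for the string
his
binary-search the sorted suffixes, comparing the pattern with the start of each suffix.
- To find the length of the longest repeated substring? Compare neighbouring suffixes in sorted order and take the longest shared prefix.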
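As a rough sketch of the search, the following Python binary-searches the suffix array for a pattern. The function name `find_occurrence` is hypothetical, and the suffix array is simply copied from the Details slide rather than computed. Each step compares the pattern with the first \(m\) characters of one suffix, which gives \(O(m \log n)\) time for a pattern of length \(m\).

```python
def find_occurrence(text, suffix_array, pattern):
    """Return the start index of one occurrence of pattern in text, or -1.

    Binary search for the first suffix whose leading len(pattern)
    characters are >= pattern, then check whether it is an exact match.
    """
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(suffix_array):
        start = suffix_array[lo]
        if text[start:start + m] == pattern:
            return start
    return -1


if __name__ == "__main__":
    text = "this is his"
    suffix_array = [7, 4, 8, 1, 9, 5, 2, 10, 6, 3, 0]  # from the Details slide
    print(find_occurrence(text, suffix_array, "his"))  # prints 8
    print(find_occurrence(text, suffix_array, "xyz"))  # prints -1
```

All suffixes that start with the pattern sit next to each other in the sorted order, so the other occurrences (here, index 1) are the entries that immediately follow the one found.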
Related structure: LCP (longest common prefix) array
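- The third column is the length of the prefix each suffix shares with the suffix on the row above.
| 7 | " his" | |
| 4 | " is his" | 1 |
| 8 | "his" | 0 |
| 1 | "his is his" | 3 |
| 9 | "is" | 0 |
| 5 | "is his" | 2 |
| 2 | "is is his" | 3 |
| 10 | "s" | 0 |
| 6 | "s his" | 1 |
| 3 | "s is his" | 2 |
| 0 | "this is his" | 0 |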
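The third column of the table can be reproduced with a direct character-by-character comparison of neighbouring suffixes, and its largest entry marks a longest repeated substring. The sketch below is a minimal version under a few assumptions: the names `lcp_array` and `longest_repeated_substring` are illustrative, the suffix array is copied from the Details slide, and the first entry (blank in the table) is stored as 0. It is the simple quadratic approach; linear-time methods such as Kasai's algorithm exist but are not shown.

```python
def lcp_array(text, suffix_array):
    """lcp[k] = length of the prefix shared by the suffixes at sorted
    positions k-1 and k. lcp[0] has no predecessor and is set to 0."""
    n = len(text)
    lcp = [0] * len(suffix_array)
    for k in range(1, len(suffix_array)):
        a, b = suffix_array[k - 1], suffix_array[k]
        length = 0
        while a + length < n and b + length < n and text[a + length] == text[b + length]:
            length += 1
        lcp[k] = length
    return lcp


def longest_repeated_substring(text, suffix_array):
    """Return one longest substring that occurs at least twice in text."""
    lcp = lcp_array(text, suffix_array)
    if not lcp:
        return ""
    k = max(range(len(lcp)), key=lambda i: lcp[i])  # first position of the maximum
    start = suffix_array[k]
    return text[start:start + lcp[k]]


if __name__ == "__main__":
    text = "this is his"
    suffix_array = [7, 4, 8, 1, 9, 5, 2, 10, 6, 3, 0]  # from the Details slide
    print(lcp_array(text, suffix_array))
    # prints [0, 1, 0, 3, 0, 2, 3, 0, 1, 2, 0]
    print(longest_repeated_substring(text, suffix_array))
    # prints his
```

The maximum value 3 appears twice in this example, so "is " (with a trailing space) is an equally long repeated substring; the sketch reports whichever comes first in sorted order.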