A String Problem: Relative Suffix Tree
Given a string S of length n, its absolute suffix tree is a compressed trie of all the n suffixes of S. This is the classical data structure that is popular.
Given a string S of length n and a reference string R of length m, the relative suffix tree of S is its absolute suffix tree but with edges labelled by substrings NOT in R being pruned away (together with the subtree under those edges). Say m < n.
Is relative suffix tree meaningful, nontrivial? Any research problems, a la, best running time?
Given a string S of length n and a reference string R of length m, the relative suffix tree of S is its absolute suffix tree but with edges labelled by substrings NOT in R being pruned away (together with the subtree under those edges). Say m < n.
Is relative suffix tree meaningful, nontrivial? Any research problems, a la, best running time?
5 Comments:
This is a perfect question for cstheory.stackexchange.org !
Meaningful? Not sure. Non-trivial? Well, probably can be implemented along the following lines in O(n log log n) time, where n=|S|.
Build 2 suffix trees, one for S and one for S#R. For the latter, add functionality of weighted ancestor queries (see Perfect Hashing for Strings... Farach and Muthu), where each node is weighted with the length of the string (root-to-node) it represents. Also color each node in the suffix tree of S#R blue if it has leaf representing suffix from R or not.
For each edge in S labelled [i,j] from S, go to leaf representing suffix S_i in the suffix tree of S#R and do a weighted ancestor query with j-i (in O(log log n)) time. If the node is blue then the S-edge appears in R, o/w it does not and can be removed.
The overall time is O(n log log n).
--Moshe Lewenstein
Hi Moshe,
Nice!
Say T(S) is the suffix tree of S.
Thought: In this case, I wonder if one might be able to jointly traverse T(S) and T(S#R) and try to answer all the S[i,j] queries needed in O(n) time altogether in a batch mode (even though \sum_edges |S[i,j]| is large) because each "edge string" is also a path from the root in the suffix tree.
- Metoo
Sure. You're right. It can be solved in O(n) time by the scan you suggest.
I like the problem. The definition is appealing. Wonder if there are natural extensions that are more challenging...
Howdy Muthu (and Moshe),
I think a (possibly) simpler solution would be to traverse the suffix tree using (inverse) suffix links. For every node that is colored blue, any suffix link connected to it leads to a node that needs to be pruned. A linear time scan will suffice in detecting all such nodes.
Also, there is no need to create the suffix tree for S, one can just prune the suffix tree of S#R.
Regards,
- Tsvi
Post a Comment
<< Home