## Sunday, March 20, 2011

### A String Problem: Relative Suffix Tree

Given a string S of length n, its absolute suffix tree is a compressed trie of all the n suffixes of S. This is the classical data structure that is popular.

Given a string S of length n and a reference string R of length m, the relative suffix tree of S is its absolute suffix tree but with edges labelled by substrings NOT in R being pruned away (together with the subtree under those edges). Say m < n.

Is relative suffix tree meaningful, nontrivial? Any research problems, a la, best running time?

JeffE said...

This is a perfect question for cstheory.stackexchange.org !

4:00 PM
Unknown said...

Meaningful? Not sure. Non-trivial? Well, probably can be implemented along the following lines in O(n log log n) time, where n=|S|.

Build 2 suffix trees, one for S and one for S#R. For the latter, add functionality of weighted ancestor queries (see Perfect Hashing for Strings... Farach and Muthu), where each node is weighted with the length of the string (root-to-node) it represents. Also color each node in the suffix tree of S#R blue if it has leaf representing suffix from R or not.

For each edge in S labelled [i,j] from S, go to leaf representing suffix S_i in the suffix tree of S#R and do a weighted ancestor query with j-i (in O(log log n)) time. If the node is blue then the S-edge appears in R, o/w it does not and can be removed.

The overall time is O(n log log n).

--Moshe Lewenstein

5:53 PM
Anonymous said...

Hi Moshe,

Nice!

Say T(S) is the suffix tree of S.
Thought: In this case, I wonder if one might be able to jointly traverse T(S) and T(S#R) and try to answer all the S[i,j] queries needed in O(n) time altogether in a batch mode (even though \sum_edges |S[i,j]| is large) because each "edge string" is also a path from the root in the suffix tree.

- Metoo

5:47 AM
Unknown said...

Sure. You're right. It can be solved in O(n) time by the scan you suggest.

I like the problem. The definition is appealing. Wonder if there are natural extensions that are more challenging...

9:35 AM
Anonymous said...

Howdy Muthu (and Moshe),

I think a (possibly) simpler solution would be to traverse the suffix tree using (inverse) suffix links. For every node that is colored blue, any suffix link connected to it leads to a node that needs to be pruned. A linear time scan will suffice in detecting all such nodes.

Also, there is no need to create the suffix tree for S, one can just prune the suffix tree of S#R.

Regards,

- Tsvi

5:24 PM