### Median substring

To a few who emailed and wanted to know: in the Duplicate problem below, the array is destroyed if it is not a permutation of the input. An anonymous mailer asked if an O(n) time algorithm exists when arithmetic operations are allowed.

Here is another algorithmic problem. Given an array A[1,n] of numbers accessible by comparisons only, it is well known that selection---finding them item of any rank k in the sorted order---takes O(n) time and is easier than sorting which takes O(n log n) time. Is that true for strings as well? Formally:

The input is a string S[1,n] accessible by comparisons only. Find the median substring, that is, find the substring S[i,n] which has the rank n/2 in the lexicographic ordering of the suffixes S[j,n] for j=1,...,n. This can be solved in O(n log n) time by mapping the alphabet down to [1,n] using sorting and then using an O(n) time suffix tree construction algorithm. This algorithm in fact sorts all the suffixes. Does there exist an O(n) time algorithm for the median substring problem?

Here is another algorithmic problem. Given an array A[1,n] of numbers accessible by comparisons only, it is well known that selection---finding them item of any rank k in the sorted order---takes O(n) time and is easier than sorting which takes O(n log n) time. Is that true for strings as well? Formally:

The input is a string S[1,n] accessible by comparisons only. Find the median substring, that is, find the substring S[i,n] which has the rank n/2 in the lexicographic ordering of the suffixes S[j,n] for j=1,...,n. This can be solved in O(n log n) time by mapping the alphabet down to [1,n] using sorting and then using an O(n) time suffix tree construction algorithm. This algorithm in fact sorts all the suffixes. Does there exist an O(n) time algorithm for the median substring problem?

## 5 Comments:

Interesting problem. I suppose the KMP algorithm can be modified easily enough to compute the rank of a suffix in linear time...

Yes. So, for example, one can pick a random suffix S[i,n] and use the KMP algorithm to figure out which suffixes are ranked below and which above, in O(n) time. Now if one can recurse and solve the correct subproblem in time linear in the number of substrings in it (and not O(n)),....

After O(loglog n) passes of such random pivoting (with KMP modified not just to rank the suffix but to compare it against all other suffixes, so we can tell which ones to eliminate) we'd be down to O(n / log n) or even O(n / log^2 n) suffixes left, few enough that it might be possible to sort them if only we could compare them quickly. The comparisons are easy if we have an LCA structure on a suffix tree, but...

Say we are given a subset S of suffixes. Then a suffix tree can be constructed only on those suffixes in linear time by the result of Andersson, Larsson and Swanson (Suffix tree on words), but it is not clear if that is helpful since they primarily focused on keeping space O(|S|) and did not explicitly work on the comparison model.

A natural start to ranking the suffixes of a string is to grow each suffix one character at a time and grouping them naturally based on identical characters at each step. This has some worst case behavior I am sure, but does it help for finding the lexicographically smallest (rank=1) substring?

This comment has been removed by a blog administrator.

Post a Comment

<< Home