## Monday, February 18, 2008

### A String Problem

In string matching, there is a nice open problem from Gadi Landau that seems to need a chunk of combinatorics:

Input: A test string T of length n, may be preprocessed.
Query: A pattern string P of length m. Output is all locations i such that T[i,...,i+m-1] \equiv to P[1,...,m], where two strings s and t are \equiv if there is a permutation of s that equals t.

What is the closest to O(n) preprocessing and |P| query processing one can get? Notice that |P| may be smaller than m, since P can be presented as a vector of characters and their frequencies.

Inbok said...

I may misunderstand the problem, but can we move a window of size |P| from left to right? For each symbol count the number of occurrences, and maintain a variable which counts the number of counters which match the number of symbols in P. Each time we move the window to the right by one character, at most three counters will be changed. No preprocessing is required and the running time will be O(n+m).

7:05 PM
Anonymous said...

What you propose is an approach for solving the problem when text and the pattern are given together. The open problem is the indexing (or aka online) version where one wants query processing time to be o(n). You can spend ~\tilde(O)(n) time preprocessing the text once for all, but after that *each* query should be processed in time a function of |P| and something that is either independent of n, or if it can't be avoided, a mild function of n like log n.
-- metoo

9:02 PM
Inbok said...

After writing the comment I found what I misunderstood. :(

10:24 PM
Rasmus said...

A simple solution would be to take the set of sorted substrings T[i,...,i+m-1] and store a hash value for each sorted substring. To look up P, you sort it, hash, and look for the hash value in the stored set. This of course assumes that m is known when constructing the data structure.

11:59 PM
Anonymous said...

Yes, in the indexing problem, alas, m is known only at the query time.
-- metoo

4:17 AM
Anonymous said...

Will the answer also depend on the frequency of occurrence of the string P in the string T?

Assuming finite alphabet size, one way might be to first construct the suffix array of the string T, sort them and store the locations in an array. For m=1, we can then directly look up from the array. For m=2, we can now proceed by shortlisting the two candidate suffix sets and then look at the second alphabet in these sets and so on. Though, I am not sure of the complexity of sorting of teh suffix array (the preprocessing)

7:53 AM
Anonymous said...

The approach of progressively refining some set of candidate positions as you propose may work in practice. But it may take O(n) time or more in simple examples and it is not clear that is needed. Say T = (acb)^n, and pattern you are looking for is (ab). Both a and b have O(n) sized lists, but their adjusted intersection has none.
-- metoo

3:54 AM
Anonymous said...

If I understand the notation correctly the string T is given by

T=acbacbacbacb.....

Since the pattern we are looking for is ab there will be O(n) occurrences of P in T as there will be O(n) occurrences of ba.

This is what I was referring to when I was wondering if the answer also depends on the frequency of occurrence of the string P in the string T?

7:44 AM
Anonymous said...

Sorry, let T=(acbd)^n.
-- metoo

2:17 PM