Yashica Patodia

I am a CSE pre-final undergraduate at IIT KGP .I am an open source enthusiast and love doing cp.

DISTINCT SUBSTRINGS OF A STRING

28 Aug 2020 »

First Approach- BRUTE FORCE

Complexity O(n^2(n+logn))

The first approach which comes to mind is brute force .In this approach we are using a set to store all the distinct substrings. We can convert this complexity to n^3 by using an array instead of a set .(Insert operation in set is causing the logn factor)

string str;
	cin>>str;
	int n=str.length();
	set<string>st;
	for(int len=1;len<=n;len++)
	{
		for(int l=0;l<=n-len;l++)
		{
			int r=len+l-1;
			string s=str.substr(l,len);
			st.insert(s);
			
		}
	}
	cout<<st.size()<<endl;

Second Approach-KNUTH-MORRIS-PRATT ALGORITHM

Complexity O(n^2)

We will solve this problem iteratively. Namely we will learn, knowing the current number of different substrings, how to recompute this count by adding a character to the end.

So let k be the current number of different substrings in s, and we add the character c to the end of s. Obviously some new substrings ending in c will appear. We want to count these new substrings that didn’t appear before.

We take the string t=s+c and reverse it. Now the task is transformed into computing how many prefixes there are that don’t appear anywhere else. If we compute the maximal value of the prefix function πmax of the reversed string t, then the longest prefix that appears in s is πmax long. Clearly also all prefixes of smaller length appear in it.

Therefore the number of new substrings appearing when we add a new character c iss+1−π(max).

So for each character appended we can compute the number of new substrings in O(n) times, which gives a time complexity of O(n2) in total.

vector<int> prefix_function(string s) {
    int n = (int)s.length();
    vector<int> pi(n);
    for (int i = 1; i < n; i++) {
        int j = pi[i-1];
        while (j > 0 && s[i] != s[j])
            j = pi[j-1];
        if (s[i] == s[j])
            j++;
        pi[i] = j;
    }
    return pi;
}
string str;
	cin>>str;
	int n=str.length();
	int ans=0; 
	string sub=str.substr(0,1);
	ans++;//for string of length 1
	for(int i=1;i<n;i++)
	{
		sub+=str[i];
		reverse(sub.begin(),sub.end());
		vector<int>pi=prefix_function(sub);
		int mx=*max_element(pi.begin(),pi.end());
		reverse(sub.begin(),sub.end());
		ans+=i+1-mx;
	}
	cout<<ans<<endl;
	

Third Approach- USING SUFFIX TRIES

Complexity- O(n^2)

The idea is create a Trie of all suffixes of given string called the Suffix Trie. Once the Trie is constricted, our answer is total number of nodes in the constructed Trie.

int ans;
const int sigma=26;
typedef struct node
{
	node *next[sigma];
	int wend;
	int prefix;
	node()
	{
		wend=0;
		prefix=0;
		for(int i=0;i<sigma;i++)
			next[i]=NULL;
	}
}trie;
trie *head;
void insert(string &s)
{
	trie *cur=head;
	for(int i=0;i<s.length();i++)
	{
		if(cur->next[s[i]-'a']==NULL)
		{
			cur->next[s[i]-'a']=new trie();
			ans++;
			
		}
		cur=cur->next[s[i]-'a'];
		cur->prefix++;
		

	}
	cur->wend++;
	
}
string str;
	cin>>str;
	int n=str.length();
	head =new trie();
	ans=0;
	string sub=str.substr(n-1,1);
	insert(sub);
	for(int i=n-2;i>=0;i--)
	{
		sub=str[i]+sub;
		insert(sub);

	}
	cout<<ans<<endl;
	

Fourth Approach- USING SUFFIX ARRAYS

Complexity -O(nlogn)

This is the most optimised approach of finding the number of distinct substrings.

A suffix array is a sorted array of all suffixes of a given string.After finding the suffix array we need to construct lcp(longest common prefix) of the array. LCP is basically the longest coomon prefix of two consecutive strings.LCP[0] is not defined and is generally taken as 0. We can construct the suffix array in O(nlogn) time complexity and the lcp in O(n) using Kasai’s Algorithm.

We are going to sort cyclic shifts, we will consider cyclic substrings. We will use the notation s[i…j] for the substring of s even if i>j. In this case we actually mean the string s[i…n−1]+s[0…j]. In addition we will take all indices modulo the length of s, and will omit the modulo operation for simplicity. In each iteration of the algorithm, in addition to the permutation p[0…n−1], where p[i] is the index of the i-th substring (starting at i and with length 2k) in the sorted order, we will also maintain an array c[0…n−1], where c[i] corresponds to the equivalence class to which the substring belongs. At the beginning (in the 0-th iteration) we must sort the cyclic substrings of length 1, that is we have to sort all characters of the string and divide them into equivalence classes (same symbols get assigned to the same class). This can be done trivially, for example, by using counting sort. We use here the technique on which radix sort is based: to sort the pairs we first sort them by the second element, and then by the first element (with a stable sort, i.e. sorting without breaking the relative order of equal elements). However the second elements were already sorted in the previous iteration. We preprocess the string s by computing the suffix array and the LCP array. Using this information we can compute the number of different substrings in the string.

Because the suffixes are sorted, it is clear that the current suffix p[i] will give new substrings for all its prefixes, except for the prefixes that coincide with the suffix p[i−1]. Thus, all its prefixes except the first lcp[i−1] one. Since the length of the current suffix is n−p[i], n−p[i]−lcp[i−1] new suffixes start at p[i]. Summing over all the suffixes, we get the final answer:

 vector<int> sort_cyclic_shifts(string const& s) {
    int n = s.size();
    const int alphabet = 256;
    vector<int> p(n), c(n), cnt(max(alphabet, n), 0);
    for (int i = 0; i < n; i++)
        cnt[s[i]]++;
    for (int i = 1; i < alphabet; i++)
        cnt[i] += cnt[i-1];
    for (int i = 0; i < n; i++)
        p[--cnt[s[i]]] = i;
    c[p[0]] = 0;
    int classes = 1;
    for (int i = 1; i < n; i++)
     {
        if (s[p[i]] != s[p[i-1]])
            classes++;
        c[p[i]] = classes - 1;
    }
    vector<int> pn(n), cn(n);
    for (int h = 0; (1 << h) < n; ++h) {
        for (int i = 0; i < n; i++) {
            pn[i] = p[i] - (1 << h);
            if (pn[i] < 0)
                pn[i] += n;
        }
        fill(cnt.begin(), cnt.begin() + classes, 0);
        for (int i = 0; i < n; i++)
            cnt[c[pn[i]]]++;
        for (int i = 1; i < classes; i++)
            cnt[i] += cnt[i-1];
        for (int i = n-1; i >= 0; i--)
            p[--cnt[c[pn[i]]]] = pn[i];
        cn[p[0]] = 0;
        classes = 1;
        for (int i = 1; i < n; i++) {
            pair<int, int> cur = {c[p[i]], c[(p[i] + (1 << h)) % n]};
            pair<int, int> prev = {c[p[i-1]], c[(p[i-1] + (1 << h)) % n]};
            if (cur != prev)
                ++classes;
            cn[p[i]] = classes - 1;
        }
        c.swap(cn);
    }
    return p;
}

vector<int> suffix_array_construction(string s) {
    s += "$";
    vector<int> sorted_shifts = sort_cyclic_shifts(s);
    sorted_shifts.erase(sorted_shifts.begin());
    return sorted_shifts;
}
vector<int> lcp_construction(string const& s, vector<int> const& p) {
    int n = s.size();
    vector<int> rank(n, 0);
    for (int i = 0; i < n; i++)
        rank[p[i]] = i;

    int k = 0;
    vector<int> lcp(n-1, 0);
    for (int i = 0; i < n; i++) {
        if (rank[i] == n - 1) {
            k = 0;
            continue;
        }
        int j = p[rank[i] + 1];
        while (i + k < n && j + k < n && s[i+k] == s[j+k])
            k++;
        lcp[rank[i]] = k;
        if (k)
            k--;
    }
    return lcp;
}
int main() 
{
	string str;
	cin>>str;
	int n=str.length();
	vector<int>suff=suffix_array_construction(str);
	vector<int>lcp=lcp_construction(str,suff);
	int p=n*(n+1)/2;
	int sum=0;
	for(int i=1;i<n;i++)
		sum+=lcp[i];
	cout<<p-sum<<endl;
}

We can see how we can optimise an algorithm from O(n^3) to O(nlogn)! Isnt this just magical 😊.

References

Suffix Array Suffix Tries