Edit Distance

Edit distance is quite useful to illustrate how we can optimize backtracking via memoization. Here we have two words word1 and word2, and the problem is about finding the minimum number of operations required to convert word1 to word2. Note that only three operations are permitted on the word —  1) Insert a character  2) Delete a character 3) Replace a character.

For example, converting “tea” to “set” would require at least 2 operations.

  1. “tea” -> Replace ‘t’ with ‘s’
  2. “sea” -> Replace ‘a’ with ‘t’
  3. “set”

Calculating this minimum number of operations would mandate that we attempt all the three possible conversion methods at every relevant location. Essentially, whenever there is a character mismatch we attempt to either delete, insert or replace. The whole approach can be visualized as a tree.

btrack

As you can see, the red connections are character delete operations, while green is replace and brown is insert. The algorithm basically runs a depth first search on a graph like above. Once we reach the leaf node, the total number of operations required for that sequence would be known. Recursive C function implementing the idea is given below.

int ml(char* w1, int o1, int l1, char* w2, int o2, int l2)
{
    int od, oi, os, r = 0;

    /* End of the string, return the number of remaining characters
    in the other string */
    if (o1 == l1 || o2 == l2)
        return (o1 == l1) ? l2 - o2 : l1 - o1;

    /* Find minimum distance for strings at [o1..l1] & [o2..l2].
       If characters are unequal, try insert and delete */
    if (w1[o1] != w2[o2])
    {
       od = ml(w1, o1 + 1, l1, w2, o2, l2) + 1;
       oi = ml(w1, o1, l1, w2, o2 + 1, l2) + 1;
       r++;
    }

    /* Explore the case where character is replaced or
       where they are equal. Finally, pick the minimum. */
    os = ml(w1, o1 + 1, l1, w2, o2 + 1, l2) + r;
    return MIN_INT(os, od, oi);
}

int minDistance(char *word1, char * word2)
{
    int l1 = strlen(word1), l2 = strlen(word2), md = 0;
    md = ml(word1, 0, l1, word2, 0, l2);
    return md;
}

Please note that the nodes illustrate only that remaining part of the string which needs to be evaluated.  For example, after deletion of ‘t’, we have [“ea” , “set”]. Also, quite evident that there are several nodes in the above tree which contain the same set of strings. For example, [“ea”, “et”] is repeated thrice. An obvious optimization would be to avoid repeated traversing of these similar sub-trees by using memoization. Basically we cache the result of the first traversal and reuse it during subsequent calls. Now with this optimization we have a much sparser tree structure.

EditDistance

Now the nodes are non-repeating and the implementation with this optimization is given below. Note that the memoization array is indexed based on the combination of the string offsets. For example, the number of operations required to convert sub-string “a” (w1[2]) into “et” (w2[1]) can be stored and later accessed from location dp[2][1].

void ml(char* w1, int o1, int l1, char* w2, int o2, int l2,
        unsigned char *dp)
{
    int offst = o1 * (l2 + 1) + o2, od = (o1 + 1) * (l2 + 1) + o2;
    int oi = o1 * (l2 + 1) + o2 + 1, r = 0;

    /* End of the string, return the number of remaining characters
    in the other string */
    if (o1 == l1 || o2 == l2)
        dp[offst] = (o1 == l1) ? l2 - o2 : l1 - o1;

    /* If the minimum distance for strings at [o1..l1] & [o2..l2] is
       not already calculated, then figure out the same */
    else if (dp[offst] == MAX_CHAR)
    {
        /* Characters are unequal, try insert and delete */
        if (w1[o1] != w2[o2])
        {
            ml(w1, o1 + 1, l1, w2, o2, l2, dp);
            ml(w1, o1, l1, w2, o2 + 1, l2, dp);
            r++;
        }

        /* Explore the case where character is replaced or
           where they are equal. Finally, pick the minimum. */
        ml(w1, o1 + 1, l1, w2, o2 + 1, l2, dp);
        dp[offst] = (r > 0) ? (MIN_INT(dp[od + 1] + r, dp[od] + r,
                    dp[oi] + r)) : dp[od + 1];
    }
}

int minDistance(char *word1, char * word2)
{
    int l1 = strlen(word1), l2 = strlen(word2), md = 0;
    unsigned char *dp = malloc((l1 + 1) * (l2 + 1) * sizeof(unsigned char));

    /* Validate */
    if (!dp) return 0;

    /* Initialize buffer */
    memset(dp, MAX_CHAR, (l1 + 1) * (l2 + 1));

    /* Get the minimum distance */
    ml(word1, 0, l1, word2, 0, l2, dp);
    md = dp[0];
    free(dp);
    return md;
}

Full source code can be accessed at bit bucket link

Advertisements

Substring with Concatenation of All Words

Bioinformatics could be one of those areas where problems like ‘substring with concatenation of all words ‘ might be relevant. Here we are actually looking for a sequence within a larger string, and this sequence has to be a concatenation of a given set of words, all of equal length. For example, for the string “”wordgoodgoodgoodbestword”” and the given set of words [“word”,”good”,”best”,”good”], the solution should return index [8]. Note that this concatenation of words can happen in any order. Even though the previous example had only one index, the solution should identify all the instances of such sequences. For example, given “aaaaaaaa” and word list [“aa”,”aa”,”aa”], the solution should return indices [0, 1, 2].

 

Substring Sequence
Substring Sequence

 

A brute force approach would involve scanning every offset of the string seeking for matching words within the list. So for a string of length n and a word list of size m containing words of length k, the complexity would be O(n x m x m x k). An improvement over the brute force approach would be to sort the word array, now the time complexity becomes O(n x m x Log(m) x k). So the primary problem is about quickly identifying whether there is a matching word corresponding to a substring at some offset. Seems like a data structure like trie is ideal for such an operation.

 

Trie
Trie for [“word”,”good”,”best”,”good”]

A trie for the list of words [“word”,”good”,”best”,”good”] is illustrated above, we can use this to improve the speed of the sequential scan operation. From the data structure it’s evident that the time complexity for a string search would be O(k). Note that such a tree needs to also handle duplicate words, hence the variable “count” which keeps track of the number of instances of that word within the list. In case of the above example, “good”  happens to have a count equal to 2. Use of trie can further reduce the overall time complexity to O(n x m x k). But the obvious trade-off is the overhead of creating this tree.

Structure of a trie node used in this algorithm is illustrated below.

/********************************/
/* Trie Node */
/********************************/
struct tnode
{
 struct tnode *tc[TRIE_SZ]; // child nodes
 int tcount;                // total number of instances of a word
 int count;                 // decremented during searching
 struct tnode *next;        // linked list node
};

 

Each trie node can have up to 26 children, please note that here we assume the comparison is not case sensitive, otherwise we would need 52. Variable ‘tcount’ records the total number of instances of a word on the list. And the variable ‘count’ is used for keeping track of the number of times a word is encountered on the main string. It is initialized with the same value as ‘tcount’ but decremented during the scan whenever the associated word is found on the main string.

Consider scanning of the string “wordgoodgoodgoodbestword”, ‘count’ for the word “good” would be decremented to 1 as soon as it’s encountered for the first time at the offset 4. So when ‘count’ reaches zero, it actually means that a valid sub-string cannot have any more instances of that word. Here as soon as the algorithms encounters the third instance of word “good” it would realize that the sub-string starting at 0 is invalid, because even though “good” is present in the list its ‘count’ is now zero. So now we want to restart the scanning from the next offset 1.

Before restarting this scan we need to also revert the variable ‘count’ to ‘tcount’. This would mean that we have to keep track of the nodes decremented during the last scan. Here we can utilize the ‘next’ pointer field.

 

Trie Next Pointer
Trie Next Pointer

 

During the previously failed scan which started at offset 0 our algorithm would have used the next pointer to generate a linked list of encountered words. The above example would generate a link from “word” to “good”. Please note that during the scan the word “good” was encountered after “word”, hence the link in that direction. Now we can run through this linked list and quickly reset the counters which were decremented. Obviously this linked list would be dismantled as soon as the reset is done. It’s merely an optimization to quickly revert the trie to its original pristine state.

Code used for creating and searching a trie is given below:

/***********************************************************************/
/* add: Add a new string                                               */
/*                                                                     */
/***********************************************************************/
struct tnode *add(struct tnode *node, char *str, struct mstack *m)
{
    int index = CHAR_TO_OFFST(*str);

    /* If the node does not exist, then allocate the same */
    if (!node)
    {
        /* Allocate */
        node = m_alloc(m, sizeof(struct tnode));
        if (!node)
            return NULL;
    }

    /* Continue populating the trie */
    if (*str != '\0')
        node->tc[index] = add(node->tc[index], str + 1, m);

    /* If this is the last character, then set the count */
    else
       node->tcount = ++node->count;

    /* Return the node */
    return node;
}

/***********************************************************************/
/* search: Search for a string                                         */
/*                                                                     */
/***********************************************************************/
struct tnode *search(struct tnode *node, char *str)
{
    int index = CHAR_TO_OFFST(*str);

    /* If this is the last character of the string,
    then return node */
    if (*str == '\0')
    {
        if (node->tcount > 0)
            return node;
        else
            return NULL;
    }

    /* Ensure the character is valid and it's present in the trie */
    if (node->tc[index] == NULL)
        return NULL;

    /* Continue the search */
    return search(node->tc[index], str + 1);
}

/***********************************************************************/
/* create_trie: Create a trie                                          */
/*                                                                     */
/***********************************************************************/
struct tnode *create_trie(char **ch, int wlen, struct mstack *m)
{
    struct tnode *r = NULL;
    int i;

    /* Allocate the trie */
    for (i = 0; i < wlen; ++i)
    {
        /* If the root is not set, then initialize the same */
        if (!r)
            r = add(NULL, ch[i], m);

        /* Else continue the allocation */
        else if (!add(r, ch[i], m))
            return NULL;
    }

    /* Return the root */
    return r;
}

 

The function which implements the main logic is given below.

/***********************************************************************/
/* You are given a string, s, and a list of words, words, that are all */
/* of the same length. Find all starting indices of substring(s) in s  */
/* that is a concatenation of each word in words exactly once and      */
/* without any intervening characters.                                 */
/* For example, given:                                                 */
/* s: "barfoothefoobarman"                                             */
/* words: ["foo", "bar"]                                               */
/*                                                                     */
/* You should return the indices: [0,9].                               */
/* (order does not matter).                                            */
/*                                                                     */
/***********************************************************************/
int* findSubstring(char* s, char** words, int wordsSize, int *returnSize)
{
    struct tnode *root = NULL, *head = NULL, *node = NULL;;
    int wlen, slen, i, j, tw = 0, wsize, s_end, arr_off = 0;
    int *ret_arr = NULL;
    struct mstack m = {NULL};
    struct tnode **m_arr = NULL;
    struct tnode dummy;

    /* Maintain Sanity */
    *returnSize = 0;
    if (!s || !words || !wordsSize || !returnSize)
        return NULL;

    /* Calculate the input array size */
    wlen = strlen(words[0]);
    slen = strlen(s);

    /* Initialize variables */
    wsize = wlen * wordsSize;
    s_end = slen - wsize;

    /* Initialize dummy to zero */
    dummy.count = dummy.tcount = 0;

    /* Allocate a trie for the array */
    if (s_end >= 0)
    {
        /* Allocate memory for simple allocator */
        if (!m_stack_alloc(&m, SEGMENT_COUNT, SEGMENT_SIZE))
            goto subs_exit;

        /* Memoization Array */
        m_arr = calloc(slen, sizeof(struct tnode *));
        if (!m_arr)
            goto subs_exit;

        /* Create trie */
        root = create_trie(words, wordsSize, &m);
        if (!root)
            goto subs_exit;
    }

    /* Loop as long as there is a possibility for a match */
    for (i = 0; i <= s_end; ++i)
    {
        /* Loop checking whether the substring at this location is a
        concatenation of the words */
        for (j = i; j <= slen - wlen; j += wlen)
        {
            char c = s[j + wlen];
            struct tnode *tn = m_arr[j];

            /* If there is no hit, then search the trie */
            if (!tn)
            {
                /* Create a NULL terminating condition */
                s[j + wlen] = '\0';

                /* If the word is not found then, set the value to
                dummy */
                if ((tn = search(root, &s[j])) == NULL)
                    tn = &dummy;

                /* Replace the character */
                s[j + wlen] = c;

                /* Assign the pointer to the memoization array */
                m_arr[j] = tn;
            }

            /* If it's not found, then break */
            if (!tn->count)
                break;

            /* Decrement the count */
            --tn->count;

            /* Initiate the linked list head */
            if (!head)
                node = head = tn;

            /* Add the new node only if it's not already a part of the
            list */
            else if ((!tn->next) && (node != tn))
            {
                node->next = tn;
                node = tn;
            }

            /* Increment the hit count */
            tw++;

            /* If all the words were found, then break*/
            if (tw == wordsSize)
            {
                /* Make the necessary dynamic allocation */
                if ((!ret_arr) && ((ret_arr = malloc(sizeof(int) * slen)) == NULL))
                    goto subs_exit;

                /* Save the offset, increment the index and break */
                ret_arr[arr_off] = i;
                arr_off++;
                break;
            }
        }

        /* Reset the list */
        if (head)
        {
            reset_list(head);
            head = NULL;
            tw = 0;
        }
    }

    /* Free the trie, memoization array, assign the return size */
subs_exit:
    m_stack_free(&m);
    if (m_arr)
        free(m_arr);
    *returnSize = arr_off;
    return ret_arr;
}

As you can see the algorithm also implements memoization related optimization. Other dynamic memory allocation related changes and rest of the code can be accessed from the Bitbucket repo 

Largest Rectangle in Histogram

Largest rectangle in a histogram is an interesting puzzle and can be solved in several different ways. One of the most popular O(n) method involves usage of stack, another one employs memoization to improve on the brute force approach, and then there is also a slightly slower O(nLog(n)) method employing segment tree. Evidently this problem seems quite useful for illustrating some of the data structure and programming concepts.

Below diagram is an illustration of such a histogram represented by an array with heights { 2, 1 ,5, 6, 2, 3}.

Largest Rectangle in Histogram
Largest Rectangle in Histogram

The Problem

We are essentially seeking the largest rectangle within a histogram. And the two variables determining the area of such a rectangle would be the minimum height among the included bars and the total number of bars. For example, in the above histogram the maximum area would be formed by two bars at the offset 2 and 3 with a minimum height equal to 5, so 5 x 2 = 10 .

Area = Minimum_Height x Number of bars

The search for that elusive maximum area requires us to scan all the possible rectangles in the histogram. This can either happen by picking a bar and then seeking the maximum range where its own height is the minimum. Or by picking a range and then finding the minimum height. In other words, we pick a value for one variable in the above equation and then proceed to find a valid measurement for the other one.

O(n) Approach

If we sequentially pick each element and then locate the smaller bar to the left and right, then we would have the width of the rectangle corresponding to each height/bar. In this manner we can eventually figure out the largest rectangle. A brute force O(n^2) approach would simply involve two nested loops, outer loop to sequentially pick the element and the inner loop to scan the values to the left and right of that selected bar.

Maximum area measurements for each bar
Maximum area measurements for each bar

The above diagram illustrates the basic idea behind this algorithm. Performance would eventually depend on how quickly we can identify this width corresponding to each rectangle. We can explore how the usage of stack and memoization can essentially improve the time taken for this scan.

Stack: The idea is to push array elements into a stack as long as the present top of the stack is shorter than the element being pushed. If the stack top is larger than the element, then the present element provides the right hand side offset for the width, so now we can calculate the area of the rectangle with top element as the height. Note that the left offset can be identified by checking the stack element right below the top. After calculating the area, pop the stack. Repeat this operation as long as the top of the stack is larger than the present element.  So at any instant the stack would only contain those elements whose left offset is identified but not the right.

Stack Illustration 1
Stack Illustration 1

 

Stack Illustration 2
Stack Illustration 2

The above diagrams illustrate how the stack operations would scan the array { 2, 1, 5, 6, 2, 3}. As you can see, space complexity would also be O(n).

    /* Push the first element into the stack */
    spush(&s, &a[0]);
    b = a[0];

    /* Scan all the elements */
    while (i < n)
    {
        /* The bar is higher than stack top, push
        into the stack */
        if (a[i] >= b)
        {
            spush(&s, &a[i]);
            b = a[i];
        }

        /* Else we need to calculate the area spanned by the
        bar on the top of the stack. */
        else
        {
            /* Peek while the stack top is less than the present
            bar */
            while ((speek(&s, &offst, &b) == 0) && (b > a[i]))
            {
                int loffst, ht;

                /* Pop and then peek into the stack to get the
                left offset */
                spop (&s, &offst, &b);
                speek(&s, &loffst, &ht);

                /* Update the max rectangle */
                if (max_rectangle < (b * ((i - loffst - 1))))
                    max_rectangle = b * (i - loffst - 1);
            }
            /* Push the bar into the stack */
            spush(&s, &a[i]);
            b = a[i];
        }
        /* Move to the next bar */
        ++i;
    }

    /* Pop the remaining bars */
    while (spop(&s, &offst, &b) == 0)
    {
        int loffst, ht;

        /* Update the area */
        speek(&s, &loffst, &ht);
        if (max_rectangle < (b * (i - loffst - 1)))
            max_rectangle = b * (i - loffst - 1);
    }
    /* Return maximum area */
    free(s.s);
    return max_rectangle;

Relevant piece of code is provided above, for more details please refer to the source code repository link.

Memoization: A brute force approach to this problem would require us to scan the left and right sub-arrays for each element looking for the relevant shorter height. For example, the bar with height 1 in the above example would force scanning of the whole array. In that sense if all the bars were of equal height, then the brute force algorithm would always force n^2 comparisons. Memoization involves caching and utilizing the already scanned information for subsequent use. Seems like that approach is an ideal candidate here.

Memoization
Memoization

The above diagram illustrates how the relevant smaller right offset for each element can be recorded by scanning the array in the reverse order. Please note that the curved arrow pointers signify comparisons within the inner loop. The crucial aspect is that by using the already recorded offsets we can leapfrog certain locations which are not relevant for that element.

For example, with this new method, generation of offset values until the location 2 (having value 5) does not see any reduction in number of comparisons. But the calculation of right offset for value 1 at the array location 1 was reduced by less than half because of the already cached offset value. How? We know that the value 5 is greater than 1, and we also know the offset of the value which is smaller than 5 (ie: 2 at location 4). So, we can conveniently jump to location 4  and check whether the value at that offset (ie:2) is less than 1, the search ends if it is smaller, else we merely jump to the next recorded offset (ie:6, end of list).

Similarly the smaller left offset values can also be generated by scanning the array, but this time in the order from the start. With the allocated additional space for memoization, we manage to short-circuit the comparison sequence. Once we have the left and right offsets for all the elements, we can simply run through the array and calculate the maximum area.

   /* First element has no left smaller element, last element has no
    right smaller */
    smaller_left[0] = -1;
    smaller_right[n - 1] = n;
    for (i = 1; i < n; ++i)
    {
        /* Best case */
        if (a[i] > a[i - 1])
            smaller_left[i] = i - 1;

        /* Else scan the array */
        else
        {
            int x = i - 1;
            /* Look for the next smaller element towards the left */
            while ((x >= 0) && (a[x] >= a[i]))
                x = smaller_left[x];

            /* Found the offset for the next smaller element */
            smaller_left[i] = x;
        }
    }


    /* Generate the offsets for smaller elements to the right */
    for (i = n - 2; i >= 0; --i)
    {
        /* Best case */
        if (a[i + 1] < a[i])
            smaller_right[i] = i + 1;

        /* Else scan the array */
        else
        {
            int x = i + 1;

            /* Loop as long as the values at the offsets are greater
            than a[i] */
            while ((x < n) && (a[x] >= a[i]))
                x = smaller_right[x];

            /* Smaller to the right */
            smaller_right[i] = x;
        }
    }

    /* O(n) scan to get the maximum area */
    for (i = 0; i < n; i++)
    {
        int mi = a[i] * (smaller_right[i] - smaller_left[i] - 1);
        if (max_rectangle < mi)
            max_rectangle = mi;
    }

    /* Return maximum area */
    return max_rectangle;

Essential piece of code is given above, for more details please refer to the source code repository link.

O(nLog(n)

The third method is slightly more elaborate, it involves a segment tree. Since this data structure can be used for quickly finding the minimum value within a sub-array, we can use that property to implement a divide and conquer algorithm. In other words, armed with a segment tree hammer, this problem starts to look like a nail. The point is by using a segment tree the minimum height within any range can be found in logarithmic time . So here our approach would involve dividing the histogram while calculating the minimum height within each such divisions.

So first we start with the whole histogram, basically we pick the maximum width and then find the corresponding height. Then we divide the histogram based on the offset of the bar with this minimum height and recursively repeat the same operation on each sub-division. The idea is illustrated in the below diagram.

 

Divide and Conquer
Divide and Conquer
int seg_tree(int *a, int *arr, int s, int e, int len)
{
    int min, lmax, rmax, mmax;

    /* Recursion termination */
    if (s > e)
        return 0;

    /* Get the minimum value offset */
    min = search_sg_tree(arr, a, 0, s, e, 0, len - 1);

    /* Calculate the area with the minimum value within this offset */
    mmax = a[min] * (e - s + 1);

    /* Go to the left subtree */
    lmax = seg_tree(a, arr, s, min - 1, len);

    /* Go to the right subtree */
    rmax = seg_tree(a, arr, min + 1, e, len);

    /* Return the max */
    return (((mmax > lmax) ? mmax : lmax) > rmax) ?
            ((mmax > lmax) ? mmax : lmax) : rmax;
}

As you can see, depth of the recursive calls in the above divide and conquer method eventually depends on how the array actually gets divided during each call. If all the values are equal, then it will lead to the worst case scenario where the call stack depth is equivalent to n. In other words, the call stack memory usage complexity is O(n).

Segment Tree used for the above algorithm is given below:

Segment Tree
Segment Tree

 

We can implement such a segment tree using an array data structure, address of the left and right children can be easily generated using the formula (Parent_Offset * 2 + 1) &  (Parent_Offset * 2 + 2), quite like a heap. Below figure illustrates the above segment tree when it’s structured like an array.

Segment Tree Array
Segment Tree Array

As you can see, some of the locations remain unused. A tree of depth 4 can hold upto 16 elements, but this array requires only 11 locations. Generation of a segment tree is also quite similar to how an array would be converted to binary tree. Basically, we recursively generate the sub-trees while allowing the smaller child to rise to the top. Within such a tree, the smallest sub-tree would be the leaf nodes which would hold the array values and that also determines the terminating condition for recursion.

/***********************************************************************/
/* create_sg_tree: Create a segment tree from an array                 */
/*                                                                     */
/***********************************************************************/
int create_sg_tree(int *arr, int soff, int s, int e, int *a, int len)
{
    int mid;

    /* We are at the leaf node */
    if (s == e)
        return s;

    /* Calculate the middle of the array */
    mid = s + ((e - s + 1) / 2) - 1;

    /* Set the left subtree root node */
    arr[(soff * 2) + 1] = create_sg_tree(arr, (soff * 2) + 1, s,
                                         mid, a, len);

    /* Set the right subtree root node */
    arr[(soff * 2) + 2] = create_sg_tree(arr, (soff * 2) + 2,
                                         mid + 1, e, a, len);

    /* Allow the smaller value to rise to the top */
    return a[arr[(soff * 2) + 1]] > a[arr[(soff * 2) + 2]] ?
           arr[(soff * 2) + 2] : arr[(soff * 2) + 1];
}

For more details please refer to the source code repository.

Generation of the segment tree is O(Log(n)), and we sequentially pick n elements and find the smallest within the relevant range with Log(n) complexity. Overall O(nLog(n)).