Edit Distance

Edit distance is quite useful to illustrate how we can optimize backtracking via memoization. Here we have two words word1 and word2, and the problem is about finding the minimum number of operations required to convert word1 to word2. Note that only three operations are permitted on the word —  1) Insert a character  2) Delete a character 3) Replace a character.

For example, converting “tea” to “set” would require at least 2 operations.

  1. “tea” -> Replace ‘t’ with ‘s’
  2. “sea” -> Replace ‘a’ with ‘t’
  3. “set”

Calculating this minimum number of operations would mandate that we attempt all the three possible conversion methods at every relevant location. Essentially, whenever there is a character mismatch we attempt to either delete, insert or replace. The whole approach can be visualized as a tree.

btrack

As you can see, the red connections are character delete operations, while green is replace and brown is insert. The algorithm basically runs a depth first search on a graph like above. Once we reach the leaf node, the total number of operations required for that sequence would be known. Recursive C function implementing the idea is given below.

int ml(char* w1, int o1, int l1, char* w2, int o2, int l2)
{
    int od, oi, os, r = 0;

    /* End of the string, return the number of remaining characters
    in the other string */
    if (o1 == l1 || o2 == l2)
        return (o1 == l1) ? l2 - o2 : l1 - o1;

    /* Find minimum distance for strings at [o1..l1] & [o2..l2].
       If characters are unequal, try insert and delete */
    if (w1[o1] != w2[o2])
    {
       od = ml(w1, o1 + 1, l1, w2, o2, l2) + 1;
       oi = ml(w1, o1, l1, w2, o2 + 1, l2) + 1;
       r++;
    }

    /* Explore the case where character is replaced or
       where they are equal. Finally, pick the minimum. */
    os = ml(w1, o1 + 1, l1, w2, o2 + 1, l2) + r;
    return MIN_INT(os, od, oi);
}

int minDistance(char *word1, char * word2)
{
    int l1 = strlen(word1), l2 = strlen(word2), md = 0;
    md = ml(word1, 0, l1, word2, 0, l2);
    return md;
}

Please note that the nodes illustrate only that remaining part of the string which needs to be evaluated.  For example, after deletion of ‘t’, we have [“ea” , “set”]. Also, quite evident that there are several nodes in the above tree which contain the same set of strings. For example, [“ea”, “et”] is repeated thrice. An obvious optimization would be to avoid repeated traversing of these similar sub-trees by using memoization. Basically we cache the result of the first traversal and reuse it during subsequent calls. Now with this optimization we have a much sparser tree structure.

EditDistance

Now the nodes are non-repeating and the implementation with this optimization is given below. Note that the memoization array is indexed based on the combination of the string offsets. For example, the number of operations required to convert sub-string “a” (w1[2]) into “et” (w2[1]) can be stored and later accessed from location dp[2][1].

void ml(char* w1, int o1, int l1, char* w2, int o2, int l2,
        unsigned char *dp)
{
    int offst = o1 * (l2 + 1) + o2, od = (o1 + 1) * (l2 + 1) + o2;
    int oi = o1 * (l2 + 1) + o2 + 1, r = 0;

    /* End of the string, return the number of remaining characters
    in the other string */
    if (o1 == l1 || o2 == l2)
        dp[offst] = (o1 == l1) ? l2 - o2 : l1 - o1;

    /* If the minimum distance for strings at [o1..l1] & [o2..l2] is
       not already calculated, then figure out the same */
    else if (dp[offst] == MAX_CHAR)
    {
        /* Characters are unequal, try insert and delete */
        if (w1[o1] != w2[o2])
        {
            ml(w1, o1 + 1, l1, w2, o2, l2, dp);
            ml(w1, o1, l1, w2, o2 + 1, l2, dp);
            r++;
        }

        /* Explore the case where character is replaced or
           where they are equal. Finally, pick the minimum. */
        ml(w1, o1 + 1, l1, w2, o2 + 1, l2, dp);
        dp[offst] = (r > 0) ? (MIN_INT(dp[od + 1] + r, dp[od] + r,
                    dp[oi] + r)) : dp[od + 1];
    }
}

int minDistance(char *word1, char * word2)
{
    int l1 = strlen(word1), l2 = strlen(word2), md = 0;
    unsigned char *dp = malloc((l1 + 1) * (l2 + 1) * sizeof(unsigned char));

    /* Validate */
    if (!dp) return 0;

    /* Initialize buffer */
    memset(dp, MAX_CHAR, (l1 + 1) * (l2 + 1));

    /* Get the minimum distance */
    ml(word1, 0, l1, word2, 0, l2, dp);
    md = dp[0];
    free(dp);
    return md;
}

Full source code can be accessed at bit bucket link

Advertisements