Substring with Concatenation of All Words

Bioinformatics could be one of those areas where problems like ‘substring with concatenation of all words ‘ might be relevant. Here we are actually looking for a sequence within a larger string, and this sequence has to be a concatenation of a given set of words, all of equal length. For example, for the string “”wordgoodgoodgoodbestword”” and the given set of words [“word”,”good”,”best”,”good”], the solution should return index [8]. Note that this concatenation of words can happen in any order. Even though the previous example had only one index, the solution should identify all the instances of such sequences. For example, given “aaaaaaaa” and word list [“aa”,”aa”,”aa”], the solution should return indices [0, 1, 2].

 

Substring Sequence
Substring Sequence

 

A brute force approach would involve scanning every offset of the string seeking for matching words within the list. So for a string of length n and a word list of size m containing words of length k, the complexity would be O(n x m x m x k). An improvement over the brute force approach would be to sort the word array, now the time complexity becomes O(n x m x Log(m) x k). So the primary problem is about quickly identifying whether there is a matching word corresponding to a substring at some offset. Seems like a data structure like trie is ideal for such an operation.

 

Trie
Trie for [“word”,”good”,”best”,”good”]

A trie for the list of words [“word”,”good”,”best”,”good”] is illustrated above, we can use this to improve the speed of the sequential scan operation. From the data structure it’s evident that the time complexity for a string search would be O(k). Note that such a tree needs to also handle duplicate words, hence the variable “count” which keeps track of the number of instances of that word within the list. In case of the above example, “good”  happens to have a count equal to 2. Use of trie can further reduce the overall time complexity to O(n x m x k). But the obvious trade-off is the overhead of creating this tree.

Structure of a trie node used in this algorithm is illustrated below.

/********************************/
/* Trie Node */
/********************************/
struct tnode
{
 struct tnode *tc[TRIE_SZ]; // child nodes
 int tcount;                // total number of instances of a word
 int count;                 // decremented during searching
 struct tnode *next;        // linked list node
};

 

Each trie node can have up to 26 children, please note that here we assume the comparison is not case sensitive, otherwise we would need 52. Variable ‘tcount’ records the total number of instances of a word on the list. And the variable ‘count’ is used for keeping track of the number of times a word is encountered on the main string. It is initialized with the same value as ‘tcount’ but decremented during the scan whenever the associated word is found on the main string.

Consider scanning of the string “wordgoodgoodgoodbestword”, ‘count’ for the word “good” would be decremented to 1 as soon as it’s encountered for the first time at the offset 4. So when ‘count’ reaches zero, it actually means that a valid sub-string cannot have any more instances of that word. Here as soon as the algorithms encounters the third instance of word “good” it would realize that the sub-string starting at 0 is invalid, because even though “good” is present in the list its ‘count’ is now zero. So now we want to restart the scanning from the next offset 1.

Before restarting this scan we need to also revert the variable ‘count’ to ‘tcount’. This would mean that we have to keep track of the nodes decremented during the last scan. Here we can utilize the ‘next’ pointer field.

 

Trie Next Pointer
Trie Next Pointer

 

During the previously failed scan which started at offset 0 our algorithm would have used the next pointer to generate a linked list of encountered words. The above example would generate a link from “word” to “good”. Please note that during the scan the word “good” was encountered after “word”, hence the link in that direction. Now we can run through this linked list and quickly reset the counters which were decremented. Obviously this linked list would be dismantled as soon as the reset is done. It’s merely an optimization to quickly revert the trie to its original pristine state.

Code used for creating and searching a trie is given below:

/***********************************************************************/
/* add: Add a new string                                               */
/*                                                                     */
/***********************************************************************/
struct tnode *add(struct tnode *node, char *str, struct mstack *m)
{
    int index = CHAR_TO_OFFST(*str);

    /* If the node does not exist, then allocate the same */
    if (!node)
    {
        /* Allocate */
        node = m_alloc(m, sizeof(struct tnode));
        if (!node)
            return NULL;
    }

    /* Continue populating the trie */
    if (*str != '\0')
        node->tc[index] = add(node->tc[index], str + 1, m);

    /* If this is the last character, then set the count */
    else
       node->tcount = ++node->count;

    /* Return the node */
    return node;
}

/***********************************************************************/
/* search: Search for a string                                         */
/*                                                                     */
/***********************************************************************/
struct tnode *search(struct tnode *node, char *str)
{
    int index = CHAR_TO_OFFST(*str);

    /* If this is the last character of the string,
    then return node */
    if (*str == '\0')
    {
        if (node->tcount > 0)
            return node;
        else
            return NULL;
    }

    /* Ensure the character is valid and it's present in the trie */
    if (node->tc[index] == NULL)
        return NULL;

    /* Continue the search */
    return search(node->tc[index], str + 1);
}

/***********************************************************************/
/* create_trie: Create a trie                                          */
/*                                                                     */
/***********************************************************************/
struct tnode *create_trie(char **ch, int wlen, struct mstack *m)
{
    struct tnode *r = NULL;
    int i;

    /* Allocate the trie */
    for (i = 0; i < wlen; ++i)
    {
        /* If the root is not set, then initialize the same */
        if (!r)
            r = add(NULL, ch[i], m);

        /* Else continue the allocation */
        else if (!add(r, ch[i], m))
            return NULL;
    }

    /* Return the root */
    return r;
}

 

The function which implements the main logic is given below.

/***********************************************************************/
/* You are given a string, s, and a list of words, words, that are all */
/* of the same length. Find all starting indices of substring(s) in s  */
/* that is a concatenation of each word in words exactly once and      */
/* without any intervening characters.                                 */
/* For example, given:                                                 */
/* s: "barfoothefoobarman"                                             */
/* words: ["foo", "bar"]                                               */
/*                                                                     */
/* You should return the indices: [0,9].                               */
/* (order does not matter).                                            */
/*                                                                     */
/***********************************************************************/
int* findSubstring(char* s, char** words, int wordsSize, int *returnSize)
{
    struct tnode *root = NULL, *head = NULL, *node = NULL;;
    int wlen, slen, i, j, tw = 0, wsize, s_end, arr_off = 0;
    int *ret_arr = NULL;
    struct mstack m = {NULL};
    struct tnode **m_arr = NULL;
    struct tnode dummy;

    /* Maintain Sanity */
    *returnSize = 0;
    if (!s || !words || !wordsSize || !returnSize)
        return NULL;

    /* Calculate the input array size */
    wlen = strlen(words[0]);
    slen = strlen(s);

    /* Initialize variables */
    wsize = wlen * wordsSize;
    s_end = slen - wsize;

    /* Initialize dummy to zero */
    dummy.count = dummy.tcount = 0;

    /* Allocate a trie for the array */
    if (s_end >= 0)
    {
        /* Allocate memory for simple allocator */
        if (!m_stack_alloc(&m, SEGMENT_COUNT, SEGMENT_SIZE))
            goto subs_exit;

        /* Memoization Array */
        m_arr = calloc(slen, sizeof(struct tnode *));
        if (!m_arr)
            goto subs_exit;

        /* Create trie */
        root = create_trie(words, wordsSize, &m);
        if (!root)
            goto subs_exit;
    }

    /* Loop as long as there is a possibility for a match */
    for (i = 0; i <= s_end; ++i)
    {
        /* Loop checking whether the substring at this location is a
        concatenation of the words */
        for (j = i; j <= slen - wlen; j += wlen)
        {
            char c = s[j + wlen];
            struct tnode *tn = m_arr[j];

            /* If there is no hit, then search the trie */
            if (!tn)
            {
                /* Create a NULL terminating condition */
                s[j + wlen] = '\0';

                /* If the word is not found then, set the value to
                dummy */
                if ((tn = search(root, &s[j])) == NULL)
                    tn = &dummy;

                /* Replace the character */
                s[j + wlen] = c;

                /* Assign the pointer to the memoization array */
                m_arr[j] = tn;
            }

            /* If it's not found, then break */
            if (!tn->count)
                break;

            /* Decrement the count */
            --tn->count;

            /* Initiate the linked list head */
            if (!head)
                node = head = tn;

            /* Add the new node only if it's not already a part of the
            list */
            else if ((!tn->next) && (node != tn))
            {
                node->next = tn;
                node = tn;
            }

            /* Increment the hit count */
            tw++;

            /* If all the words were found, then break*/
            if (tw == wordsSize)
            {
                /* Make the necessary dynamic allocation */
                if ((!ret_arr) && ((ret_arr = malloc(sizeof(int) * slen)) == NULL))
                    goto subs_exit;

                /* Save the offset, increment the index and break */
                ret_arr[arr_off] = i;
                arr_off++;
                break;
            }
        }

        /* Reset the list */
        if (head)
        {
            reset_list(head);
            head = NULL;
            tw = 0;
        }
    }

    /* Free the trie, memoization array, assign the return size */
subs_exit:
    m_stack_free(&m);
    if (m_arr)
        free(m_arr);
    *returnSize = arr_off;
    return ret_arr;
}

As you can see the algorithm also implements memoization related optimization. Other dynamic memory allocation related changes and rest of the code can be accessed from the Bitbucket repo 

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s