Longest Common Subsequence

"""
LCS Problem Statement: Given two sequences, find the length of longest subsequence
present in both of them.  A subsequence is a sequence that appears in the same relative
order, but not necessarily continuous.
Example:"abc", "abg" are subsequences of "abcdefgh".
"""


def longest_common_subsequence(x: str, y: str):
    """
    Finds the longest common subsequence between two strings. Also returns the
    The subsequence found

    Parameters
    ----------

    x: str, one of the strings
    y: str, the other string

    Returns
    -------
    L[m][n]: int, the length of the longest subsequence. Also equal to len(seq)
    Seq: str, the subsequence found

    >>> longest_common_subsequence("programming", "gaming")
    (6, 'gaming')
    >>> longest_common_subsequence("physics", "smartphone")
    (2, 'ph')
    >>> longest_common_subsequence("computer", "food")
    (1, 'o')
    >>> longest_common_subsequence("", "abc")  # One string is empty
    (0, '')
    >>> longest_common_subsequence("abc", "")  # Other string is empty
    (0, '')
    >>> longest_common_subsequence("", "")  # Both strings are empty
    (0, '')
    >>> longest_common_subsequence("abc", "def")  # No common subsequence
    (0, '')
    >>> longest_common_subsequence("abc", "abc")  # Identical strings
    (3, 'abc')
    >>> longest_common_subsequence("a", "a")  # Single character match
    (1, 'a')
    >>> longest_common_subsequence("a", "b")  # Single character no match
    (0, '')
    >>> longest_common_subsequence("abcdef", "ace")  # Interleaved subsequence
    (3, 'ace')
    >>> longest_common_subsequence("ABCD", "ACBD")  # No repeated characters
    (3, 'ABD')
    """
    # find the length of strings

    assert x is not None
    assert y is not None

    m = len(x)
    n = len(y)

    # declaring the array for storing the dp values
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if x[i - 1] == y[j - 1] else 0

            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)

    seq = ""
    i, j = m, n
    while i > 0 and j > 0:
        match = 1 if x[i - 1] == y[j - 1] else 0

        if dp[i][j] == dp[i - 1][j - 1] + match:
            if match == 1:
                seq = x[i - 1] + seq
            i -= 1
            j -= 1
        elif dp[i][j] == dp[i - 1][j]:
            i -= 1
        else:
            j -= 1

    return dp[m][n], seq


if __name__ == "__main__":
    a = "AGGTAB"
    b = "GXTXAYB"
    expected_ln = 4
    expected_subseq = "GTAB"

    ln, subseq = longest_common_subsequence(a, b)
    print("len =", ln, ", sub-sequence =", subseq)
    import doctest

    doctest.testmod()
이 알고리즘에 대해

Problem Statement

Given two strings S and T, find the length of the longest common subsequence (LCS).

Approach

Let the dp[i][j] be the length of the longest common subsequence of prefixes S[1..i] and T[1..j]. Our answer (the length of LCS) is dp[|S|][|T|] since the prefix of the length of string is the string itself.

Both dp[0][i] and dp[i][0] are 0 for any i since the LCS of empty prefix and anything else is an empty string.

Now let's try to calculate dp[i][j] for any i, j. Let's say S[1..i] = *A and T[1..j] = *B where * stands for any sequence of letters (could be different for S and T), A stands for any letter and B stands for any letter different from A. Since A != B, our LCS doesn't include A or B as a last character. So we could try to throw away A or B character. If we throw A, our LCS length will be dp[i - 1][j] (since we have prefixes S[1..i - 1] and T[1..j]). If we try to throw B character, we will have prefixes S[1..i] and T[1..j - 1] so the length of LCS will be dp[i][j - 1]. As we are looking for the Longest common subsequence, we will pick the maximum value from dp[i][j - 1] and dp[i - 1][j].

But what if S[1..i] = *A and T[1..j] = *A? We could say that the LCS of our prefixes is LCS of prefixes S[1..i - 1] and T[1..j - 1] plus the letter A. So dp[i][j] = dp[i - 1][j - 1] + 1 if S[i] = T[j].

We could see that we can fill our dp table row by row, column by column. So our algorithm will be like:

  • Let's say that we have strings S of the length N and T of the length M (numbered from 1). Let's create the table dp of size (N + 1) x (M + 1) numbered from 0.
  • Let's fill the 0th row and the 0th column of dp with 0.
  • Then, we follow the algorithm:
for i in range(1..N):
    for j in range(1..M):
        if(S[i] == T[j])
            dp[i][j] = dp[i - 1][j - 1] + 1
        else
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

Time Complexity

O(N * M) In any case

Space Complexity

O(N * M) - simple implementation O(min {N, M}) - two-layers implementation (as dp[i][j] depends on only i-th and i-th layers, we coudld store only two layers).

Example

Let's say we have strings ABCB and BBCB. We will build such a table:

# # A B C B
# 0 0 0 0 0
B 0 ? ? ? ?
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?

Now we will start to fill our table from 1st row. Since S[1] = A and T[1] = B, the dp[1][1] will be the maximal value of dp[0][1] = 0 and dp[1][0] = 0. So dp[1][1] = 0. But now S[2] = B = T[1], so dp[1][2] = dp[0][1] + 1 = 1. dp[1][3] is 1 since A != C and we pick max{dp[1][2], dp[0][3]}. And dp[1][4] = dp[0][3] + 1 = 1.

# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?

Now let's fill the other part of the table:

# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 0 1 1 2
C 0 0 1 2 2
B 0 0 1 2 3

So the length of LCS is dp[4][4] = 3.

Video Explanation

Video explanation by Tushar Roy