"""
LCS Problem Statement: Given two sequences, find the length of longest subsequence
present in both of them. A subsequence is a sequence that appears in the same relative
order, but not necessarily continuous.
Example:"abc", "abg" are subsequences of "abcdefgh".
"""
def longest_common_subsequence(x: str, y: str):
"""
Finds the longest common subsequence between two strings. Also returns the
The subsequence found
Parameters
----------
x: str, one of the strings
y: str, the other string
Returns
-------
L[m][n]: int, the length of the longest subsequence. Also equal to len(seq)
Seq: str, the subsequence found
>>> longest_common_subsequence("programming", "gaming")
(6, 'gaming')
>>> longest_common_subsequence("physics", "smartphone")
(2, 'ph')
>>> longest_common_subsequence("computer", "food")
(1, 'o')
>>> longest_common_subsequence("", "abc") # One string is empty
(0, '')
>>> longest_common_subsequence("abc", "") # Other string is empty
(0, '')
>>> longest_common_subsequence("", "") # Both strings are empty
(0, '')
>>> longest_common_subsequence("abc", "def") # No common subsequence
(0, '')
>>> longest_common_subsequence("abc", "abc") # Identical strings
(3, 'abc')
>>> longest_common_subsequence("a", "a") # Single character match
(1, 'a')
>>> longest_common_subsequence("a", "b") # Single character no match
(0, '')
>>> longest_common_subsequence("abcdef", "ace") # Interleaved subsequence
(3, 'ace')
>>> longest_common_subsequence("ABCD", "ACBD") # No repeated characters
(3, 'ABD')
"""
# find the length of strings
assert x is not None
assert y is not None
m = len(x)
n = len(y)
# declaring the array for storing the dp values
dp = [[0] * (n + 1) for _ in range(m + 1)]
for i in range(1, m + 1):
for j in range(1, n + 1):
match = 1 if x[i - 1] == y[j - 1] else 0
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + match)
seq = ""
i, j = m, n
while i > 0 and j > 0:
match = 1 if x[i - 1] == y[j - 1] else 0
if dp[i][j] == dp[i - 1][j - 1] + match:
if match == 1:
seq = x[i - 1] + seq
i -= 1
j -= 1
elif dp[i][j] == dp[i - 1][j]:
i -= 1
else:
j -= 1
return dp[m][n], seq
if __name__ == "__main__":
a = "AGGTAB"
b = "GXTXAYB"
expected_ln = 4
expected_subseq = "GTAB"
ln, subseq = longest_common_subsequence(a, b)
print("len =", ln, ", sub-sequence =", subseq)
import doctest
doctest.testmod()
Given two strings S
and T
, find the length of the longest common subsequence (LCS).
Let the dp[i][j]
be the length of the longest common subsequence of prefixes S[1..i]
and T[1..j]
. Our answer (the length of LCS) is dp[|S|][|T|]
since the prefix of the length of string is the string itself.
Both dp[0][i]
and dp[i][0]
are 0
for any i
since the LCS of empty prefix and anything else is an empty string.
Now let's try to calculate dp[i][j]
for any i
, j
. Let's say S[1..i] = *A
and T[1..j] = *B
where *
stands for any sequence of letters (could be different for S
and T
), A
stands for any letter and B
stands for any letter different from A
. Since A != B
, our LCS doesn't include A
or B
as a last character. So we could try to throw away A
or B
character. If we throw A
, our LCS length will be dp[i - 1][j]
(since we have prefixes S[1..i - 1]
and T[1..j]
). If we try to throw B
character, we will have prefixes S[1..i]
and T[1..j - 1]
so the length of LCS will be dp[i][j - 1]
. As we are looking for the Longest common subsequence, we will pick the maximum value from dp[i][j - 1]
and dp[i - 1][j]
.
But what if S[1..i] = *A
and T[1..j] = *A
? We could say that the LCS of our prefixes is LCS of prefixes S[1..i - 1]
and T[1..j - 1]
plus the letter A
. So dp[i][j] = dp[i - 1][j - 1] + 1
if S[i] = T[j]
.
We could see that we can fill our dp
table row by row, column by column. So our algorithm will be like:
S
of the length N and T
of the length M (numbered from 1). Let's create the table dp
of size (N + 1) x (M + 1)
numbered from 0.dp
with 0.for i in range(1..N):
for j in range(1..M):
if(S[i] == T[j])
dp[i][j] = dp[i - 1][j - 1] + 1
else
dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
O(N * M)
In any case
O(N * M)
- simple implementation
O(min {N, M})
- two-layers implementation (as dp[i][j]
depends on only i-th and i-th layers, we coudld store only two layers).
Let's say we have strings ABCB
and BBCB
. We will build such a table:
# # A B C B
# 0 0 0 0 0
B 0 ? ? ? ?
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?
Now we will start to fill our table from 1st row. Since S[1] = A
and T[1] = B
, the dp[1][1]
will be the maximal value of dp[0][1] = 0
and dp[1][0] = 0
. So dp[1][1] = 0
. But now S[2] = B = T[1]
, so dp[1][2] = dp[0][1] + 1 = 1
. dp[1][3]
is 1
since A != C
and we pick max{dp[1][2], dp[0][3]}
. And dp[1][4] = dp[0][3] + 1 = 1
.
# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 ? ? ? ?
C 0 ? ? ? ?
B 0 ? ? ? ?
Now let's fill the other part of the table:
# # A B C B
# 0 0 0 0 0
B 0 0 1 1 1
B 0 0 1 1 2
C 0 0 1 2 2
B 0 0 1 2 3
So the length of LCS is dp[4][4] = 3
.