Text Comparison Algorithms VI: Computing Maximal Common Subsequences in Linear Space (Translation)

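I have been studying text comparison algorithms for some time. Recently I worked through "A Linear Space Algorithm for Computing Maximal Common Subsequences" by D.S. Hirschberg, published in 1975. Many other papers cite it, which speaks to its quality. Hirschberg also wrote a number of other papers on LCS, and those are classics as well.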

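After studying the paper, I translated it into Chinese. My English and my grammar are both limited, so the translation is of modest quality; corrections from readers are welcome.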

Introduction

导论

The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. For strings of length 1000 (assuming coefficients of 1 microsecond and 1 byte) the solution would require 10⁶ microseconds (one second) and 10⁶ bytes (1000K bytes). The former is easily accommodated, the latter is not so easily obtainable. If the strings were of length 10000, the problem might not be solvable in main memory for lack of space.

过去求解两个字符串的最长公共子序列的问题，需要花费二次的时间和空间。在求解两个长1000的字符串（假定时间系数为1微秒，空间系数是1字节）过程中需要10⁶微秒（1秒）和10⁶字节（1000K字节）。前者很容易解决，而后者不是很容易满足的。如果两个字符串的长度为10000，则可能没有足够的主内存空间来求解这个问题。

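Note: the paper was published in 1975, and by the standards of the computers of that time 1000K was an enormous amount of memory. Even on today's machines, though, comparing large texts runs into trouble without a good algorithm. For example, if each text is 1M characters long, O(MN) space means about 1T of storage, which is a staggering amount.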

We present an algorithm which will solve this problem in quadratic time and in linear space. For example, assuming coefficients of 2 microseconds and 10 bytes, for strings of length 1000 we would require 2 seconds and 10K bytes; for strings of length 10000 we would require a little over 3 minutes and 100K bytes.

我们提出一个解决该问题算法，该算法花费二次的时间和线性空间。举例来说，假定时间系数是2微秒，空间系数为10字节。求解两个长1000的字符串，我们要花费2秒和10K字节；求解两个长10000的字符串，我们花费仅仅增加到3分钟和100K字节。
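The figures quoted in the introduction can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function names and the default coefficients (taken from the paper's two examples) are assumptions made for this example, not part of the paper.

```python
def quadratic_cost(n, time_coeff_us=1, space_coeff_bytes=1):
    """Return (seconds, bytes) when both time and space grow as n * n."""
    return time_coeff_us * n * n / 1e6, space_coeff_bytes * n * n


def linear_space_cost(n, time_coeff_us=2, space_coeff_bytes=10):
    """Return (seconds, bytes) when time grows as n * n but space as n."""
    return time_coeff_us * n * n / 1e6, space_coeff_bytes * n


print(quadratic_cost(1000))       # (1.0, 1000000): one second, 10^6 bytes
print(linear_space_cost(1000))    # (2.0, 10000): two seconds, 10K bytes
print(linear_space_cost(10000))   # (200.0, 100000): 200 s, a little over 3 minutes
```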

String C=c₁c₂……c_p is a subsequence of string A=a₁a₂……a_m if and only if there is a mapping F:｛1，2，……，p｝→｛1，2，……，m｝ such that F(i)=k only if c_i is a_k and F is a monotone strictly increasing function (i.e. F(i)=u, F(j)=v, and i<j imply that u<v).

字符串C=c₁c₂……c_p是字符串A=a₁a₂……a_m的子序列，当且仅当存在一个映射F:｛1，2，……，p｝→｛1，2，……，m｝，使得F(i)=k时必有c_i=a_k，并且F是一个严格单调递增函数（即：若F(i)=u，F(j)=v，且i<j，则u<v）。

String C is a common subsequence of strings A and B if and only if C is a subsequence of A and C is a subsequence of B

字符串C是字符串A和B的公共子序列，当且仅当C既是A的子序列，同时C又是B的子序列

The problem can be stated as follows : Given strings A=a₁a₂……a_m and B=b₁b₂……b_n (over alphabet Σ), find a string C= c₁c₂……c_p such that C is a common subsequence of A and B and p is maximized

求解最长公共子序列问题定义如下：给定字符串A=a₁a₂……a_m和B=b₁b₂……b_n（覆盖字符集Σ），找到一个字符串C=c₁c₂……c_p，C是A和B的公共子序列之中p最大的那个

We call C an example of a maximal common subsequence.

我们又把C称作最大公共子序列

Notation. For string D=d₁d₂……d_r, D_kt is d_kd_k+1……d_t if k≤t, and d_kd_k-1……d_t if k≥t. When k>t, we shall write ~D_kt so as to make clear that we are referring to a “reverse substring” of D.

标记。对于字符串D=d₁d₂……d_r，D_kt表示为d_kd_k+1……d_t (k≤t)；d_kd_k-1……d_t (k≥t)。当k>t时，我们标记为~D_kt，以表明所指的是D的“反转子串”。

L(i,j) is the maximum length possible of any common subsequence of A_1i and B_1j

L(i,j)表示为A_1i和B_1j的所有可能的公共子序列的长度中最大值。

x||y is the concatenation of strings x and y

x||y表示为字符串x和y的连接。

We present the algorithm described in [3], which takes quadratic time and space

我们提到的算法出自[3]，它花费二次时间和空间

Algorithm A

算法A

Algorithm A accepts as input strings A_1m and B_1n and produces as output the matrix L (where the element L(i,j) corresponds to our notation of maximum length possible of any common subsequence of A_1i and B_1j)

算法A接受输入字符串A_1m 和B_1n，并且计算输出矩阵L（矩阵元素L(i,j)如标记中所称，表示为A_1i和B_1j的所有可能的公共子序列的长度中最大值。）

ALG A(m,n,A,B,L)

1. Initialization: L(i,0)←0 [i=0……m];

L(0,j)←0 [j=0……n];

2. for i←1 to m do

begin

3. for j←1 to n do

if A(i)=B(j) then L(i,j)←L(i-1,j-1)+1

else L(i,j)←max{L(i,j-1),L(i-1,j)}

end
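As a rough illustration, Algorithm A can be transcribed into Python as follows. The function name `lcs_table` and the use of 0-based Python strings (so that A(i) becomes `a[i - 1]`) are choices made for this sketch, not part of the paper.

```python
def lcs_table(a, b):
    """Return the (m+1) x (n+1) matrix L, where L[i][j] is the length of a
    longest common subsequence of a[:i] and b[:j]."""
    m, n = len(a), len(b)
    # Step 1: row 0 and column 0 are initialized to zero.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    # Steps 2 and 3: fill the matrix row by row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i][j - 1], L[i - 1][j])
    return L


print(lcs_table("ABCBDAB", "BDCABA")[7][6])  # 4
```

The whole matrix is kept in memory, which is exactly the O(mn) space cost that the rest of the paper sets out to avoid.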

Proof of correctness of Algorithm A

论证算法A的正确性

To find L(i,j), let a common subsequence of that length be denoted by S(i,j)=c₁c₂……c_p. If a_i=b_j, we can do no better than by taking c_p=a_i and looking for c₁……c_p-1 as a common subsequence of length L(i,j)-1 of strings A_1,i-1 and B_1,j-1. Thus, in this case, L(i,j)=L(i-1,j-1)+1.

为了计算L(i,j)，把长度和其相等的公共子序列定义为S(i,j)=c₁c₂……c_p，如果a_i=b_j，则c_p=a_i，并且c₁……c_p-1是A_1,i-1和B_1,j-1的最长公共子序列，长度为L(i,j)-1。因此，在这种情况下，L(i,j)=L(i-1,j-1)+1

If a_i≠b_j, then c_p is a_i, b_j, or neither (but not both). If c_p is a_i, then a solution C to problem (A_1i,B_1j) [written P(i,j)] will be a solution to P(i,j-1) since b_j is not used. Similarly, if c_p is b_j, then we can get a solution to P(i,j) by solving P(i-1,j). If c_p is neither, then a solution to either P(i-1,j) or P(i,j-1) will suffice. In determining the length of the solution, it is seen that L(i,j) [corresponding to P(i,j)] will be the maximum of L(i-1,j) and L(i,j-1).

如果a_i≠b_j，则c_p要么是a_i，要么是b_j，要么两者都不是（肯定不会都是）。如果c_p=a_i，因为b_j不是C的元素，则求解C的问题(A_1i,B_1j)[写作P(i,j)]等同于求解P(i,j-1)。同样的，如果c_p=b_j，求解P(i,j)等同于求解P(i-1,j)。如果，c_p两者都不是，则必是P(i-1,j)和P(i,j-1)中的一个。求解的长度称为L(i,j)[和P(i,j)相一致]将会是L(i-1,j) 和L(i,j-1)中的最大值。

Time and Space Analysis of Algorithm A

算法A的时间和空间分析

The if statement in Algorithm A will be executed exactly mn times. Input and output arrays require m+n+(m+1)(n+1) locations. Thus Algorithm A requires O(mn) time and O(mn) space.

算法A中的判断语句将会精确的执行mn次。输入和输出占用的空间需要m+n+(m+1)(n+1)位置。因此，算法A需要O(mn)时间和O(mn)空间。

　　Algorithm B

　　算法B

　　In Algorithm A, the derivation of row i of matrix L(L(i,1), L(i,2),…… ,L(i,n)) requires only row i-1 of matrix L. Thus , a slight modification yields Algorithm B, which accepts as input strings A_1m and B_1n and produces as output vector LL where LL(j) will have the value L(m,j)

在算法A中，推导出L矩阵中的第i行（L(i,1), L(i,2), ……,L(i,n)）只需要矩阵L的第i-1行。一个细小的改观生成了算法B，该算法接受输入字符串A_1m 和B_1n，输出向量LL，LL(j)的值就是矩阵L中的L(m,j)。

ALG B(m,n,A,B,LL)

1. Initialization:K(1,j)←0 [j=0……n];

2. for i←1 to m do

begin

3. K(0,j) ←K(1,j) [j=0……n]

4. for j←1 to n do

if A(i)=B(j) then K(1,j) ←K(0,j-1)+1

else K(1,j) ←max{K(1,j-1),K(0,j)}

end

5. LL(j) ←K(1,j) [j=0……n]
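A minimal Python sketch of Algorithm B follows. It keeps the two rows K(0,·) and K(1,·) as two lists; the names `lcs_last_row`, `k0` and `k1` are invented for this example. The outer loop runs over the characters of A, that is, m times.

```python
def lcs_last_row(a, b):
    """Return the vector LL, where LL[j] = L(m, j) is the LCS length of
    the whole of a and the prefix b[:j], using only two rows of storage."""
    n = len(b)
    k1 = [0] * (n + 1)            # step 1: K(1, j) <- 0
    for ch in a:                  # step 2: one pass per character of A
        k0 = k1[:]                # step 3: K(0, j) <- K(1, j)
        for j in range(1, n + 1): # step 4: fill the current row
            if ch == b[j - 1]:
                k1[j] = k0[j - 1] + 1
            else:
                k1[j] = max(k1[j - 1], k0[j])
    return k1                     # step 5: LL(j) <- K(1, j)


print(lcs_last_row("ABCBDAB", "BDCABA"))  # [0, 1, 2, 2, 3, 4, 4]
```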

Proof of Correctness of Algorithm B

论证算法B的正确性

Algorithm B is Algorithm A with K(0,j) in statement 4 of ALG B having the same value as L(i-1,j) in statement 3 of ALG A and K(1,j) receiving the same value as L(i,j). We show this by induction on i.

算法B和算法A等价，就像ALG B中的第四步计算K(0,j)的值和ALG A中的第三步计算L(i-1,j)的值是一样的。同理，K(1,j)的值和L(i,j)的值一样。下面我们将根据i的值进行归纳说明。

For i=1 , L(i-1,j) is zero (initialized in statement 1 of ALG A). In ALG B, K(0,j) received in statement 3 the value of K(1,j) , which was just initialized to zero in statement 1.

当i=1，L(i-1,j)为0（在ALG A中的第一步初始化数据后）。在ALG B中的第3步中，K(0,j)从K(1,j)获得0值，因为K(1,j)在第一步中就已经初始化为0了。

Assume K(0,j) has the same value as does L(i-1,j). Then K(1,j) receives the same value as L(i,j) since the assignment statements within the inner loops of ALG A and ALG B are equivalent. For the next iteration, K(0,j) receives (in statement 3 of ALG B) the value of K(1,j), which has the value of L(i,j) as shown above.

假定K(0,j)和L(i-1,j)值一样。那么K(1,j)像L(i,j)一样获取同样的值，因为在ALG A和ALG B中指定的循环步骤是一致的。在下一个循环之前，K(0,j)获取K(1,j)的值（在ALG B中的第3步），就像上面所示，K(1,j)的值就是L(i,j)。

Time and Space Analysis of Algorithm B

算法B的时间和空间分析

　　As in Algorithm A , the if statement in Algorithm B is executed exactly mn times. Input and output arrays require m+n+(n+1) locations. Local storage requires 2(n+1) locations. Thus Algorithm B requires O(mn) time and O(m+n) space.

和算法A一样，算法B的判断语句也会精确的执行mn次。输入和输出占用m+n+(n+1)位置，算法内部占用2(n+1)位置。因此算法B需要O(mn)时间和O(m+n)空间。

Note: the algorithm can be optimized further so that its local storage (the two rows of K) takes roughly n locations instead of 2n.
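One possible way to realize the single-row idea is sketched below. This is not from the paper: it is a common variant that overwrites one row in place and carries the diagonal value L(i-1,j-1) in a scalar. The names `lcs_last_row_one_array`, `diag` and `above` are made up for the example.

```python
def lcs_last_row_one_array(a, b):
    """Same result as the two-row Algorithm B, but with a single row
    plus one scalar holding the diagonal entry L(i-1, j-1)."""
    n = len(b)
    row = [0] * (n + 1)
    for ch in a:
        diag = 0                      # L(i-1, 0) is always zero
        for j in range(1, n + 1):
            above = row[j]            # L(i-1, j), read before overwriting
            if ch == b[j - 1]:
                row[j] = diag + 1
            else:
                row[j] = max(row[j - 1], above)
            diag = above              # becomes L(i-1, j-1) for the next j
    return row


print(lcs_last_row_one_array("ABCBDAB", "BDCABA"))  # [0, 1, 2, 2, 3, 4, 4]
```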

We shall show that using Algorithm B for appropriate substrings of A and B will enable us to recover a maximal common subsequence of A and B in linear space.

接下来，我们要用算法B在线性空间中利用A和B的合适子串来找回A和B的最大公共子序列。

Define L*(i,j) to be the maximum length of common subsequence of A_i+1,m and B_j+1,n.

定义L*(i,j)是A_i+1,m和B_j+1,n的最大公共子序列的长度

We note that L(i,j), j=0……n, are the maximum lengths of common subsequences of A_1i and various prefixes of B_1n. We also note that L*(i,j), j=0……n, are the maximum lengths of common subsequences of ~A_m,i+1 and various prefixes of ~B_n,1. Choosing i to be m/2 and using the theorem below, we shall be able to determine a prefix B₁ of B which can be matched with the first half A₁ of A (and the corresponding suffix B₂ of B matched with the last half A₂ of A) such that a maximal common subsequence (mcs) of A₁ and B₁ concatenated with an mcs of A₂ and B₂ will be an mcs of A and B.

　　我们注意到L(i,j) j=0……n表示A_1i和B_1n的一些前缀的公共子序列的长度最大值。我们同时注意到L*(i,j) j=0……n表示~A_m,i+1和~B_n,1的一些前缀的公共子序列的长度最大值。在下面的定理中，令i为m/2，我们能确定B的一个前缀B₁能和A的前半部分A₁匹配（同时相对应的B的后缀B₂和A的后半部分A₂匹配）。如此，A₁和B₁的最大公共子序列和A₂和B₂的最大公共子序列连接起来就是A和B的最大公共子序列。

Define 　　M(i)=max{L(i,j)+L*(i,j)}　　0≤j≤n

　　THEOREM　For 　0≤i≤m, M(i)=L(m,n)

PROOF. Let M(i)=L(i,j)+L*(i,j) for some j. Let S(i,j) be any maximal common subsequence of A_1i and B_1j; let S*(i,j) be any maximal common subsequence of A_i+1,m and B_j+1,n. Then C=S(i,j) || S*(i,j) is a common subsequence of A_1m and B_1n of length M(i). Thus L(m,n)≥M(i).

证明。当j取某些值的时候，M(i)=L(i,j)+L*(i,j)。让S(i,j)是某个A_1i和B_1j的最大公共子序列；让S*(i,j)是某个A_i+1,m和B_j+1,n的最大公共子序列。那么C=S(i,j) || S*(i,j)就是A_1m和B_1n的一个公共子序列，且长度为M(i)。因此L(m,n)≥M(i)

Let S(m,n) be any maximal common subsequence of A_1m and B_1n. S(m,n) is a subsequence of B that is S₁ (a subsequence of A_1i) || S₂ ( a subsequence of A_i+1,m). Then there exists j such that S₁ is a subsequence of B_1j and S₂ is a subsequence of B_j+1,n . By definition of L and L*, |S₁|≤L(i,j) and |S₂|≤L*(i,j). Thus L(m,n)=|S(m,n)|=|S₁|+|S₂|≤L(i,j)+L*(i,j) ≤M(i)

让S(m,n)是某个A_1m和B_1n的最大公共子序列。S(m,n)是B的一个子序列，且就是S₁ (A_1i的一个子序列) || S₂ (A_i+1,m的一个子序列)。那么，必存在j，使得S₁是B_1j的子序列，同时S₂ 是B_j+1,n的子序列。根据L和L*的定义，|S₁|≤L(i,j) 同时|S₂|≤L*(i,j)。因此，L(m,n)=|S(m,n)|=|S₁|+|S₂|≤L(i,j)+L*(i,j) ≤M(i)

Note: this is a standard proof technique. To show A=B, first show A≥B, then show A≤B.
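The theorem can be checked numerically on a small example. The sketch below is only a sanity check under assumptions of its own: `lcs_row` is a one-row LCS-length routine written for this example, and `M` computes the maximum over j of L(i,j)+L*(i,j), obtaining L* by running the same routine on the reversed suffix of A and the reversed B.

```python
def lcs_row(a, b):
    """Return row[j] = LCS length of a and b[:j], for j = 0..len(b)."""
    row = [0] * (len(b) + 1)
    for ch in a:
        diag = 0
        for j in range(1, len(b) + 1):
            above = row[j]
            row[j] = diag + 1 if ch == b[j - 1] else max(row[j - 1], above)
            diag = above
    return row


def M(a, b, i):
    """max over j of L(i, j) + L*(i, j), for a split of a after position i."""
    n = len(b)
    forward = lcs_row(a[:i], b)                # forward[j] = L(i, j)
    backward = lcs_row(a[i:][::-1], b[::-1])   # backward[n - j] = L*(i, j)
    return max(forward[j] + backward[n - j] for j in range(n + 1))


a, b = "ABCBDAB", "BDCABA"
print([M(a, b, i) for i in range(len(a) + 1)])  # every entry equals L(m, n) = 4
```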

Algorithm C

算法C

We now apply the above theorem recursively to divide a given problem into two smaller problems until we obtain a trivial subproblem.

现在，我们根据上面的定理递归的将一个给定的问题分成两个小问题直到能获得一个不成问题的子问题。

Algorithm C accepts as input strings A and B (of length m and n) and produces as output a common subsequence C of A and B that is of maximum length p.

算法C获得输入字符串A和B（长度分别为m和n），然后计算输出A和B的一个公共子序列C，且C是最长的，长度为p

ALG C(m,n,A,B,C)

1. If problem is trivial , solve it:

If n=0 then C←e (e is empty string)

Else if m=1 then if ∃j≤n such that A(1)=B(j)

Then C←A(1)

Else C←e

2. Otherwise, split problem:

Else begin i←⌊m/2⌋

3. Evaluate L(i,j) and L*(i,j) [j=0……n]:

ALG B(i,n,A_1i,B_1n,L1)

ALG B(m-i,n,~A_m,i+1,~B_n,1,L2)

4. Find j such that L(i,j)+L*(i,j)=L(m,n) using theorem:

M←max{L1(j)+L2(n-j)}; 0≤j≤n

k←min j such that L1(j)+L2(n-j)=M

5. Solve simpler problems:

ALG C (i,k,A_1i,B_1k,C1)

ALG C (m-i,n-k,A_i+1,m,B_k+1,n,C2)

6. Give output

C←C1 || C2

end
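The steps of Algorithm C can be sketched in Python as below. Several things here are assumptions of the example rather than the paper's method: the inputs are Python `str` values, the names `lcs_row` and `hirschberg_lcs` are invented, a guard for m=0 is added, and substrings are passed as slices for readability. The paper instead passes start and end positions into shared storage, which is what its O(m+n) space bound relies on, so this sketch does not strictly meet that bound.

```python
def lcs_row(a, b):
    """Algorithm B in one-row form: row[j] = LCS length of a and b[:j]."""
    row = [0] * (len(b) + 1)
    for ch in a:
        diag = 0
        for j in range(1, len(b) + 1):
            above = row[j]
            row[j] = diag + 1 if ch == b[j - 1] else max(row[j - 1], above)
            diag = above
    return row


def hirschberg_lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # Step 1: trivial problems (the m == 0 guard is added for safety).
    if n == 0 or m == 0:
        return ""
    if m == 1:
        return a if a in b else ""
    # Step 2: split A in half.
    i = m // 2
    # Step 3: L(i, j) forwards and L*(i, j) via the reversed strings.
    l1 = lcs_row(a[:i], b)
    l2 = lcs_row(a[i:][::-1], b[::-1])
    # Step 4: smallest k that attains the maximum of L1(j) + L2(n - j).
    best = max(l1[j] + l2[n - j] for j in range(n + 1))
    k = min(j for j in range(n + 1) if l1[j] + l2[n - j] == best)
    # Steps 5 and 6: solve both halves and concatenate.
    return hirschberg_lcs(a[:i], b[:k]) + hirschberg_lcs(a[i:], b[k:])


print(hirschberg_lcs("ABCBDAB", "BDCABA"))  # one common subsequence of length 4
```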

Proof of Correctness of Algorithm C

论证算法C的正确性

L1(j) produced by the first call to ALG B in line 3 is equal to L(i,j). This was shown in the proof of correctness of Algorithm B. Similarly , L2(j) is equal to the maximum length of common subsequence (max lcs) of ~A_m,i+1 and ~B_n,n-j+1 by the proof of correctness of Algorithm B .

在第3行第一次调用ALG B计算出的L1(j)等价于L(i,j)，这个在“论证算法B的正确性”中就说明了。同样的，L2(j)等同于~A_m,i+1和~B_n,n-j+1的公共子序列中的长度最大值也在“论证算法B的正确性”中说明了。

L2(n-j)=max lcs of ~A_m,i+1 and ~B_n,j+1=max lcs of A_i+1,m and B_j+1,n=L*(i,j)

By our theorem , we can find k (as in line 4) such that L(i,k)+L*(i,k)=L(m,n). So there must exist solutions C1 and C2 to the subproblems (A_1i, B_1k) and (A_i+1,m,B_k+1,n) such that C1 || C2 will be a common subsequence of A and B of length L(m,n). The solutions to the subproblems are obtained in line 5 and are added together in line 6 to obtain the final output .

根据我们的定理，我们能找到k（在第4步），使得L(i,k)+L*(i,k)=L(m,n)。那么就一定存在C1和C2，分别是子问题(A_1i, B_1k)和子问题(A_i+1,m,B_k+1,n)的解，而C1 || C2是A 和B的长度为L(m,n)的公共子序列。求解过程在第5步获得子问题的解，并在第6步将两个子问题的解连接起来并最终输出。

Time Analysis of Algorithm C

算法C的时间分析

For P(1,n) we look for a single match . For some constants c₁ and c₂ this is time-bounded by c₁n+c₂

针对P(1,n)，我们找到一个单独的匹配。给定常量c₁和c₂，时间临界点为c₁n+c₂

For P(2m,n), let operations on vectors that are linear in m or n be time-bounded by c₃m+c₄n+c₅. That leaves two calls to ALG B and two calls to ALG C. The calls to ALG B are bounded by c₆mn by the time analysis of ALG B. Assume P(m,n) is time-bounded by d₁mn+d₂ (d₁≥c₁, d₂≥c₂). Then the calls to ALG C will be time-bounded by d₁mk+d₂ and d₁m(n-k)+d₂. Thus a total time bound T for P(2m,n) will be T=(d₁+c₆)mn+c₃m+c₄n+c₅+2d₂. For n≥1, T≤(d₁+c₆+c₃+c₄+c₅+d₂)mn+d₂. For n=0, let T≤d₂. Then to be consistent with our assumption on the time bound of P(m,n), we must have d₁+c₆+c₃+c₄+c₅+d₂≤2d₁, which is realizable by letting d₁=c₆+c₃+c₄+c₅+d₂.

针对P(2m,n)，在向量的操作上，时间是关于m或者n线性的，时间临界点是c₃m+c₄n+c₅。还要执行两次ALG B和两次ALG C。根据ALG B的时间分析，执行ALG B的时间临界点为c₆mn。假定P(m,n)的时间临界点为d₁mn+d₂(d₁≥c₁,d₂≥c₂)。那么执行两次ALG C的时间临界点分别为d₁mk+d₂和d₁m(n-k)+d₂。因此，P(2m,n)的总共时间临界点T，将会是T=(d₁+c₆)mn+c₃m+c₄n+c₅+2d₂。当n≥1，T≤(d₁+c₆+c₃+c₄+c₅+d₂)mn+d₂。当n=0时，T≤d₂。那么就象我们始终如一的假定P(m,n)的时间临界点，我们一定会得到d₁+c₆+c₃+c₄+c₅+d₂≤2d₁。不妨写成d₁=c₆+c₃+c₄+c₅+d₂。

Thus Algorithm C has an O(mn) time bound.

因此算法C需要时间O(mn)。

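Note: because my English is limited, this passage reads stiffly in translation. In practice, once you see T=(d₁+c₆)mn+c₃m+c₄n+c₅+2d₂, you can already tell that Algorithm C runs in O(mn) time, because the highest-order term of the expression is mn.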

Space Analysis of Algorithm C

算法C的空间分析

We assume that vectors A and B are in common storage and substrings can be transferred as arguments by giving initial and final locations.

我们假设向量A和B存储在公共空间并且他们的子串能作为传递参数在初始化过程和确定的位置。

Then ,during execution, the calls to ALG B use temporary storage which is linear in m and n (see space analysis of Algorithm B) . It is seen that ,exclusive of recursive calls to ALG C , ALG C uses a constant amount of memory space. There are 2m-1 calls to ALG C (proven below) , and so ALG C requires memory space proportional to m and n , i.e. O(m+n) space.

那么在整个计算过程中，调用ALG B临时存储空间是和m和n的线性相关的（在算法B的空间分析里说明）。这就像是，独享的递归调用ALG C，ALG C占用一块总量固定的内存空间。一共要调用2m-1次ALG C（在后面证明），那么所以ALG C需要的内存空间为和m和n成比例，也就是O(m+n)空间

Proof That There Are 2m-1 Calls to ALG C

证明，一共调用2m-1次ALG C

Let m≤2^r. If r is zero, then m is one, and there is 2·1-1=1 call to ALG C.

设m≤2^r。如果r是0，那么m是1，一共有2·1-1=1次调用ALG C

Assume that for m≤2^r=M there are 2m-1 calls to ALG C. For m’≤2^(r+1)=2M, i will be set equal to at most M in line 2. There will be two calls to ALG C with first parameters m₁ and m₂ such that m₁+m₂=m’ and both m₁ and m₂ are at most M. By assumption, each of these calls will generate a total of 2m_i-1 calls to ALG C. Adding in the initial call results in a total of (2m₁-1)+(2m₂-1)+1=2(m₁+m₂)-1=2m’-1 calls.

假设当m≤2^r=M时，一共有2m-1次调用ALG C。当m’≤2^(r+1)=2M时，在第2步，i的值最多为M。此时会调用2次ALG C，第一个参数分别是m₁和m₂，且m₁+m₂=m’，m₁和m₂都不超过M。根据假设，这两次调用各自总共产生2m_i-1次对ALG C的调用。加上最初的那次调用，总计为(2m₁-1)+(2m₂-1)+1=2(m₁+m₂)-1=2m’-1次调用。

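Note: a standard proof by induction. First show that the claim holds for r=0, then assume it holds for m≤2^r=M, and show that it holds for m≤2^(r+1)=2M.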
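The count can also be checked directly by following the recursion on the first parameter alone. The helper `calls_to_alg_c` below is hypothetical and makes one simplifying assumption: it ignores the n=0 shortcut in step 1, which can only end the recursion earlier and so reduce the number of calls.

```python
def calls_to_alg_c(m):
    """Count calls made by ALG C for first parameter m >= 1, assuming the
    recursion always continues until m == 1."""
    if m == 1:
        return 1                      # trivial problem: a single call
    i = m // 2                        # the split chosen in step 2
    # This call, plus everything generated by the two recursive calls.
    return 1 + calls_to_alg_c(i) + calls_to_alg_c(m - i)


print([calls_to_alg_c(m) for m in range(1, 9)])  # [1, 3, 5, 7, 9, 11, 13, 15]
```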

Algorithm C can be modified to find the edit distance between two strings (as defined in [3]). In this case we would seek to minimize the cost D(m,n), and our theorem would become: for all i,

算法C能修改成找寻两个字符串的编辑距离（在[3]中的定义）。在这种情况下，我们设定最小的D(m,n)。在算法的定理中将会是

D(m,n)=min{D(i,j)+D*(i,j)}　　0≤j≤n

The modified Algorithm C would split problems in half by the above theorem, using a modified Algorithm B to evaluate D(i,j) and D*(i,j), and call itself recursively.

在上面的定理中，修改过的算法C将会对半分割问题，用修改过的算法B计算D(i,j) 和D*(i,j)，然后再自身递归调用。
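A rough sketch of what the modified Algorithm B and the split identity might look like is given below. It assumes unit costs for insertion, deletion and substitution (plain Levenshtein distance); the cost model in [3] is more general, and the names `edit_distance_row` and `split_min` are invented for this example.

```python
def edit_distance_row(a, b):
    """Return row[j] = edit distance between a and b[:j], using one row
    of storage and unit costs (an assumption of this sketch)."""
    n = len(b)
    row = list(range(n + 1))          # D(0, j) = j insertions
    for i, ch in enumerate(a, start=1):
        diag = row[0]                 # D(i-1, 0)
        row[0] = i                    # D(i, 0) = i deletions
        for j in range(1, n + 1):
            above = row[j]            # D(i-1, j)
            cost = 0 if ch == b[j - 1] else 1
            row[j] = min(above + 1, row[j - 1] + 1, diag + cost)
            diag = above
    return row


def split_min(a, b, i):
    """min over j of D(i, j) + D*(i, j) for a split of a after position i."""
    n = len(b)
    forward = edit_distance_row(a[:i], b)
    backward = edit_distance_row(a[i:][::-1], b[::-1])
    return min(forward[j] + backward[n - j] for j in range(n + 1))


print(edit_distance_row("kitten", "sitting")[-1])           # 3
print([split_min("kitten", "sitting", i) for i in range(7)])  # all equal to 3
```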

Received May 1974;revised November 1974

References

参考文章

1. Chvatal, V., Klarner, D.A., and Knuth, D.E. Selected combinatorial research problems. STAN-CS-72-292, Stanford U. (June 1972), 26.

2. Private communication from D. Knuth to J.D. Ullman.

3. Wagner, R.A., and Fischer, M.J. The string-to-string correction problem. J. ACM 21, 1 (Jan. 1974), 168-173.

4. Aho, A.V., Hirschberg, D.S., and Ullman, J.D. Bounds on the complexity of the longest common subsequence problem. Proc. 15th Ann. Symp. on Switching and Automata Theory, 1974, pp. 104-109.

This article is reposted from 万仓一黍's blog on cnblogs. Original link: http://www.cnblogs.com/grenet/archive/2011/02/27/1959223.html. To repost it, please contact the original author.
