用于查找 2 个字符串之间任意长度的所有共享子字符串，然后计算字符串 2 中出现次数的算法?

编程入门行业动态更新时间:2024-10-21 17:25:08

本文介绍了用于查找 2 个字符串之间任意长度的所有共享子字符串，然后计算字符串 2 中出现次数的算法?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我遇到了一个不寻常的挑战，到目前为止我无法确定最有效的算法来解决这个问题.

以下面2个字符串为例，找出这2个任意长度字符串之间的所有共享子串，并计算所有这些共享子串在字符串2中出现的次数.你的算法还需要能够计算包含大小高达 100MB 或更大的字符串的文件之间的共享子字符串.

示例:

字符串 1:ABCDE512ABC361EG51D

字符串 2:ADE5AHDW4131EG1DG5C

给定这 2 个字符串，该算法将找到以下共享子字符串:A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

然后从这些共同共享的子字符串中，我们会找到每个字符串在字符串 2 中出现的次数.

A:字符串 2 中出现 2 次

C: 字符串 2 中出现 1 次

D: 字符串 2 中出现 3 次

等等.

我为解决这个问题而采取的第一种方法是使用 2 个嵌套的 for 循环强制我通过计算公共共享子字符串的方式 - 显然效率最低，但这是一种快速而肮脏的方式来了解什么预期的输出应该是较小的测试输入和最慢的运行时间，计算包含大小为 50kb 的 ascii 字符串的 2 个文件之间的所有公共共享子字符串大约需要 2 分钟.将大小增加到 1mb 后，由于必须进行大量嵌套迭代来计算它，因此它会突然停止.

下一个方法是使用树 - 看看我可以折衷多少内存来优化计算时间.这种方法要快得多.使用蛮力方法需要 2 分钟的相同的两个 50kb 文件几乎是即时的.运行 1mb 文件的速度仍然非常快(几秒钟)，但随着我继续使用越来越大的文件大小进行测试，由于树的大小，我很快开始遇到内存问题.

注意:字符串文件将只包含 ASCII 字符！

我正在进一步升级，请参阅:

gist.github/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc>

解决方案

这是一个基于遍历输入串联后缀数组的 C 实现，借助最长公共前缀数组.您可以将编程竞赛级 (O(n log^2 n)) 后缀数组实现替换为真正的 (O(n) 或 O(n log n)) 以大幅提高性能.(这样做了，其他一些变化反映了提问者的新要求:github/eisenstatdavid/公共子 .)

#include #include #include #include #include typedef int_fast32_t I32;#define Constrain(expression) _Static_assert(expression, #expression)约束(CHAR_BIT == 8)；#define InputMaxBytes 80000000约束(InputMaxBytes <= (INT_LEAST32_MAX - 2)/2);#define MaxLen (2 * InputMaxBytes + 2)约束(MaxLen InputMaxBytes) {fprintf(stderr, "%s: 文件太大；增加 InputMaxBytes ",文件名);退出(EXIT_FAILURE)；}for (I32 i = 0; i

I've run into an unusual challenge and so far I'm unable to determine the most efficient algorithm to attack this.

Given the following 2 strings as an example, find all commonly shared substrings between the 2 strings of any length, and count the number of occurrences of all of those shared substrings in string 2. Your algorithm also needs to be able to compute shared substrings between files containing strings that are up to 100MB or more in size.

Example:

String 1: ABCDE512ABC361EG51D

String 2: ADE5AHDW4131EG1DG5C

Given these 2 strings this algorithm would find the following shared substrings: A,C,D,E,5,1,3,G,DE,E5,EG,G5,1D,DE5,1EG

And then from these commonly shared substrings, we'd find how many occurences there are of each of them in string 2.

A: 2 occurrences in string 2

C: 1 occurence in string 2

D: 3 occurrences in string 2

etc..

The first approach I took to wrap my head around this problem was brute forcing my way through computing the common shared substrings using 2 nested for loops - obviously the least efficient but it was a quick and dirty way to get an idea of what the expected outputs should be with smaller test input and the slowest possible time to run, which was around 2 minutes to compute all common shared substrings between 2 files containing ascii strings with a size of 50kb. Upping the size to 1mb made this come to a screeching halt due to the massive number of total nested iterations that had to occur to compute this.

The next approach was using trees - seeing how much memory I could trade off to optimize compute time. This approach was much faster. The same two 50kb files that took 2 minute with the brute force method were near instant. Running against 1mb files was very fast too still (seconds) but as I continued to test with larger and larger file sizes, I quickly began running into memory issues due to tree sizes.

Note: The string files will only ever contain ASCII characters!

Edit:

I'm escalating this a bit further, please see:

gist.github/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

解决方案

Here's a C implementation based on traversing the suffix array of the concatenation of the inputs, with the help of the longest common prefix array. You can replace the programming-contest-grade (O(n log^2 n)) suffix array implementation with a real one (O(n) or O(n log n)) for a large performance improvement. (EDIT: did this, with some other changes reflecting the asker's new requirements: github/eisenstatdavid/commonsub .)

#include <inttypes.h> #include <limits.h> #include <stdbool.h> #include <stdio.h> #include <stdlib.h> typedef int_fast32_t I32; #define Constrain(expression) _Static_assert(expression, #expression) Constrain(CHAR_BIT == 8); #define InputMaxBytes 80000000 Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2); #define MaxLen (2 * InputMaxBytes + 2) Constrain(MaxLen <= INT_FAST32_MAX / 2); static I32 Len; static I32 Begin2; static signed char Buf[MaxLen]; static int_least32_t SufArr[MaxLen]; static int_least32_t SufRank[MaxLen]; static int_least32_t NewRank[MaxLen]; static int_least32_t *const LongCommPre = NewRank; // aliased to save space static uint_least64_t Bitmap2[(MaxLen >> 6) + 1]; static int_least32_t SparseCount2[(MaxLen >> 6) + 1]; static int_least32_t *const Stack = SufRank; // aliased to save space static void Slurp(const char *filename) { FILE *stream = fopen(filename, "r"); if (stream == NULL) goto fail; I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream); if (ferror(stream)) goto fail; if (n > InputMaxBytes) { fprintf(stderr, "%s: file is too large; increase InputMaxBytes ", filename); exit(EXIT_FAILURE); } for (I32 i = 0; i < n; i++) { if (Buf[Len + i] < 0) { fprintf(stderr, "%s: file contains non-ASCII byte at offset %" PRIdFAST32 " ", filename, i); exit(EXIT_FAILURE); } } Len += n; if (fclose(stream) == EOF) goto fail; return; fail: perror(filename); exit(EXIT_FAILURE); } static I32 Radix; static int CompareRankPairs(const void *iPtr, const void *jPtr) { I32 i = *(const int_least32_t *)iPtr; I32 j = *(const int_least32_t *)jPtr; if (SufRank[i] < SufRank[j]) return -1; if (SufRank[i] > SufRank[j]) return 1; I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2; I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2; if (iRank < jRank) return -1; if (iRank > jRank) return 1; return 0; } static void BuildSuffixArray(void) { for (I32 i = 0; i < Len; i++) { SufArr[i] = i; SufRank[i] = Buf[i]; } for (Radix = 1; true; Radix *= 2) { qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs); NewRank[0] = 0; for (I32 i = 1; i < Len; i++) { NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0 ? NewRank[i - 1] : NewRank[i - 1] + 1; } for (I32 i = 0; i < Len; i++) { SufRank[SufArr[i]] = NewRank[i]; } if (NewRank[Len - 1] == Len - 1) break; } I32 lenCommPre = 0; for (I32 i = 0; i < Len; i++) { if (SufRank[i] == Len - 1) { LongCommPre[SufRank[i]] = -1; continue; } while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) { lenCommPre++; } LongCommPre[SufRank[i]] = lenCommPre; if (lenCommPre > 0) lenCommPre--; } } static I32 PopCount(uint_fast64_t x) { I32 v = 0; while (x != 0) { x &= x - 1; v++; } return v; } static void BuildCumCount2(void) { for (I32 i = 0; i < Len; i++) { if (SufArr[i] >= Begin2) { Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63); SparseCount2[i >> 6]++; } } for (I32 i = 0; i < (Len >> 6); i++) { SparseCount2[i + 1] += SparseCount2[i]; } } static I32 CumCount2(I32 i) { return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63)); } static void FindCommonStrings(void) { I32 lenCommPre = -1; for (I32 i = 0; i < Len; i++) { while (lenCommPre > LongCommPre[i]) { I32 begin = Stack[lenCommPre]; I32 end = i + 1; I32 count2 = CumCount2(end) - CumCount2(begin); if (count2 > 0 && count2 < end - begin && lenCommPre > 0) { printf("%" PRIdFAST32 " %.*s ", count2, (int)lenCommPre, Buf + SufArr[begin]); } lenCommPre--; } while (lenCommPre < LongCommPre[i]) { lenCommPre++; Stack[lenCommPre] = i; } } } int main(int argc, char *argv[]) { if (argc != 3) { fputs("usage: commonsub needle haystack ", stderr); exit(EXIT_FAILURE); } Len = 0; Slurp(argv[1]); Buf[Len] = -1; Len++; Begin2 = Len; Slurp(argv[2]); Buf[Len] = -2; // sentinel BuildSuffixArray(); if (false) { for (I32 i = 0; i < Len; i++) { printf("%" PRIdFAST32 " %" PRIdLEAST32 " %" PRIdLEAST32 " %.*s ", i, SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]), Buf + SufArr[i]); } } BuildCumCount2(); FindCommonStrings(); }