Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results
17th International Workshop on Database and Expert Systems Applications (DEXA'06) (2006)
Sept. 4, 2006 to Sept. 8, 2006
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/DEXA.2006.80
Witold Litwin , Dauphine University, France
Riad Mokadem , Dauphine University, France
Philippe Rigaux , Dauphine University, France
Schwarz Thomas , Santa Clara University, California
We propose a novel string (pattern) matching algorithm called n-gram search. We intend it for the records stored once and searched many times in a database or a file, especially those organized in a Scalable Distributed Data Structure, (SDDS), over a grid or a structured P2P net. We presume that the records are encoded into their cumulative algebraic signatures, providing incidental confidentiality of stored data. The search starts with pre-processing the pattern, calculating the logarithmic algebraic signature (LAS) of the pattern and the LASs of every n-gram in it. The value of n \geqslant 1 is a parameter that one may tune. The search attempts to match the LASs of n-grams in the pattern towards dynamically calculated LASs, sampled over n-grams in the records. A mismatch generates a shift of up to K-n symbols towards next sample, where K is the pattern length. The whole process is parallel over the SDDS servers and does not require any local decoding. For an M-symbol long record, the unsuccessful search, measured as number of match attempts, costs O ((M-K) / (Kn+ 1)). The 2-grams should typically suffice, leading to O ((M-K) / (K-1)). We show that the algorithm particularly efficient for larger strings and records, i.e., with e-documents or DNA data. Preliminary results show then the n-gram search about (K - n + 1) faster than our previous algorithms and among the fastest known, e.g., probably often faster than Boyer-Moore. % MathType!MTEF!2!1!+- % feaafiart1ev1aaatCvAUfeBSjuyZL2yd9gzLbvyNv2CaerbuLwBLn % hiov2DGi1BTfMBaeXatLxBI9gBaerbd9wDYLwzYbItLDharqqtubsr % 4rNCHbGeaGqiVu0Je9sqqrpepC0xbbL8F4rqqrFfpeea0xe9Lq-Jc9 % vqaqpepm0xbba9pwe9Q8fs0-yqaqpepae9pg0FirpepeKkFr0xfr-x % fr-xb9adbaqaaeGaciGaaiaabeqaamaabaabaaGcbaGaeyyzImlaaa!37B3! \[ \geqslant \]
SDDS, grid, structured P2P, scalable distributed pattern matching, algebraic signatures.
S. Thomas, W. Litwin, R. Mokadem and P. Rigaux, "Invited Paper: Pattern Matching Using n-Gram Sampling of Cumulative Algebraic Signatures: Preliminary Results," 17th International Workshop on Database and Expert Systems Applications (DEXA'06)(DEXA), Krakow, Poland, 2006, pp. 251-255.