Thesis | Approximate String Matching: Theory and Applications

ByIbrahim Chegrane

Thesis | Approximate String Matching: Theory and Applications

Research filed(s): Text Algorithms, String Algorithms
Thesis title: Approximate String Matching: Theory and Applications
Degree: Ph.D in Computer Science
Supervisor(s): Dr Nacéra Bensaou, Dr Djamal Belazzougui, Dr Mathieu Raffinot

Where did you undertake your research? I undertook the research of my thesis in the Laboratory for Research in Artificial Intelligence (LRIA) at the University of Sciences and Technology Houari Boumediene (USTHB) in Algiers, Algeria. Part of my thesis work was done at the Laboratoire d’Informatique Algorithmique: Fondements et Applications (LIAFA) Paris Diderot, Paris 7 University, France.

What is your thesis about?
The research topic of my thesis is ”Approximate string matching: theory and applications” where we build indices for text and dictionaries in a way that allows fast, exact and approximate search, allowing for typographical errors.

An exact search consists of finding a query word without error. For example, looking for the word ibrahim, so we find exactly that word in the text (or in the dictionary) and/or all its positions. In practice, the exact search is not always adequate if the query word and/or words in the dictionary or text contain typographical errors. Depending on the domain, errors may be due to several reasons: for example, a typing error during a quick entry, or texts that come from an Optical Character Recognition tool (OCR) may contain errors of miss-recognition, etc. Approximate search solves this problem by handling errors and finding occurrences that are lexically close to query word. For example, looking for the word ibrahim, we find: ibrahimovic, abraham, blahim, brahim, etc.

Text Algorithms, is a field that deals with the study of textual data structures and text algorithms. The question for me was; what contribution can I bring that will build on existing work in this area? In turns out that researchers in this area have found that there is a gap between theoretical and practical results in the string matching domain. That is, a number of improvements have been made to several methods of solving this problem in terms of dealing with its theoretical complexity without considering practical aspects if their application.

This poses an important problem, particularly to the software developer community, or bioinformatics researchers, but also to all non-specialist text algorithmic users when it comes to choosing and using a good pattern matching algorithm.

This was the main motivation behind my doctoral thesis.

This thesis deals with the search for efficient solutions in practice, without compromising theoretical complexities associated with performance-related problems of approximate string matching in different contexts of its applications. The goal is to propose algorithms: efficient, valid in practice and competitive, and to produce at the end libraries that can be used directly by developers and researchers.

How did you get into this research?
I did my Bachelor and Master degree projects  in the field of bioinformatics under the supervision of Mrs. C. Ighilaza where I worked on the problem of multiple alignment of biological sequences, which is an application domain for text algorithms. In both projects, I got a score of 17/20. I learned a lot, but I wanted to continue my studies in order to progress more and acquire new knowledge and skills. I entered the doctoral recruitment competition, and I was accepted.

Mrs C.Ighilaza advised me and encouraged me to work with Dr Nacéra Bensaou, who works in the field of text algorithms, and who was the jury president of my master degree project, so she knew who I was. I contacted her, and she agreed to be my supervisor, and gave me this doctoral proposal. She then put me in touch with Dr Djamal Belazzougui, who is a specialist in this field, and he in turn put us in touch with Dr Mathieu Raffinot. So, all these people contributed to the advancement of my thesis with their help, their advice, and their directives and I would like to extend my thanks to them all.

I remember when I was in high school in my last year, I asked my Arabic language teacher (Mr. Taibbi), what his thoughts were with regards to the new LMD system (Licence, Master, Doctorate). He replied that it is a good system, and he advised me to engage in it, and he told me a phrase that has remained with  me since: إذهب و أدرس حتى تتدكتر (Idhhab Wadrosse Hatta Tatadakktare) ”Go and study until you become a Doctor’‘. I think he influenced me greatly.  I finished my studies and got my Ph.D. Thanks to my teacher Mr. Taibbi.

Why is this research important?
Approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. We find it in several application areas, such as search engines, spelling correction in text editors (spell checker), in OCR systems, in cryptography, in cleaning of online data,  databases, autocomplete systems, etc. One important application area is in the field of bioinformatics where it allows for searching biological sequences, to assemble and align them. In general, approximate pattern matching is used in all domains that process a set of data which is constituted and defined on a given alphabet Σ.

What is your contribution to your field of research?
My contribution to the problem of approximate string matching is in three complementary directions:

  1. The approximate string matching in the dictionary. We proposed two solutions to this problem, the first one uses hash tables for k ≥ 2, the second uses a bidirectional data structure (the Trie and reverse Trie), and it is restricted to (k = 1). The two solutions are adaptable, without loss of performance, to the approximate string matching in a text.
  2. The approximate string matching for autocompletion, which is concerned with finding all suffixes of a given prefix that may contain errors. We gave a new solution that is better in practice than previously proposed solutions.
  3. The problem of the alignment of biological sequences can be interpreted as an approximate string matching problem. We proposed a solution for peers and multiple sequences alignment.

All the results obtained showed that our algorithms give the best performance on sets of practical data benchmarked from the real world. All our methods are released as libraries, and they are available online. Here is the links:

About the Author

Ibrahim Chegrane author

Ibrahim Chegrane obtained his PhD from the University of Sciences and Technology Houari Boumediene (USTHB). Recently, he joined the Research Centre on Scientific and Technical Information (CERIST) to work as a researcher. His work focuses on Pattern matching, in his PhD Thesis he focuses on the particular topic of Approximate string matching, in his current work at CERIST he focuses more on the topic of Graph Matching.

1 Comment so far

Bersi MohandPosted on  2:23 pm - Mar 12, 2017

Great my brother

Leave a Reply