
Searching for Search


Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.

  - Wikipedia

Text search is a critical part of any company's success, and even more so in the Information Governance arena. Indexing hundreds of billions of documents and handling millions of complex queries day in and day out is the order of the day for Compliance and Legal discovery teams. The ability to get to the documents you need at lightning speed is essential for effective governance.

Large-scale search engine deployments based on Lucene or its offshoots Solr and Elasticsearch are on the rise. The index built by these APIs and frameworks is an "inverted index" - the most familiar example is the index at the back of a book, where every listed word is associated with the pages on which it appears.
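To make the idea concrete, here is a toy sketch of an inverted index in plain Java (illustration only, not Lucene code): each term maps to the sorted set of document ids - the "posting list" - in which it occurs.

import java.util.*;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Two tiny "documents", already tokenized into terms.
        String[][] docs = { { "quick", "brown", "fox" }, { "the", "brown", "cow" } };

        // term -> sorted set of document ids containing that term.
        Map<String, SortedSet<Integer>> index = new TreeMap<String, SortedSet<Integer>>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                SortedSet<Integer> postings = index.get(term);
                if (postings == null) {
                    postings = new TreeSet<Integer>();
                    index.put(term, postings);
                }
                postings.add(docId);
            }
        }

        // Like looking up a word in a book's back-of-the-book index.
        System.out.println(index.get("brown")); // [0, 1]
    }
}

Real engines go much further - compressed posting lists, term frequencies, and positions for phrase queries - but the lookup shape is the same.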


There are various driving factors for an effective index design:

a. Indexing speed
b. Index compactness
c. Search speed
d. Latency, which influences near real-time search
e. Support for complex search patterns
f. Operational aspects: manageability, scalability, availability, mergeability, search within and across indices, etc.

By design, a Lucene index is a combination of many segments, and each segment is an index in itself, which lends itself to parallelism. A Lucene index can be created in memory (RAMDirectory) or in a file system directory (FSDirectory). On disk, the index data structure is a set of operating-system files in the destination directory.
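As a minimal sketch of the two Directory choices (using the Lucene 4.x API that the sample below uses; the /tmp/my-index path is just an example):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws IOException {
        // In-memory index: fast, but gone when the JVM exits.
        Directory ramDir = new RAMDirectory();

        // On-disk index: Lucene writes the segment files (.si, .fdt, .tim, ...)
        // into this directory; FSDirectory.open picks a suitable implementation
        // (e.g., memory-mapped) for the platform.
        Directory fsDir = FSDirectory.open(new File("/tmp/my-index"));

        ramDir.close();
        fsDir.close();
    }
}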

To create a search index, Lucene provides the following classes:

a. Directory - index destination
b. IndexWriter - that takes in an Analyzer
c. Analyzer - that analyzes the tokens as needed by the application
d. Document - a collection of Fields
e. Field - a named unit of content that gets indexed or stored or both

To customize the behavior of index creation, the following can be extended or manipulated:

a. Directory - to manipulate the index destination
b. Custom readers, tokenizers and analyzers / PerFieldAnalyzerWrapper - to manipulate the token stream that gets indexed (see the sketch after this list)
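For illustration, here is one plausible shape for a custom analyzer such as the EmailAnalyzer used in the sample program below. This is a hypothetical sketch, not the actual implementation: it leans on Lucene 4.5's UAX29URLEmailTokenizer so that e-mail addresses survive tokenization as single tokens.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical sketch - the real EmailAnalyzer may differ.
public class EmailAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Keeps alex@zl.com as one token instead of splitting at '@' and '.'.
        Tokenizer source = new UAX29URLEmailTokenizer(Version.LUCENE_45, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_45, source);
        return new TokenStreamComponents(source, filter);
    }
}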

To read an index, Lucene provides the following classes:

a. Directory - index location
b. IndexReader
c. Query - supports various query constructs such as Term, Phrase, Fuzzy, Proximity, Wildcard, etc. (a couple of these are sketched after this list)
d. IndexSearcher, TopDocs, ScoreDoc
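The sample program below exercises prefix, term, and wildcard queries; as a hedged sketch against the same 4.5 API, phrase (with proximity via slop) and fuzzy queries look like this (the "body" field name is illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class QueryTypes {
    public static void main(String[] args) {
        // Phrase query: terms must appear next to each other...
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("body", "information"));
        phrase.add(new Term("body", "governance"));
        // ...or within 'slop' positions of each other (proximity search).
        phrase.setSlop(2);

        // Fuzzy query: matches terms within a small edit distance,
        // so the typo "governence" still finds "governance".
        Query fuzzy = new FuzzyQuery(new Term("body", "governence"));

        System.out.println(phrase);
        System.out.println(fuzzy);
    }
}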

Sample code snippet:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Main {

    private final Version version = Version.LUCENE_45;

    public void indexAndSearch() throws IOException {
        // In-memory index destination.
        Directory index = new RAMDirectory();

        // Analyze the "email" and "tel" fields with custom analyzers
        // (EmailAnalyzer and TelAnalyser are application-specific classes);
        // every other field falls back to StandardAnalyzer.
        Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
        analyzerPerField.put("email", new EmailAnalyzer());
        analyzerPerField.put("tel", new TelAnalyser(version));
        PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(version), analyzerPerField);

        // Create (or overwrite) the index and add one document.
        IndexWriterConfig config = new IndexWriterConfig(version, analyzer)
                .setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(index, config);
        Document doc = new Document();
        doc.add(new TextField("login", "user1", Store.YES));
        doc.add(new TextField("email", "alex@zl.com", Store.YES));
        doc.add(new TextField("email", "baron@zl.com", Store.YES));
        doc.add(new TextField("tel", "+1(510)-713-2215", Store.YES));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        int limit = 20;

        try (IndexReader reader = DirectoryReader.open(index)) {
            // Prefix query: matches "alex@zl.com".
            Query query = new PrefixQuery(new Term("email", "ale"));
            hits(limit, query, reader);

            // Term query against the token produced by the custom tel analyzer.
            query = new TermQuery(new Term("tel", "510713215"));
            hits(limit, query, reader);

            // Wildcard query takes a Term directly.
            query = new WildcardQuery(new Term("login", "u*"));
            hits(limit, query, reader);
        }

        index.close();
    }

    private void hits(final int limit, final Query query, final IndexReader reader)
            throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(query, limit);
        System.out.println(docs.totalHits + " found for query: " + query);
        for (final ScoreDoc scoreDoc : docs.scoreDocs) {
            System.out.println(searcher.doc(scoreDoc.doc));
        }
    }

    public static void main(final String[] args) throws IOException {
        new Main().indexAndSearch();
    }
}

Some useful references:

https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
https://www.ibm.com/developerworks/library/os-apache-lucenesearch/
https://citrine.io/2015/02/15/building-a-custom-analyzer-in-lucene/
http://lucene.472066.n3.nabble.com/Lucene-4-0-PerFieldAnalyzerWrapper-question-td4010355.html
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene
https://engineering.linkedin.com/search/did-you-mean-galene
http://solr-vs-elasticsearch.com/
https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html
http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/
http://www.ideaeng.com/where-all-filters-gone-0403
http://crd-legacy.lbl.gov/~kewu/ps/LBNL-59952.pdf
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
https://web.stanford.edu/class/cs276/handouts/Lucene-1-per-page.pdf   (***)
http://opensourceconnections.com/blog/2013/02/21/lucene-4-finite-state-automaton-in-10-minutes-intro-tutorial/
Lucene Intro
http://epaperpress.com/sortsearch/download/skiplist.pdf
