
Searching for Search


Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing.

  - Wikipedia

Text search is a critical part of any company's success, and even more so in the Information Governance arena. Indexing hundreds of billions of documents and handling millions of complex queries day in and day out is the order of the day for Compliance and Legal discovery teams. The ability to get to the documents you need at lightning speed is essential for effective governance.

Large-scale search engine deployments based on Lucene or its offshoots Solr and Elasticsearch are on the rise. The index built by these APIs and frameworks is an "inverted index" - the most familiar example is the index at the back of a book, where every listed word is associated with the pages on which it appears.
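To make the idea concrete, here is a toy sketch of an inverted index in plain Java (illustration only, not Lucene code): each term maps to the sorted set of document ids - the "posting list" - in which it occurs.

import java.util.*;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // Two tiny "documents", already tokenized into terms.
        String[][] docs = { { "quick", "brown", "fox" }, { "the", "brown", "cow" } };

        // term -> sorted set of document ids containing that term.
        Map<String, SortedSet<Integer>> index = new TreeMap<String, SortedSet<Integer>>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String term : docs[docId]) {
                SortedSet<Integer> postings = index.get(term);
                if (postings == null) {
                    postings = new TreeSet<Integer>();
                    index.put(term, postings);
                }
                postings.add(docId);
            }
        }

        // Like looking up a word in a book's back-of-the-book index.
        System.out.println(index.get("brown")); // [0, 1]
    }
}

Real engines go much further - compressed posting lists, term frequencies, and positions for phrase queries - but the lookup shape is the same.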


There are various driving factors for an effective index design:

a. Indexing speed
b. Index compactness
c. Search speed
d. Latency, which influences near real-time search
e. Support for complex search patterns
f. Operational aspects: manageability, scalability, availability, mergeability, search within and across indices, etc.

By design, a Lucene index is a combination of many segments, and each segment is an index in itself, which lends itself to parallelism. A Lucene index can be created in memory (RAMDirectory) or in a file system directory (FSDirectory). On disk, the index data structure is a set of operating-system files in the destination directory.
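As a minimal sketch of the two Directory choices (using the Lucene 4.x API that the sample below uses; the /tmp/my-index path is just an example):

import java.io.File;
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class DirectoryChoice {
    public static void main(String[] args) throws IOException {
        // In-memory index: fast, but gone when the JVM exits.
        Directory ramDir = new RAMDirectory();

        // On-disk index: Lucene writes the segment files (.si, .fdt, .tim, ...)
        // into this directory; FSDirectory.open picks a suitable implementation
        // (e.g., memory-mapped) for the platform.
        Directory fsDir = FSDirectory.open(new File("/tmp/my-index"));

        ramDir.close();
        fsDir.close();
    }
}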

To create a search index, Lucene provides the following classes:

a. Directory - index destination
b. IndexWriter - that takes in an Analyzer
c. Analyzer - that analyzes the tokens as needed by the application
d. Document - a collection of Fields
e. Field - a named unit of content that gets indexed or stored or both

To customize the behavior of index creation, the following can be extended or manipulated:

a. Directory - to manipulate the index destination
b. Custom readers, tokenizers and analyzers / PerFieldAnalyzerWrapper - to manipulate the token stream that gets indexed (see the sketch after this list)
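For illustration, here is one plausible shape for a custom analyzer such as the EmailAnalyzer used in the sample program below. This is a hypothetical sketch, not the actual implementation: it leans on Lucene 4.5's UAX29URLEmailTokenizer so that e-mail addresses survive tokenization as single tokens.

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical sketch - the real EmailAnalyzer may differ.
public class EmailAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Keeps alex@zl.com as one token instead of splitting at '@' and '.'.
        Tokenizer source = new UAX29URLEmailTokenizer(Version.LUCENE_45, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_45, source);
        return new TokenStreamComponents(source, filter);
    }
}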

To read an index, Lucene provides the following classes:

a. Directory - index location
b. IndexReader
c. Query - supports various query constructs such as Term, Phrase, Fuzzy, Proximity, Wildcard, etc. (a couple of these are sketched after this list)
d. IndexSearcher, TopDocs, ScoreDoc
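The sample program below exercises prefix, term, and wildcard queries; as a hedged sketch against the same 4.5 API, phrase (with proximity via slop) and fuzzy queries look like this (the "body" field name is illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;

public class QueryTypes {
    public static void main(String[] args) {
        // Phrase query: terms must appear next to each other...
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term("body", "information"));
        phrase.add(new Term("body", "governance"));
        // ...or within 'slop' positions of each other (proximity search).
        phrase.setSlop(2);

        // Fuzzy query: matches terms within a small edit distance,
        // so the typo "governence" still finds "governance".
        Query fuzzy = new FuzzyQuery(new Term("body", "governence"));

        System.out.println(phrase);
        System.out.println(fuzzy);
    }
}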

Sample code snippet:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class Main {

    private final Version version = Version.LUCENE_45;

    public void indexAndSearch() throws IOException {
        // In-memory index destination.
        Directory index = new RAMDirectory();

        // Analyze the "email" and "tel" fields with custom analyzers
        // (EmailAnalyzer and TelAnalyser are application-specific classes);
        // every other field falls back to StandardAnalyzer.
        Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
        analyzerPerField.put("email", new EmailAnalyzer());
        analyzerPerField.put("tel", new TelAnalyser(version));
        PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(version), analyzerPerField);

        // Create (or overwrite) the index and add one document.
        IndexWriterConfig config = new IndexWriterConfig(version, analyzer)
                .setOpenMode(OpenMode.CREATE);
        IndexWriter writer = new IndexWriter(index, config);
        Document doc = new Document();
        doc.add(new TextField("login", "user1", Store.YES));
        doc.add(new TextField("email", "alex@zl.com", Store.YES));
        doc.add(new TextField("email", "baron@zl.com", Store.YES));
        doc.add(new TextField("tel", "+1(510)-713-2215", Store.YES));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        int limit = 20;

        try (IndexReader reader = DirectoryReader.open(index)) {
            // Prefix query: matches "alex@zl.com".
            Query query = new PrefixQuery(new Term("email", "ale"));
            hits(limit, query, reader);

            // Term query against the token produced by the custom tel analyzer.
            query = new TermQuery(new Term("tel", "510713215"));
            hits(limit, query, reader);

            // Wildcard query takes a Term directly.
            query = new WildcardQuery(new Term("login", "u*"));
            hits(limit, query, reader);
        }

        index.close();
    }

    private void hits(final int limit, final Query query, final IndexReader reader)
            throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(query, limit);
        System.out.println(docs.totalHits + " found for query: " + query);
        for (final ScoreDoc scoreDoc : docs.scoreDocs) {
            System.out.println(searcher.doc(scoreDoc.doc));
        }
    }

    public static void main(final String[] args) throws IOException {
        new Main().indexAndSearch();
    }
}

Some useful references:

https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
https://www.ibm.com/developerworks/library/os-apache-lucenesearch/
https://citrine.io/2015/02/15/building-a-custom-analyzer-in-lucene/
http://lucene.472066.n3.nabble.com/Lucene-4-0-PerFieldAnalyzerWrapper-question-td4010355.html
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene
https://engineering.linkedin.com/search/did-you-mean-galene
http://solr-vs-elasticsearch.com/
https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html
http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/
http://www.ideaeng.com/where-all-filters-gone-0403
http://crd-legacy.lbl.gov/~kewu/ps/LBNL-59952.pdf
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
https://web.stanford.edu/class/cs276/handouts/Lucene-1-per-page.pdf   (***)
http://opensourceconnections.com/blog/2013/02/21/lucene-4-finite-state-automaton-in-10-minutes-intro-tutorial/
Lucene Intro
http://epaperpress.com/sortsearch/download/skiplist.pdf
