Search
Text search is critical to any company's success, and even more so in the Information Governance arena. Indexing hundreds of billions of documents and handling millions of complex queries day in and day out is routine for Compliance and Legal discovery teams. The ability to reach the needed documents quickly is essential for effective governance.
"Information retrieval is the activity of obtaining information resources (in the form of documents) relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text (or other content-based) indexing."
- Wikipedia
Large-scale search engine deployments based on Lucene or its offshoots Solr and Elasticsearch are on the rise. The indexes built by these APIs and frameworks are "inverted indexes". The most familiar example is the index at the back of a book, where every listed word is associated with the pages on which it appears.
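To make the book-index analogy concrete, here is a minimal sketch of an inverted index in plain Java. It is not how Lucene stores its index; the class name InvertedIndexSketch, the simple lowercase/split tokenization and the integer document ids are assumptions made purely for illustration.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy inverted index, not Lucene's on-disk format.
class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain it (the "postings")
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Lowercase the text, split it on non-alphanumeric characters, record each term.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Look a term up the way a reader looks a word up in a book's index.
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Legal hold notice");
        index.add(2, "Compliance review of the legal archive");
        System.out.println(index.lookup("legal"));   // ids of both documents
        System.out.println(index.lookup("archive")); // id of the second document only
    }
}
```

The lookup cost depends on the number of distinct terms, not on the number of documents, which is why this structure suits full-text search.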
There are various driving factors for an effective index design:
a. Indexing speed
b. Index compactness
c. Search speed
d. Latency, which determines how close to real time search can be
e. Support for complex search patterns
f. Operational aspects: manageability, scalability, availability, mergeability, and searching within and across indices
By design, a Lucene index is a combination of many segments, and each segment is an index in itself, which lends itself to parallelism. A Lucene index can be created in memory (RAMDirectory) or in a file system directory (FSDirectory). On disk, an index is a set of operating system files in that destination directory.
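The segment idea can be sketched in plain Java as follows. Each toy segment is a small self-contained inverted index, and a search fans out over all segments in parallel and merges the hits. The names Segment and SegmentedIndexSketch are hypothetical, and Lucene's real segments are considerably more involved than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.stream.Collectors;

// Illustrative only: hypothetical classes, not Lucene's segment implementation.
class SegmentedIndexSketch {

    // One segment: a complete, independent term -> document ids index.
    static final class Segment {
        private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

        void add(int docId, String... terms) {
            for (String term : terms) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }

        SortedSet<Integer> search(String term) {
            return postings.getOrDefault(term, Collections.emptySortedSet());
        }
    }

    private final List<Segment> segments = new ArrayList<>();

    void addSegment(Segment segment) {
        segments.add(segment);
    }

    // Query every segment independently (here via a parallel stream) and merge the hits.
    SortedSet<Integer> search(String term) {
        return segments.parallelStream()
                .flatMap(segment -> segment.search(term).stream())
                .collect(Collectors.toCollection(TreeSet::new));
    }

    public static void main(String[] args) {
        Segment first = new Segment();
        first.add(1, "lucene", "index");
        Segment second = new Segment();
        second.add(2, "lucene", "segment");

        SegmentedIndexSketch index = new SegmentedIndexSketch();
        index.addSegment(first);
        index.addSegment(second);
        System.out.println(index.search("lucene")); // hits from both segments, merged
    }
}
```

Because no segment depends on another, the per-segment searches need no coordination until the final merge step.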
To create a search index, Lucene provides the following classes:
a. Directory - index destination
b. IndexWriter - writes documents to the index and is configured with an Analyzer
c. Analyzer - turns field text into tokens as needed by the application
d. Document - a collection of Fields (each stored, indexed, or both)
e. Field - a named value that is indexed, stored, or both
To customize the behavior of index creation, the following can be extended or manipulated:
a. Directory - to manipulate the index destination
b. Custom readers, tokenizers and analyzers / PerFieldAnalyzerWrapper - to manipulate the token stream that gets indexed
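The effect of choosing a different analyzer per field can be sketched without Lucene. In this plain-Java sketch each field name maps to a tokenizing function, with a default for unlisted fields, which is roughly the role PerFieldAnalyzerWrapper plays. The class name PerFieldAnalysisSketch and the tokenization rules (keep an email address as one lowercase token, reduce a phone number to its digits) are assumptions for illustration, not the behavior of any Lucene analyzer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Illustrative only: hypothetical tokenizers, not Lucene analyzers.
class PerFieldAnalysisSketch {
    private final Map<String, Function<String, List<String>>> perField = new HashMap<>();
    private final Function<String, List<String>> fallback;

    PerFieldAnalysisSketch(Function<String, List<String>> fallback) {
        this.fallback = fallback;
    }

    void register(String field, Function<String, List<String>> tokenizer) {
        perField.put(field, tokenizer);
    }

    // Pick the tokenizer registered for the field, or the default one.
    List<String> analyze(String field, String value) {
        return perField.getOrDefault(field, fallback).apply(value);
    }

    // Default: lowercase and split on anything that is not a letter or digit.
    static List<String> words(String value) {
        return Arrays.stream(value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Assumed rule: keep the whole email address as a single lowercase token.
    static List<String> email(String value) {
        return Collections.singletonList(value.trim().toLowerCase(Locale.ROOT));
    }

    // Assumed rule: index a phone number as its digits only.
    static List<String> digits(String value) {
        return Collections.singletonList(value.replaceAll("[^0-9]", ""));
    }

    public static void main(String[] args) {
        PerFieldAnalysisSketch analysis = new PerFieldAnalysisSketch(PerFieldAnalysisSketch::words);
        analysis.register("email", PerFieldAnalysisSketch::email);
        analysis.register("tel", PerFieldAnalysisSketch::digits);

        System.out.println(analysis.analyze("email", "alex@zl.com"));      // [alex@zl.com]
        System.out.println(analysis.analyze("tel", "+1(510)-713-2215"));   // [15107132215]
        System.out.println(analysis.analyze("login", "User1 Admin"));      // [user1, admin]
    }
}
```

The same text can therefore produce very different index terms depending on the field, and a query only matches if it is expressed in the terms the field's analyzer produced.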
To read an index, Lucene provides the following classes:
a. Directory - index destination
b. IndexReader
c. Query - supports various query constructs such as Term, Phrase, Fuzzy, Proximity and Wildcard
d. IndexSearcher, TopDocs, ScoreDoc
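How term, prefix and wildcard queries differ can be sketched against a sorted term dictionary in plain Java. The class QuerySketch is hypothetical and only mimics the matching semantics; Lucene's TermQuery, PrefixQuery and WildcardQuery are implemented differently and also score their hits.

```java
import java.util.Collections;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Illustrative only: mimics query matching semantics, not Lucene's implementation.
class QuerySketch {
    // Sorted term dictionary: term -> ids of the documents containing it.
    private final NavigableMap<String, SortedSet<Integer>> terms = new TreeMap<>();

    void add(int docId, String term) {
        terms.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
    }

    // Term query: a single exact dictionary lookup.
    SortedSet<Integer> term(String term) {
        return terms.getOrDefault(term, Collections.emptySortedSet());
    }

    // Prefix query: terms sharing a prefix are contiguous in sorted order,
    // so scan from the prefix and stop at the first term that no longer matches.
    SortedSet<Integer> prefix(String prefix) {
        SortedSet<Integer> hits = new TreeSet<>();
        for (Map.Entry<String, SortedSet<Integer>> entry : terms.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break;
            }
            hits.addAll(entry.getValue());
        }
        return hits;
    }

    // Wildcard query: '*' matches any run of characters, '?' exactly one.
    // This sketch simply tests every term in the dictionary.
    SortedSet<Integer> wildcard(String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        Pattern compiled = Pattern.compile(regex.toString(), Pattern.DOTALL);
        SortedSet<Integer> hits = new TreeSet<>();
        terms.forEach((term, docs) -> {
            if (compiled.matcher(term).matches()) {
                hits.addAll(docs);
            }
        });
        return hits;
    }

    public static void main(String[] args) {
        QuerySketch index = new QuerySketch();
        index.add(1, "alex@zl.com");
        index.add(2, "alice@zl.com");
        index.add(3, "baron@zl.com");

        System.out.println(index.term("baron@zl.com")); // [3]
        System.out.println(index.prefix("al"));         // [1, 2]
        System.out.println(index.wildcard("*@zl.com")); // [1, 2, 3]
        System.out.println(index.wildcard("a?ex*"));    // [1]
    }
}
```

The sketch also hints at relative cost: an exact term is one lookup, a prefix is a bounded range scan, and a wildcard with a leading `*` has to consider every term.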
Sample code snippet:
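import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.WildcardQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Written against the Lucene 4.5 API. EmailAnalyzer and TelAnalyser are custom
// analyzers that are not part of Lucene and are not shown here.
public class Main {
    private final Version version = Version.LUCENE_45;

    public void indexAndSearch() throws IOException {
        // In-memory index destination.
        Directory index = new RAMDirectory();

        // Use a dedicated analyzer for "email" and "tel", StandardAnalyzer for everything else.
        Map<String, Analyzer> analyzerPerField = new HashMap<String, Analyzer>();
        analyzerPerField.put("email", new EmailAnalyzer());
        analyzerPerField.put("tel", new TelAnalyser(version));
        PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
                new StandardAnalyzer(version), analyzerPerField);
        IndexWriterConfig config = new IndexWriterConfig(version, analyzer)
                .setOpenMode(OpenMode.CREATE);

        // Index a single document; the "email" field holds two values.
        IndexWriter writer = new IndexWriter(index, config);
        Document doc = new Document();
        doc.add(new TextField("login", "user1", Store.YES));
        doc.add(new TextField("email", "alex@zl.com", Store.YES));
        doc.add(new TextField("email", "baron@zl.com", Store.YES));
        doc.add(new TextField("tel", "+1(510)-713-2215", Store.YES));
        writer.addDocument(doc);
        writer.commit();
        writer.close();

        // Run a prefix, a term and a wildcard query against the index.
        int limit = 20;
        try (IndexReader reader = DirectoryReader.open(index)) {
            Query query = new PrefixQuery(new Term("email", "ale"));
            hits(limit, query, reader);
            query = new TermQuery(new Term("tel", "510713215"));
            hits(limit, query, reader);
            // WildcardQuery takes a Term directly; it cannot be wrapped in a TermQuery.
            query = new WildcardQuery(new Term("login", "u*"));
            hits(limit, query, reader);
        }
        index.close();
    }

    // Print the hit count and the stored fields of each matching document.
    private void hits(final int limit, final Query query, final IndexReader reader) throws IOException {
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs docs = searcher.search(query, limit);
        System.out.println(docs.totalHits + " found for query: " + query);
        for (final ScoreDoc scoreDoc : docs.scoreDocs) {
            System.out.println(searcher.doc(scoreDoc.doc));
        }
    }

    public static void main(final String[] args) throws IOException {
        new Main().indexAndSearch();
    }
}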
Some useful references:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
https://www.ibm.com/developerworks/library/os-apache-lucenesearch/
https://citrine.io/2015/02/15/building-a-custom-analyzer-in-lucene/
http://lucene.472066.n3.nabble.com/Lucene-4-0-PerFieldAnalyzerWrapper-question-td4010355.html
https://github.com/apache/lucene-solr/tree/master/lucene/core/src/java/org/apache/lucene
https://engineering.linkedin.com/search/did-you-mean-galene
http://solr-vs-elasticsearch.com/
https://nlp.stanford.edu/IR-book/html/htmledition/contents-1.html
http://blog.trifork.com/2012/01/22/query-time-joining-in-lucene/
http://www.ideaeng.com/where-all-filters-gone-0403
http://crd-legacy.lbl.gov/~kewu/ps/LBNL-59952.pdf
http://blog.mikemccandless.com/2010/12/using-finite-state-transducers-in.html
https://web.stanford.edu/class/cs276/handouts/Lucene-1-per-page.pdf (***)
http://opensourceconnections.com/blog/2013/02/21/lucene-4-finite-state-automaton-in-10-minutes-intro-tutorial/
Lucene Intro
http://epaperpress.com/sortsearch/download/skiplist.pdf