List

I pro­vide a basic index­ing and retrieval code using the PyLucene 3.0 API.Lucene In Action (2nd Ed) cov­ers Lucene 3.0, but the PyLucene code sam­ples for have not been updated for the 3.0 API, only the Java ones. Unfor­tu­nately, there is cur­rently lit­tle (no?) exam­ple PyLucene code in blo­gos­phere. If you have links to more Lucene 3.0 tuto­ri­als and sam­ples, please share them in the comments.

SAM­PLE PYLUCENE 3.0 CODE

In the spirit of Lingpipe’s Lucene 2.4 in 60 sec­onds, here are rel­e­vant PyLucene 3.0 code snip­pets from my biased-text-sample project, for index­ing and retrieval.

INDEX­ING

import lucene
from lucene import 
    SimpleFSDirectory, System, File, 
    Document, Field, StandardAnalyzer, IndexWriter, Version

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    writer = IndexWriter(dir, analyzer, True, IndexWriter.MaxFieldLength(512))

    print >> sys.stderr, "Currently there are %d documents in the index..." % writer.numDocs()

    print >> sys.stderr, "Reading lines from sys.stdin..."
    for l in sys.stdin:
        doc = Document()
        doc.add(Field("text", l, Field.Store.YES, Field.Index.ANALYZED))
        writer.addDocument(doc)

    print >> sys.stderr, "Indexed lines from stdin (%d documents in index)" % (writer.numDocs())
    print >> sys.stderr, "About to optimize index of %d documents..." % writer.numDocs()
    writer.optimize()
    print >> sys.stderr, "...done optimizing index of %d documents" % writer.numDocs()
    print >> sys.stderr, "Closing index of %d documents..." % writer.numDocs()
    writer.close()
    print >> sys.stderr, "...done closing index of %d documents" % writer.numDocs()

RETRIEVAL

import lucene
from lucene import 
    SimpleFSDirectory, System, File, 
    Document, Field, StandardAnalyzer, IndexSearcher, Version, QueryParser

if __name__ == "__main__":
    lucene.initVM()
    indexDir = "/Tmp/REMOVEME.index-dir"
    dir = SimpleFSDirectory(File(indexDir))
    analyzer = StandardAnalyzer(Version.LUCENE_30)
    searcher = IndexSearcher(dir)

    query = QueryParser(Version.LUCENE_30, "text", analyzer).parse("Find this sentence please")
    MAX = 1000
    hits = searcher.search(query, MAX)

    print "Found %d document(s) that matched query '%s':" % (hits.totalHits, query)

    for hit in hits.scoreDocs:
        print hit.score, hit.doc, hit.toString()
        doc = searcher.doc(hit.doc)
        print doc.get("text").encode("utf-8")

 

The post PYLUCENE 3.0 IN 60 SECONDS — TUTORIAL and SAMPLE CODE appeared first on The Big Data Blog.

Source: PYLUCENE 3.0 IN 60 SECONDS — TUTORIAL and SAMPLE CODE

Leave a Reply

Your email address will not be published. Required fields are marked *

  Posts

1 2 3
February 17th, 2016

Kaggle Competition Past Winner Solutions

We learn more from code, and from great code. Not necessarily always the 1st ranking solution, because we also learn […]

February 7th, 2016

Installing Kafka on Mac OSX

Apache Kafka is a highly-scalable publish-subscribe messaging system that can serve as the data backbone in distributed applications. With Kafka’s […]

February 5th, 2016

Lucene In-Memory Search Example and Sample Code

More sample code: https://github.com/fnp/pylucene/tree/master/samples/LuceneInAction  Sample code import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.queryParser.ParseException; import org.apache.lucene.queryParser.QueryParser; import org.apache.lucene.search.*; […]

February 5th, 2016

PYLUCENE 3.0 IN 60 SECONDS — TUTORIAL and SAMPLE CODE

I pro­vide a basic index­ing and retrieval code using the PyLucene 3.0 API.Lucene In Action (2nd Ed) cov­ers Lucene 3.0, but […]

January 29th, 2016

NiFi: Thinking Differently About DataFlow

Recently a question was posed to the Apache NiFi (Incubating) Developer Mailing List about how best to use Apache NiFi […]

January 29th, 2016

Apache Nifi (aka HDF) data flow across data center

Short Description: This article provides a step by step overview of how to setup cross data center data flow using […]

January 24th, 2016

Accurately Measuring Model Prediction Error

When assessing the quality of a model, being able to accurately measure its prediction error is of key importance. Often, […]

January 9th, 2016

TIME SERIES FORECASTING – TAKING KAGGLE ROSSMANN CHALLENGE AS EXAMPLE

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Forecasting […]

January 7th, 2016

Getting Started with Markov Chains

There are number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting […]

December 26th, 2015

Hadoop filesystem at Twitter

Twitter runs multiple large Hadoop clusters that are among the biggest in the world. Hadoop is at the core of […]