Mavuno

A Hadoop-Based Text Mining Toolkit

 

supported input formats

applications

pattern/context extractors

nlp tool support

installation

Download Mavuno:


1. Unpack the Mavuno archive or clone the Mavuno GitHub repository.


To compile the code, build the Mavuno jar, and set up the Hadoop classpath:


2. Run ant jar from the Mavuno directory to compile the code and build the Mavuno jar.

3. Add the jars in the lib directory to the Hadoop classpath. The easiest way to do this is to copy the jars to your $HADOOP_HOME/lib/ directory.
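
Steps 2 and 3 can be scripted roughly as follows. This is a minimal sketch, not part of Mavuno: the function names and the directory argument are illustrative, and it assumes that ant is on your PATH and that HADOOP_HOME points at your Hadoop installation.

```shell
# Sketch of steps 2 and 3. The function names are illustrative, not part of Mavuno.

# Step 2: compile the code and build the Mavuno jar.
# $1 = path to the unpacked or cloned Mavuno directory.
build_mavuno() {
  (cd "$1" && ant jar)
}

# Step 3: copy the jars in lib/ into Hadoop's lib directory.
# Assumes HADOOP_HOME is set; aborts with a message otherwise.
install_mavuno_libs() {
  : "${HADOOP_HOME:?HADOOP_HOME must be set}"
  cp "$1"/lib/*.jar "$HADOOP_HOME/lib/"
}
```

From the Mavuno directory, the two steps would then be run as build_mavuno . followed by install_mavuno_libs .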


To run Mavuno applications:


4. Run hadoop jar ivory-VERSION.jar APPCLASS OPTIONS


where VERSION is the current version of Mavuno, APPCLASS is the class of the application to run, and OPTIONS are any command-line options the application requires.
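
As a concrete illustration, the sketch below assembles this command for HarvestSentences, one of the applications listed in this document. The helper name print_mavuno_cmd and the version number 1.0 are placeholders, and the fully qualified class name is an assumption (package name plus application name). The command is printed rather than executed so that it can be checked first.

```shell
# Print the command for running a Mavuno application (a dry run).
# print_mavuno_cmd is an illustrative helper, not part of Mavuno.
# $1 = VERSION, $2 = APPCLASS, remaining arguments = OPTIONS.
print_mavuno_cmd() {
  version=$1
  appclass=$2
  shift 2
  # Rebuild the argument list in the documented order, then join with spaces.
  set -- hadoop jar "ivory-${version}.jar" "$appclass" "$@"
  echo "$*"
}

# Placeholder version; class name assumed to be package + application name.
print_mavuno_cmd 1.0 edu.isi.mavuno.app.mine.HarvestSentences
```

Once the printed command looks right, it can be run directly from the directory that contains the jar.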

pattern/context extractors

Applicable to all text documents:


  - CooccurExtractor

  - PassageExtractor

  - NGramExtractor


Applicable to NLP processed documents:


  - ChunkExtractor

  - DIRTExtractor

  - NAryChunkExtractor


Applicable to Twitter JSON documents:


  - TwitterCooccurExtractor

  - TwitterGeoTemporalExtractor


Other extractors:


  - MultiExtractor - Allows multiple extractors to be applied per document.

supported input formats

  - ClueWarcInputFormat (ClueWeb)

  - LineInputFormat (one input per line)

  - TextFileInputFormat (text files)

  - TrecInputFormat (TREC-style documents)

  - TwitterInputFormat (Twitter JSON)

applications

Basic Text Mining (edu.isi.mavuno.app.mine):


  - HarvestSentences

  - HarvestContextPatternPairs

  - HarvestParaphraseCandidates


Distributional Similarity (edu.isi.mavuno.app.distsim):


  - ComputeContextScores

  - ComputePatternScores

  - ContextToContext

  - ContextToPattern

  - PatternToContext

  - PatternToPattern


NLP (edu.isi.mavuno.app.nlp):


  - HarvestParseGraph

  - ProcessStanfordNLP

  - TratzParse


Information Extraction (edu.isi.mavuno.app.ie):


  - ExtractRelations

  - HarvestEspressoContexts

  - HarvestEspressoPatterns

  - HarvestSAPInstances

  - HarvestUDAPInstances


Utilities (edu.isi.mavuno.app.util):


  - ExamplesToSequenceFile

  - SequenceFileToText

application parameters

Some applications require one or more parameters (e.g., input and output paths). Mavuno supports two ways of specifying them:


Specifying parameters on the command line:


-PARAMETER=VALUE


where PARAMETER is the name of the parameter/option and VALUE is the desired value.
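
For instance, an application that takes an input path and an output path might be given options like the ones below. The parameter names InputPath and OutputPath and both paths are hypothetical, chosen only to show the format; the names each application actually expects are not listed here.

```shell
# Hypothetical parameter names and paths, for illustration only.
INPUT_OPT="-InputPath=/data/corpus"
OUTPUT_OPT="-OutputPath=/data/output"

# Each option follows the -PARAMETER=VALUE form; together they make up OPTIONS.
OPTIONS="$INPUT_OPT $OUTPUT_OPT"
echo "$OPTIONS"
```

The resulting string takes the place of OPTIONS in the hadoop jar command.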


Specifying parameters with parameter files:


Each line of a parameter file takes the following form:


PARAMETER [TAB] VALUE


where PARAMETER is the name of the parameter/option, VALUE is the desired value, and [TAB] denotes the tab ("\t") character.
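
Because a tab is easy to mistype as spaces in an editor, it can be safer to generate the file. The sketch below writes a two-line parameter file with printf, whose \t escape emits a literal tab. The file location, the parameter names InputPath and OutputPath, and the paths are all hypothetical.

```shell
# Write a two-line parameter file in the PARAMETER [TAB] VALUE form.
# The file name, parameter names, and paths are hypothetical.
PARAM_FILE="${TMPDIR:-/tmp}/mavuno-params.txt"

# printf reuses the format string for each PARAMETER/VALUE pair of arguments.
printf '%s\t%s\n' \
  "InputPath" "/data/corpus" \
  "OutputPath" "/data/output" > "$PARAM_FILE"

# Print the second (VALUE) column to confirm the file is tab-separated.
cut -f 2 "$PARAM_FILE"
```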

javadoc

The Mavuno javadoc is available here.