Mavuno

A Hadoop-Based Text Mining Toolkit

 

example #1: nlp processing of documents

example #2: sentence mining

example #3: distributional similarity

example #4: harvesting class instances (coming soon)

example #5: semantic relation learning (coming soon)

example #6: information extraction (coming soon)

Processing with Stanford Core NLP:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.nlp.ProcessStanfordNLP

-CorpusPath=data/wizard-of-oz.txt

-CorpusClass=edu.isi.mavuno.input.TextFileInputFormat

-OutputPath=data/wizard-of-oz-stanford -TextOutputFormat=true


Faster Processing with the Tratz Parser:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.nlp.TratzParse

-CorpusPath=data/wizard-of-oz.txt

-CorpusClass=edu.isi.mavuno.input.TextFileInputFormat

-OutputPath=data/wizard-of-oz-tratz -TextOutputFormat=true


By default, both ProcessStanfordNLP and TratzParse will write Hadoop SequenceFiles as output. To produce a text-based output format, set -TextOutputFormat=true.

Mavuno supports mining sentences that match patterns expressed as Java regular expressions. The file data/patterns.txt in the Mavuno distribution provides examples of the types of patterns that Mavuno supports.


The first two patterns in the file are:


yellow brick

kill(ed)?.*Witch.*(West|East)


These are lexical, or surface form, patterns that can be used to mine sentences from "plain text" input formats (e.g., TextFileInputFormat, TRECInputFormat, and LineInputFormat).
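The matching that HarvestSentences performs with these lexical patterns can be sketched in a few lines of Python (the sentences below are invented for illustration, and Java and Python regex syntax agree for these two patterns):

```python
import re

# The two lexical patterns from data/patterns.txt, used as-is.
patterns = [
    re.compile(r"yellow brick"),
    re.compile(r"kill(ed)?.*Witch.*(West|East)"),
]

# Toy sentences, not taken from the corpus.
sentences = [
    "They followed the road of yellow brick for some time.",
    "You killed the Wicked Witch of the East.",
    "Toto barked at the Lion.",
]

# A sentence is kept if any pattern matches somewhere inside it.
mined = [s for s in sentences if any(p.search(s) for p in patterns)]
```

Here the first two sentences are kept and the third is discarded.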


The other two patterns in the file are:


<sentence>\t1\tDorothy\t

\tPERSON\t.*\tLOCATION\t


These patterns match NLP processed sentences. The processed sentences have the following form:


<sentence>(\tPOSITION\tTERM\tCHAR_OFFSET_BEGIN\tCHAR_OFFSET_END\tLEMMA\tPOS_TAG\tCHUNK_TAG\tNER_TAG\tDEPEND_TYPE\tDEPEND_INDEX)*\t</sentence>


where:


  1. POSITION is the (1-based) position within the sentence

  2. TERM is the lexical form of the term

  3. CHAR_OFFSET_BEGIN is the character offset of the beginning of the term in the original text

  4. CHAR_OFFSET_END is the character offset of the end of the term in the original text

  5. LEMMA is the term's lemma

  6. POS_TAG is the term's part-of-speech tag

  7. CHUNK_TAG is the term's chunk tag

  8. NER_TAG is the term's named entity tag

  9. DEPEND_TYPE is the dependency type that exists between this term and the one at DEPEND_INDEX

  10. DEPEND_INDEX is the positional index of the current term's dependent


Documents processed by the Stanford NLP processor include one additional field per token: an integer identifying the coreference cluster that the term belongs to.
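A processed sentence in this format can be unpacked with a short script. The line below is hand-built to follow the format above (it is not actual parser output), and the helper name parse_sentence is ours:

```python
# Field names follow the format specification, in order.
FIELDS = ["position", "term", "char_offset_begin", "char_offset_end",
          "lemma", "pos_tag", "chunk_tag", "ner_tag",
          "depend_type", "depend_index"]

def parse_sentence(line):
    """Split one processed sentence into a list of per-token field dicts."""
    assert line.startswith("<sentence>") and line.endswith("</sentence>")
    body = line[len("<sentence>"):-len("</sentence>")]
    cols = body.strip("\t").split("\t")
    n = len(FIELDS)
    return [dict(zip(FIELDS, cols[i:i + n])) for i in range(0, len(cols), n)]

# A hand-built two-token sentence in the format above.
line = ("<sentence>"
        "\t1\tDorothy\t0\t7\tDorothy\tNNP\tB-NP\tPERSON\tnsubj\t2"
        "\t2\tlived\t8\t13\tlive\tVBD\tB-VP\tO\troot\t0"
        "\t</sentence>")
tokens = parse_sentence(line)
```

Each element of tokens then maps a field name (e.g., "ner_tag") to its string value for that token.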


To mine the "plain text" patterns from the wizard-of-oz.txt file, run the following:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.mine.HarvestSentences

-PatternPath=data/patterns.txt -CorpusPath=data/wizard-of-oz.txt

-CorpusClass=edu.isi.mavuno.input.TextFileInputFormat

-OutputPath=data/sentences


To mine the more advanced patterns from the Tratz parsed documents, run the following:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.mine.HarvestSentences

-PatternPath=data/patterns.txt -CorpusPath=data/wizard-of-oz-tratz/

-CorpusClass=org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

-OutputPath=data/sentences-tratz


The output of the mining process has the format PATTERN\tSENTENCE for each match in the input corpus.
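Since the sentence itself may contain tab characters in the NLP-processed case, splitting on only the first tab is the safe way to read these records back (a small illustration, not part of Mavuno):

```python
# Each output record is PATTERN, a tab, then the matched sentence.
record = "yellow brick\tThey followed the road of yellow brick for some time."

# split("\t", 1) preserves any tabs that occur later in the sentence text.
pattern, sentence = record.split("\t", 1)
```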

Distributional similarity is a simple but powerful method for measuring the similarity between objects. Mavuno supports generic distributional similarity operations using two basic concepts -- patterns and contexts. Simply put, patterns are objects that occur "within" contexts. The precise definition of pattern and context depends on the application. For example, when generating synonyms or paraphrases, patterns are terms or phrases and contexts are the surrounding words. By choosing different definitions of patterns and contexts, a wide variety of tasks can be modeled. Technically speaking, any problem that propagates scores across a bipartite graph (with the two partitions corresponding to the universe of patterns and the universe of contexts) can be implemented.
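A toy version of this idea, with single terms as patterns and the immediately adjacent words as contexts (one simple choice among many; the sentences and extraction rule here are invented for illustration):

```python
import math
from collections import Counter

sentences = [
    "the dog chased the cat",
    "the dog bit the mailman",
    "the puppy chased the cat",
]

# Pattern -> Counter of (left word, right word) contexts it occurs in.
contexts = {}
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        left = words[i - 1] if i > 0 else "<s>"
        right = words[i + 1] if i < len(words) - 1 else "</s>"
        contexts.setdefault(w, Counter())[(left, right)] += 1

def cosine(a, b):
    """Cosine similarity between two sparse context vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

# "dog" and "puppy" share the context (the, chased), so they score high.
sim = cosine(contexts["dog"], contexts["puppy"])
```

Patterns that occur in similar contexts ("dog" and "puppy") end up with similar context vectors, which is exactly what the bipartite propagation below exploits at scale.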


To provide maximum flexibility, Mavuno allows users to implement their own Extractors, which are responsible for taking an input (e.g., a document) and producing (pattern, context) pairs. Given this notion of patterns and contexts, Mavuno provides the following applications (in the edu.isi.mavuno.app.distsim package) that support various distributional similarity / bipartite graph propagation algorithms:


  1. PatternToContext -- given a set of patterns, identify all of the contexts that each pattern occurs in.

  2. ContextToPattern -- given a set of contexts, identify all of the patterns that occur within each context.

  3. PatternToPattern -- the composition of PatternToContext and ContextToPattern.

  4. ContextToContext -- the composition of ContextToPattern and PatternToContext.


By chaining these operations, it's possible to implement a variety of random walk algorithms over the (pattern, context) bipartite graph. The PatternToPattern application can be used for "standard" distributional similarity computations.
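The composition can be sketched as two propagation steps over a small hypothetical co-occurrence graph (the adjacency counts below are made up, and real Mavuno jobs do this with MapReduce rather than in memory):

```python
# Hypothetical (pattern, context) co-occurrence counts.
adjacency = {
    "dorothy":   {"__ said": 3, "the __": 2},
    "scarecrow": {"__ said": 2, "the __": 4},
    "kansas":    {"in __": 5},
}

def pattern_to_context(pattern_scores):
    """Push pattern scores onto the contexts they occur in."""
    ctx = {}
    for p, score in pattern_scores.items():
        for c, w in adjacency.get(p, {}).items():
            ctx[c] = ctx.get(c, 0.0) + score * w
    return ctx

def context_to_pattern(context_scores):
    """Push context scores back onto the patterns occurring in them."""
    pat = {}
    for p, ctxs in adjacency.items():
        for c, w in ctxs.items():
            if c in context_scores:
                pat[p] = pat.get(p, 0.0) + context_scores[c] * w
    return pat

# PatternToPattern: start from "dorothy", walk to contexts and back.
scores = context_to_pattern(pattern_to_context({"dorothy": 1.0}))
```

"scarecrow" picks up weight because it shares both contexts with "dorothy", while "kansas" shares none and receives no score; chaining more such steps yields longer random walks.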


To make use of any of these applications, follow these steps.


1. Create an Examples File


You must create a file that contains the patterns and/or contexts to be used. This is called an examples file. An examples file is a text file of the following form:


ID\tPATTERN\tCONTEXT\tWEIGHT


where ID is an identifier, PATTERN is the pattern, CONTEXT is the context, and WEIGHT is the weight of the example. Note that IDs need not be unique; multiple examples may be specified for a given ID. The file data/examples.txt is a sample examples file.
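Building and reading such a file is straightforward. The rows below mirror the three sample queries scored later in this document; leaving CONTEXT empty for pattern-only examples is our assumption here, not documented behavior:

```python
# ID \t PATTERN \t CONTEXT \t WEIGHT, one example per line.
rows = [
    ("1", "dorothy", "", "1.0"),
    ("2", "tin man", "", "1.0"),
    ("3", "said",    "", "1.0"),
]
text = "\n".join("\t".join(r) for r in rows)

# Reading it back: each line splits into exactly four tab-separated fields.
parsed = [line.split("\t") for line in text.split("\n")]
```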


2. Convert Examples File to a SequenceFile


All of the applications above require their inputs to be Hadoop SequenceFiles. To convert an examples file to a SequenceFile, run the following command:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.util.ExamplesToSequenceFile

-InputPath=data/examples.txt -OutputPath=data/examples-seq


3. Run PatternToPattern


To mine the patterns that are distributionally similar to the input patterns, run the following:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.distsim.PatternToPattern

-PatternPath=data/examples-seq/ -CorpusPath=data/wizard-of-oz.txt

-CorpusClass=edu.isi.mavuno.input.TextFileInputFormat

-ExtractorClass=edu.isi.mavuno.extract.NGramExtractor

-ExtractorArgs="1:3:2:0:l*r" -MinMatches=1 -GlobalStats=true

-OutputPath=data/examples-distsim
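One plausible reading of what an NGramExtractor emits is sketched below. The argument semantics assumed here (n-gram lengths 1 to 3, a 2-word context window on each side, and an "l*r" template joining the left and right windows around a wildcard) are a guess at what "1:3:2:0:l*r" encodes, not documented behavior:

```python
def ngram_pairs(words, min_n=1, max_n=3, window=2):
    """Yield (pattern, context) pairs: n-grams with their l*r contexts."""
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            pattern = " ".join(words[i:i + n])
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + n:i + n + window])
            # "l*r" template: left window, wildcard, right window.
            yield pattern, left + "*" + right

pairs = list(ngram_pairs("the road of yellow brick".split()))
```

For instance, the bigram "yellow brick" is paired with the context "road of*", and every n-gram in the sentence is emitted the same way.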


4. Score the Output Patterns


Finally, the patterns can be scored (using PMI-weighted cosine similarity) by running the following:


hadoop jar mavuno-VERSION.jar edu.isi.mavuno.app.distsim.ComputePatternScores

-InputPath=data/examples-distsim

-PatternScorerClass=edu.isi.mavuno.score.PMIScorer -PatternScorerArgs="pmi"

-ContextScorerClass=edu.isi.mavuno.score.PMIScorer -ContextScorerArgs="pmi"

-OutputPath=data/examples-scored
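The scoring idea can be sketched from scratch on toy counts. This is an illustration of PMI-weighted cosine similarity in general; Mavuno's PMIScorer may differ in its smoothing and normalization details, and the counts below are made up:

```python
import math

# counts[p][c] = number of times pattern p was observed in context c.
counts = {
    "dorothy":  {"__ said": 5, "the __": 1},
    "the girl": {"__ said": 4, "the __": 1},
    "kansas":   {"in __": 9},
}

total = sum(v for ctxs in counts.values() for v in ctxs.values())
pat_totals = {p: sum(ctxs.values()) for p, ctxs in counts.items()}
ctx_totals = {}
for ctxs in counts.values():
    for c, v in ctxs.items():
        ctx_totals[c] = ctx_totals.get(c, 0) + v

def pmi_vector(p):
    """Positive PMI of pattern p with each of its contexts."""
    return {c: max(0.0, math.log(v * total / (pat_totals[p] * ctx_totals[c])))
            for c, v in counts[p].items()}

def cosine(a, b):
    """Cosine similarity between two sparse PMI vectors."""
    dot = sum(a[c] * b[c] for c in a if c in b)
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

score = cosine(pmi_vector("dorothy"), pmi_vector("the girl"))
```

"dorothy" and "the girl" share both contexts with similar PMI profiles and so score close to 1, while "kansas" shares no contexts and scores 0.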


The scored patterns can be found in data/examples-scored/scored-patterns/part-r-00000. By further processing the output, it's possible to see that the five most similar terms to each input pattern are:


1 dorothy 5.419805817391787

1 the scarecrow 0.21643945788500285

1 the lion 0.1418366755604687

1 the girl 0.11421166173986792

1 the woodman 0.08984585843362136

2 tin man 9.851540335858612

2 tiger 1.7966332433453798

2 soldier 1.5389577443319944

2 voice 1.2900409851304906

2 scarecrow 1.2817847828277473

3 said 5.4604850456464735

3 answered 0.43232549644397644

3 replied 0.27724942657202095

3 asked 0.18405891439357025

3 returned 0.15980273736215703


The other pattern and context scorers included with Mavuno are the TFIDFScorer and LikelihoodScorer.


The application edu.isi.mavuno.driver.mine.HarvestParaphraseCandidates is an experimental implementation for harvesting all paraphrases from an input corpus. It should be used with caution.