MixupProgram

edu.cmu.minorthird.text.mixup
Class MixupProgram

java.lang.Object
  edu.cmu.minorthird.text.mixup.MixupProgram

All Implemented Interfaces: java.io.Serializable

public class MixupProgram
extends java.lang.Object
implements java.io.Serializable

Modify a text labeling using a series of Mixup expressions.

 BNF:
 STATEMENT -> declareSpanType TYPE
 STATEMENT -> provide ID
 STATEMENT -> require ID [,FILE]
 STATEMENT -> annotateWith FILE
 STATEMENT -> defDict [+case] NAME = ID, ... , ID
 STATEMENT -> defTokenProp PROP:VALUE = GEN
 STATEMENT -> defSpanProp PROP:VALUE = GEN
 STATEMENT -> defSpanType TYPE2 = GEN
 STATEMENT -> defLevel NAME = LEVELDEF
 STATEMENT -> onLevel NAME
 STATEMENT -> offLevel NAME
 STATEMENT -> importFromLevel NAME TYPE = TYPE

 LEVELDEF -> filter TYPE
 LEVELDEF -> pseudotoken TYPE
 LEVELDEF -> split TOKEN
 LEVELDEF -> re 'REGEX'

 GEN -> [TYPE]: MIXUP-EXPR
 GEN -> [TYPE]- MIXUP-EXPR
 GEN -> [TYPE]~ re 'REGEX',NUMBER
 GEN -> [TYPE]~ trie phrase1, phrase2, ... ;

 statements are semicolon-separated
 // and comments look like this (C++ style)

 SEMANTICS:
 execute each command in order, saving spans/tokens as types, and asserting properties
 '=:' can be replaced with '=TYPE:', in which case the expr will be applied to
 each span of the given type, rather than all top-level spans

 defDict FOO = bar,baz,bat stores a lowercase version of each word in the dictionary
 defDict +case FOO = blah,Bar,baZ stores each word in the dictionary, preserving case

 in dictionaries and tries, a double-quoted word "foo.txt" means to
 find foo.txt on the classpath and store all lines from the file as
 words (after trimming them).

 TYPE: MIXUP-EXPR finds all spans inside a span of type TYPE that match the expression
 TYPE- MIXUP-EXPR finds all spans inside a span of type TYPE that do not contain anything matching MIXUP-EXPR
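
The case handling described for defDict can be modelled with a short Java sketch. This is an illustration of the stated semantics only, not the Minorthird implementation; the class name DictSketch and the method buildDictionary are hypothetical and do not exist in the library.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical model of the defDict semantics described above.
// Not part of Minorthird; names are invented for illustration.
public class DictSketch {

    // Without +case, each word is stored in lowercase.
    // With +case (preserveCase == true), each word is stored as written.
    static Set<String> buildDictionary(List<String> words, boolean preserveCase) {
        Set<String> dict = new LinkedHashSet<>();
        for (String word : words) {
            String trimmed = word.trim();
            dict.add(preserveCase ? trimmed : trimmed.toLowerCase(Locale.ROOT));
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("blah", "Bar", "baZ");
        // Models: defDict FOO = blah,Bar,baZ
        System.out.println(buildDictionary(words, false));
        // Models: defDict +case FOO = blah,Bar,baZ
        System.out.println(buildDictionary(words, true));
    }
}
```

Under this model the first call yields the words blah, bar, baz and the second yields blah, Bar, baZ. Loading words from a double-quoted file name is not modelled here.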

Mixup is a pattern-matching language for modifying TextLabels. It can label spans with a given TYPE (the new label for that token span) and assign properties to spans (much like labels, but 'invisible'). More documentation for Mixup programs is available in the package-level documents for Mixup.

Briefly, a Mixup program will look something like this:

 require "req1"; //requires that "req1" type spans have already been labeled.  If not, the default annotator
 //for "req1" will be used.
 require "req2", "req2.mixup"; 
 //file 'req2.mixup' will be run to provide "req2" labels if they are not already there
 //if "req2" labels were already generated by a different annotator, they will be used
 //and 'req2.mixup' won't be called.
 provide "xyz"; //this program will annotate the text with "xyz" labels
 defDict titleWord = mr, ms, mrs, dr; 
 //defines a dictionary (with scope of this program execution) called 'titleWord'
 //containing the values "mr", "ms", "mrs", "dr" 
 defDict myDictionary = "dictionary.txt"; 
 //defines a dictionary called 'myDictionary' with values taken from the file "dictionary.txt"
 defTokenProp title:true =: ... [ai(titleWord)] ... ; //finds all spans matching a word in the dictionary titleWord
 //those spans are given the property "title" with value "true" (a string, not boolean)
 //if the span previously had a "title" property with a different value, that is replaced
 // the "..." before and after indicate that it doesn't matter what comes before or after the token
 //to be labeled.  if I said "=: [ai(titleWord)];" the document would need to be JUST a titleword.
 defTokenProp titlePunc:1 =: ... title:true [','] ... || ... title:true ['.'] ... ;
 //spans "." or "," preceded by a title are given the property titlePunc with value "1"
 //note that the entire '... title:true [','] ...' is an expression; or operators ("||") must be
 // between expressions, not within them
 defSpanType fullTitle =: ...[title:true titlePunc:1?R] ...;
 //label a span as "fullTitle" if there is a title span optionally followed by a titlePunc span
 //but not more than one (from the R)
 defSpanType the =: ... [eqi('the')] ...; 
 //labels occurrences of "the" ignoring case (eq = equals, adding i ignores case)
 defTokenProp aProp:t =: ...[] ...; 
 //tokens which have the title=true property AND are labeled as req1
 //are given the property aProp=t
 defTokenProp address:x =: ... [@fullTitle any] !a(myDictionary) ...; 
 //label spans of one 'fullTitle' (the @ is needed
 //before types) and the following token, whatever it is, 
 // which are followed by something other than a myDictionary word
 defTokenProp capProp:on =req2: ... [re('^[A-Z]$')] ...; 
 //on spans of type req2, match tokens fitting the given regular expression
 defSpanType listSet =: ... [address+R] ...; 
 //label as listSet spans of 1 or more address tokens, going all the way to the
 //rightmost possible token - example: blah address1 address2 address3 blah 
 // - will return three spans: "address3", "address2 address3", and "address1 address2 address3"
 defSpanType adList =: ... [L address+ R] ...; //as above but only returns the longest span
 defSpanType header =: [L address* R] ...; 
 //label longest span of 0 or more address tokens at the beginning of the document
 defSpanType shortList =: ... [address{2,3}] ...; //label spans of 2 or 3 address tokens
 defSpanType xyz =header: ...[capProp] ...; //providing the promised xyz labeling
 //creates a new level where each document is a span with spanType
 defLevel newLevel = filter spanType;
 //creates a new level where tokens of spanType are combined into a single token
 defLevel newLevel = pseudotoken spanType;
 //creates a new level where the textBase is retokenized by splitting at a certain token
 defLevel newLevel = split '.';
 //create a new level where the textBase is retokenized using a regular expression
 defLevel newLevel = re '([^\n]+)';
 //switches current textBase and Labels to Level
 onLevel levelName;
 //returns to root (or original) level - levelName is the name of the child level which you are switching off
 offLevel childLevelName;
 //Imports spans of Type in the child level to spans of newType in the parent level
 importFromLevel childLevelName newType = type;
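
The difference between [address+R] and [L address+ R] in the sample program can be made concrete with a small Java model. This sketch only reproduces the behaviour stated in the comments above (three right-anchored spans versus one longest span); it is not how Minorthird evaluates Mixup expressions, and the class RepeatSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the R and L...R repeat modifiers as described
// in the sample program's comments. Not part of Minorthird.
public class RepeatSketch {

    // Models [address+R]: every run of matching tokens that extends as far
    // right as possible, for every possible left end.
    static List<String> rightAnchored(String[] tokens, boolean[] matches) {
        List<String> spans = new ArrayList<>();
        for (int end = tokens.length; end > 0; end--) {
            boolean rightMaximal =
                matches[end - 1] && (end == tokens.length || !matches[end]);
            if (!rightMaximal) {
                continue;
            }
            for (int start = end - 1; start >= 0 && matches[start]; start--) {
                spans.add(String.join(" ", Arrays.copyOfRange(tokens, start, end)));
            }
        }
        return spans;
    }

    // Models [L address+ R]: only runs that cannot be extended on either side.
    static List<String> longest(String[] tokens, boolean[] matches) {
        List<String> spans = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            if (!matches[i]) {
                i++;
                continue;
            }
            int start = i;
            while (i < tokens.length && matches[i]) {
                i++;
            }
            spans.add(String.join(" ", Arrays.copyOfRange(tokens, start, i)));
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = {"blah", "address1", "address2", "address3", "blah"};
        boolean[] isAddress = {false, true, true, true, false};
        System.out.println(rightAnchored(tokens, isAddress));
        System.out.println(longest(tokens, isAddress));
    }
}
```

For the example sentence, rightAnchored returns "address3", "address2 address3", and "address1 address2 address3", while longest returns only "address1 address2 address3".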

Author: William Cohen
See Also: Serialized Form

Field Summary
`static java.util.Set<java.lang.String>`	`legalKeywords`

Constructor Summary
`MixupProgram()`
`MixupProgram(java.io.File file)` Create a MixupProgram from the contents of a file.
`MixupProgram(java.lang.String program)` Create a MixupProgram from a single string of semicolon-separated statements.
`MixupProgram(java.lang.String[] statements)` Create a MixupProgram from an array of statements.

Method Summary
`void`	`addStatement(Mixup.MixupTokenizer tok, java.lang.String keyword)` Add a single statement to the current mixup program.
`void`	`addStatement(java.lang.String statement)` Add a single statement to the current mixup program.
`Statement[]`	`getStatements()`
`static void`	`main(java.lang.String[] args)` Usage: programFile textFile/directory [outfile]. Evaluates the given program file against the specified data (either a file or a directory of files). If an outfile is specified, it outputs the types as operators to that file.
`java.lang.String`	`toString()` List the program

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait`

Field Detail

legalKeywords

public static java.util.Set<java.lang.String> legalKeywords

Constructor Detail

MixupProgram

public MixupProgram()

MixupProgram

public MixupProgram(java.lang.String[] statements)
             throws Mixup.ParseException

Create a MixupProgram from an array of statements.

Throws: Mixup.ParseException

MixupProgram

public MixupProgram(java.lang.String program)
             throws Mixup.ParseException

Create a MixupProgram from a single string of semicolon-separated statements.

Throws: Mixup.ParseException
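
To show what "semicolon-separated statements" with C++-style comments means for a program string, here is a deliberately naive Java sketch that strips // comments and splits on semicolons. It is an assumption-laden illustration, not the parser MixupProgram uses: the class StatementSplitSketch is hypothetical, and the sketch does not handle "//" or ";" appearing inside quoted strings or regular expressions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified splitter. The real MixupProgram parser is not
// shown in this documentation and may behave differently.
public class StatementSplitSketch {

    static List<String> splitStatements(String program) {
        // Drop everything from "//" to the end of each line.
        // Caveat: this also truncates a line at "//" inside a quoted string.
        StringBuilder withoutComments = new StringBuilder();
        for (String line : program.split("\n")) {
            int comment = line.indexOf("//");
            withoutComments
                .append(comment >= 0 ? line.substring(0, comment) : line)
                .append(' ');
        }
        // Split on semicolons and discard empty pieces.
        List<String> statements = new ArrayList<>();
        for (String piece : withoutComments.toString().split(";")) {
            String statement = piece.trim();
            if (!statement.isEmpty()) {
                statements.add(statement);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        String program =
            "provide \"xyz\"; // promised labels\n"
            + "defDict titleWord = mr, ms, mrs, dr;\n";
        for (String statement : splitStatements(program)) {
            System.out.println(statement);
        }
    }
}
```

Each resulting string is a single statement without its terminating semicolon, which is the general shape of input the String[] constructor and addStatement(String) are documented to take; whether they expect the semicolon to be present is not stated here.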

MixupProgram

public MixupProgram(java.io.File file)
             throws Mixup.ParseException,
                    java.io.FileNotFoundException,
                    java.io.IOException

Create a MixupProgram from the contents of a file.

Throws: Mixup.ParseException; java.io.FileNotFoundException; java.io.IOException

Method Detail

addStatement

public void addStatement(Mixup.MixupTokenizer tok,
                         java.lang.String keyword)
                  throws Mixup.ParseException

Add a single statement to the current mixup program.

Throws: Mixup.ParseException

addStatement

public void addStatement(java.lang.String statement)
                  throws Mixup.ParseException

Add a single statement to the current mixup program.

Throws: Mixup.ParseException

getStatements

public Statement[] getStatements()

toString

public java.lang.String toString()

List the program

Overrides: toString in class java.lang.Object

main

public static void main(java.lang.String[] args)

Usage: programFile textFile/directory [outfile]. Evaluates the given program file against the specified data (either a file or a directory of files). If an outfile is specified, it outputs the types as operators to that file.
