java.lang.Object
  edu.cmu.minorthird.text.mixup.MixupProgram
public class MixupProgram
Modify a text labeling using a series of Mixup expressions.
BNF:

STATEMENT -> declareSpanType TYPE
STATEMENT -> provide ID
STATEMENT -> require ID [,FILE]
STATEMENT -> annotateWith FILE
STATEMENT -> defDict [+case] NAME = ID, ... , ID
STATEMENT -> defTokenProp PROP:VALUE = GEN
STATEMENT -> defSpanProp PROP:VALUE = GEN
STATEMENT -> defSpanType TYPE2 = GEN
STATEMENT -> defLevel NAME = LEVELDEF
STATEMENT -> onLevel NAME
STATEMENT -> offLevel NAME
STATEMENT -> importFromLevel NAME TYPE = TYPE
LEVELDEF -> filter TYPE
LEVELDEF -> pseudotoken TYPE
LEVELDEF -> split TOKEN
LEVELDEF -> re 'REGEX'
GEN -> [TYPE]: MIXUP-EXPR
GEN -> [TYPE]- MIXUP-EXPR
GEN -> [TYPE]~ re 'REGEX',NUMBER
GEN -> [TYPE]~ trie phrase1, phrase2, ... ;

statements are semicolon-separated
// and comments look like this (C++ style)

SEMANTICS:

execute each command in order, saving spans/tokens as types, and asserting properties

'=:' can be replaced with '=TYPE:', in which case the expr will be applied to each span of the given type, rather than all top-level spans

defDict FOO = bar,baz,bat stores a lowercase version of each word in the dictionary

defDict +case FOO = blah,Bar,baZ stores each word in the dictionary, preserving case

in dictionaries and tries, a double-quoted word "foo.txt" means to find foo.txt on the classpath and store all lines from the file as words (after trimming them).

TYPE: MIXUP-EXPR finds all spans inside a span of type TYPE that match the expression

TYPE- MIXUP-EXPR finds all spans inside a span of type TYPE that do not contain anything matching MIXUP-EXPR
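As an illustration of the surface syntax described above (semicolon-separated statements with C++-style `//` comments), the following Java sketch splits program text into statement strings. It is not the tokenizer that MixupProgram uses: the class name `MixupStatementSplitter` and its `split` method are hypothetical, and the sketch assumes that quoted tokens never contain escaped quote characters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of minorthird: a simplified view of how
// Mixup program text divides into statements.
class MixupStatementSplitter {

    /** Splits program text on ';' outside quotes, dropping '//' comments. */
    static List<String> split(String program) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        char quote = 0; // 0 outside a quoted token, otherwise the opening quote character
        for (int i = 0; i < program.length(); i++) {
            char c = program.charAt(i);
            if (quote != 0) {
                // Inside quotes: copy everything, including ';' and '/'.
                current.append(c);
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '\'' || c == '"') {
                quote = c;
                current.append(c);
            } else if (c == '/' && i + 1 < program.length() && program.charAt(i + 1) == '/') {
                // Comment: skip to the end of the line and leave a space in its place.
                while (i < program.length() && program.charAt(i) != '\n') {
                    i++;
                }
                current.append(' ');
            } else if (c == ';') {
                addIfNonEmpty(statements, current);
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        addIfNonEmpty(statements, current); // text after the last ';', if any
        return statements;
    }

    private static void addIfNonEmpty(List<String> statements, StringBuilder text) {
        String trimmed = text.toString().trim();
        if (!trimmed.isEmpty()) {
            statements.add(trimmed);
        }
    }

    public static void main(String[] args) {
        String program = "provide \"xyz\"; // promised labels\n"
                + "defDict titleWord = mr, ms, mrs, dr;\n"
                + "defSpanType the =: ... [eqi('the')] ...;";
        for (String statement : split(program)) {
            System.out.println(statement);
        }
    }
}
```

Tracking the open quote character keeps a semicolon or slash inside a quoted token from being treated as a separator or a comment start, which a plain `String.split(";")` would get wrong.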
Mixup is a matching language for modifying TextLabels. It can label spans with a given TYPE (the new label for that token span) and assign properties to spans (much like labels, but 'invisible'). There is more documentation for Mixup programs in the package-level documents for Mixup.
Briefly, a Mixup program will look something like this:
require "req1"; //requires that "req1" type spans have already been labeled. If not, the default annotator
//for "req1" will be used.
require "req2", "req2.mixup";
//file 'req2.mixup' will be run to provide "req2" labels if they are not already there
//if "req2" labels were already generated by a different annotator, they will be used
//and 'req2.mixup' won't be called.
provide "xyz"; //this program will annotate the text with "xyz" labels
defDict titleWord = mr, ms, mrs, dr;
//defines a dictionary (with the scope of this program execution) called 'titleWord'
//containing the values "mr", "ms", "mrs", "dr"
defDict myDictionary = "dictionary.txt";
//defines a dictionary called 'myDictionary' with values taken from the file "dictionary.txt"
defTokenProp title:true =: ... [ai(titleWord)] ... ; //finds all spans matching a word in the dictionary titleWord
//those spans are given the property "title" with value "true" (a string, not boolean)
//if the span previously had a "title" property with a different value, that is replaced
// the "..." before and after indicate that it doesn't matter what comes before or after the token
//to be labeled. if I said "=: [ai(titleWord)];" the document would need to be JUST a titleword.
defTokenProp titlePunc:1 =: ... title:true [','] ... || ... title:true ['.'] ... ;
//spans "." or "," preceded by a title are given the property titlePunc with value "1"
//note that the entire '... title:true [','] ...' is an expression; or operators ("||") must be
// between expressions, not within them
defSpanType fullTitle =: ...[title:true titlePunc:1?R] ...;
//label a span as "fullTitle" if there is a title span optionally followed by a titlePunc span
//but not more than one (from the R)
defSpanType the =: ... [eqi('the')] ...;
//labels occurrences of "the" ignoring case (eq = equals, adding i ignores case)
defTokenProp aProp:t =: ...[] ...;
//tokens which have the title=true property AND are labeled as req1
//are given the property aProp=t
defTokenProp address:x =: ... [@fullTitle any] !a(myDictionary) ...;
//label spans of one 'fullTitle' (the @ is needed
//before types) and the following token, whatever it is,
// which are followed by something other than a myDictionary word
defTokenProp capProp:on =req2: ... [re('^[A-Z]$')] ...;
//on spans of type req2, match tokens fitting the given regular expression
defSpanType listSet =: ... [address+R] ...;
//label as listSet spans of 1 or more address tokens, going all the way to
//the rightmost possible token - example: blah address1 address2 address3 blah
// - will return three spans: "address3", "address2 address3", and "address1 address2 address3"
defSpanType adList =: ... [L address+ R] ...; //as above but only returns the longest span
defSpanType header =: [L address* R] ...;
//label longest span of 0 or more address tokens at the beginning of the document
defSpanType shortList =: ... [address{2,3}] ...; //label spans of 2 or 3 address tokens
defSpanType xyz =header: ...[capProp] ...; //providing the promised xyz labeling
//creates a new level where each document is a span with spanType
defLevel newLevel = filter spanType;
//creates a new level where tokens of spanType are combined into a single token
defLevel newLevel = pseudotoken spanType;
//creates a new level where the textBase is retokenized by splitting at a certain token
defLevel newLevel = split '.';
//create a new level where the textBase is retokenized using a regular expression
defLevel newLevel = re '([^\n]+)';
//switches current textBase and Labels to Level
onLevel levelName;
//returns to root (or original) level - levelName is the name of the child level which you are switching off
offLevel childLevelName;
//Imports spans of Type in the child level to spans of newType in the parent level
importFromLevel childLevelName newType = type;
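The comments on `address+R` and `L address+ R` in the sample program describe two different result sets for the same run of tokens. The Java sketch below reproduces that documented example under one reading of those comments: the `R` form yields every span that ends at the right edge of a run, and the `L ... R` form yields only the longest span. The class `AddressRunSpans` and its methods are hypothetical and do not reflect how the Mixup matcher is implemented; tokens carrying the address property are modeled as a plain boolean array.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not part of minorthird.
class AddressRunSpans {

    /** Maximal runs of consecutive flagged tokens, as [start, end) index pairs ("L address+ R"). */
    static List<int[]> maximalRuns(boolean[] flagged) {
        List<int[]> runs = new ArrayList<>();
        int runStart = -1; // -1 while not inside a run
        for (int i = 0; i <= flagged.length; i++) {
            boolean inRun = i < flagged.length && flagged[i];
            if (inRun && runStart < 0) {
                runStart = i;
            } else if (!inRun && runStart >= 0) {
                runs.add(new int[] {runStart, i});
                runStart = -1;
            }
        }
        return runs;
    }

    /** Every span of one or more flagged tokens that reaches the right edge of its run ("address+R"). */
    static List<int[]> rightAnchoredSpans(boolean[] flagged) {
        List<int[]> spans = new ArrayList<>();
        for (int[] run : maximalRuns(flagged)) {
            // Shortest suffix first, matching the order used in the sample comment.
            for (int start = run[1] - 1; start >= run[0]; start--) {
                spans.add(new int[] {start, run[1]});
            }
        }
        return spans;
    }

    /** Joins the tokens covered by a [start, end) span with single spaces. */
    static String text(String[] tokens, int[] span) {
        return String.join(" ", Arrays.copyOfRange(tokens, span[0], span[1]));
    }

    public static void main(String[] args) {
        String[] tokens = {"blah", "address1", "address2", "address3", "blah"};
        boolean[] flagged = {false, true, true, true, false};
        for (int[] span : rightAnchoredSpans(flagged)) {
            System.out.println("address+R     : " + text(tokens, span));
        }
        for (int[] span : maximalRuns(flagged)) {
            System.out.println("L address+ R  : " + text(tokens, span));
        }
    }
}
```

For the example tokens this produces the three right-anchored spans listed in the sample comment and a single longest span covering all three address tokens.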
| Field Summary | |
|---|---|
| static java.util.Set<java.lang.String> | legalKeywords |
| Constructor Summary | |
|---|---|
| MixupProgram() | |
| MixupProgram(java.io.File file) | Create a MixupProgram from the contents of a file. |
| MixupProgram(java.lang.String program) | Create a MixupProgram from a single string with a bunch of semicolon-separated statements. |
| MixupProgram(java.lang.String[] statements) | Create a MixupProgram from an array of statements. |
| Method Summary | |
|---|---|
| void | addStatement(Mixup.MixupTokenizer tok, java.lang.String keyword) - Add a single statement to the current mixup program. |
| void | addStatement(java.lang.String statement) - Add a single statement to the current mixup program. |
| Statement[] | getStatements() |
| static void | main(java.lang.String[] args) - Usage: programFile textFile/directory [outfile]. Evaluates the given program file against the specified data (either a file or a directory of files). If an outfile is specified, it outputs the types as operators to that file. |
| java.lang.String | toString() - List the program. |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait |
| Field Detail |
|---|

public static java.util.Set<java.lang.String> legalKeywords

| Constructor Detail |
|---|

public MixupProgram()

public MixupProgram(java.lang.String[] statements)
throws Mixup.ParseException
Throws:
Mixup.ParseException

public MixupProgram(java.lang.String program)
throws Mixup.ParseException
Throws:
Mixup.ParseException

public MixupProgram(java.io.File file)
throws Mixup.ParseException,
java.io.FileNotFoundException,
java.io.IOException
Throws:
Mixup.ParseException
java.io.FileNotFoundException
java.io.IOException

| Method Detail |
|---|

public void addStatement(Mixup.MixupTokenizer tok,
java.lang.String keyword)
throws Mixup.ParseException
Throws:
Mixup.ParseException

public void addStatement(java.lang.String statement)
throws Mixup.ParseException
Throws:
Mixup.ParseException

public Statement[] getStatements()

public java.lang.String toString()
Overrides:
toString in class java.lang.Object

public static void main(java.lang.String[] args)