Friday, August 17, 2012

Perl Tip: Don't Use a Makefile for Your Module

During my internship at SoarTech, I got a chance to learn a lot more about creating Perl modules. I put together a package of scripts for converting old file formats for speech recognition grammars, and I thought it worked beautifully. Of course, to start my module I used the classic tool h2xs:

    h2xs -X -n Foo::Bar

My final code was well tested, and quick to install. I asked my co-worker to install it on his machine, and it was just as easy to use.
I was confident when I presented it to the company, until someone asked, "so, these Perl scripts, they work on Windows, Mac, Linux, etc., right?" I told them they should, since Perl is cross-platform. I became (a good kind of) paranoid, and asked my boss to test my code on his Mac (I was using Windows). The thing exploded when fed my script! I couldn't believe it! What could I have done so wrong?
So, the next day I stayed home a bit to borrow my wife's Mac and do more testing. But there was a big problem: my module used ExtUtils::MakeMaker to install itself. This has been the standard for years, and the majority of CPAN modules use it for installation. The cpan utility recognizes it and runs the installation automatically. However, MakeMaker is DOOMED! It requires an external tool, make, which you can find on every *nix platform, but everywhere else it has to be installed by the user. Strawberry Perl and ActiveState Perl for Windows come with a version (dmake or nmake). But on Mac, you have to install Xcode, a whopping 4-gigabyte distribution for Mac developers.

...

Dangit...

My solution was to follow Michael Schwern's advice and convert to using Module::Build, which does not have external dependencies. There happens to be a converter to help you switch. When I ran it on my code, it didn't produce completely valid output, but the edits I had to make were minimal. From a user standpoint, the module is still installed using the cpan utility, so nothing has changed.

When I put the new distribution, with a shiny new Build.PL file, on a Mac, I still had some failed tests, but they weren't intermingled with the giant BOOM that happens when the cpan utility can't find make. After fixing a bug or two, my module works on Windows and Mac, and my boss is a happy camper.

Friday, June 29, 2012

Graphing grammar parses from CMU Sphinx 4

CMU Sphinx comes with some neat grammar parsing stuff that I never knew about. It uses JSGF (as detailed here) and comes with several demos, showing how to use a basic grammar, arc weights, tags, and even how to get a JavaScript representation of the final parse! At work I've been needing to do some custom processing of the grammar output, but it was more conceptually difficult than I'd planned. So after figuring out how to traverse a parse tree, I decided to write a little application to print out the parse of a given sentence. The Eclipse project for it can be found here.

Given this small grammar:

#JSGF V1.0;
grammar sidTests ;


public <greet> = <greeting> [<person>] [i am <person>];


<greeting> = konnichiwa {language:japanese} | hello {language:english} | guten tag {language:german};


<person> = john {gender:man} | martha {gender:female} | kelly;

If we parse the sentence "konnichiwa kelly i am john", the program outputs the following:



digraph {
"greet-2147483647" [label="greet" color=magenta];
"greet-2147483647" -> "(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646";
"(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" [label="(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )" color=green];
"(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "greeting-2147483645";
"greeting-2147483645" [label="greeting" color=magenta];
"greeting-2147483645" -> "konnichiwa {language:japanese}-2147483644";
"konnichiwa {language:japanese}-2147483644" [label="konnichiwa {language:japanese}" color=green];
"konnichiwa {language:japanese}-2147483644" -> "language:japanese-2147483643";
"language:japanese-2147483643" [label="{language:japanese}" color=red];
"language:japanese-2147483643" -> "konnichiwa-2147483642";
"konnichiwa-2147483642" [label="konnichiwa" color=cadetblue shape=box];
"(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "(<sidTests.person> = kelly)-2147483641";
"(<sidTests.person> = kelly)-2147483641" [label="(<sidTests.person> = kelly)" color=green];
"(<sidTests.person> = kelly)-2147483641" -> "person-2147483640";
"person-2147483640" [label="person" color=magenta];
"person-2147483640" -> "kelly-2147483639";
"kelly-2147483639" [label="kelly" color=green];
"kelly-2147483639" -> "kelly-2147483638";
"kelly-2147483638" [label="kelly" color=cadetblue shape=box];
"(<sidTests.greeting> = konnichiwa {language:japanese}) ( (<sidTests.person> = kelly) ) ( i am (<sidTests.person> = john {gender:man}) )-2147483646" -> "i am (<sidTests.person> = john {gender:man})-2147483637";
"i am (<sidTests.person> = john {gender:man})-2147483637" [label="i am (<sidTests.person> = john {gender:man})" color=green];
"i am (<sidTests.person> = john {gender:man})-2147483637" -> "i-2147483636";
"i-2147483636" [label="i" color=cadetblue shape=box];
"i am (<sidTests.person> = john {gender:man})-2147483637" -> "am-2147483635";
"am-2147483635" [label="am" color=cadetblue shape=box];
"i am (<sidTests.person> = john {gender:man})-2147483637" -> "person-2147483634";
"person-2147483634" [label="person" color=magenta];
"person-2147483634" -> "john {gender:man}-2147483633";
"john {gender:man}-2147483633" [label="john {gender:man}" color=green];
"john {gender:man}-2147483633" -> "gender:man-2147483632";
"gender:man-2147483632" [label="{gender:man}" color=red];
"gender:man-2147483632" -> "john-2147483631";
"john-2147483631" [label="john" color=cadetblue shape=box];
}

which is all a big mess until we run it through GraphViz and see this:

A graph explaining how our sentence was parsed! I color code the parse: green is a RuleSequence, magenta is a RuleParse, light blue is a Token, red is a Tag, yellow (which there strangely aren't any of) is a RuleName.
I notice two very strange things here. First, RuleParses don't have anything as a direct child except for RuleSequences (RuleName is also possible but not shown). So RuleSequences will always be present and may only have one child. Second, text is treated as a sub-component of a tag instead of the other way around. So the text is tagging the tag? I don't know why they designed it that way, but at least now I have a graph of the parse, so I can figure out how to properly process it.
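
The general idea of turning a parse tree into GraphViz input can be sketched as follows. This is not the code from the Eclipse project: Node is a hypothetical stand-in for the Sphinx parse classes, and the short n0, n1, ... ids are a simplification of the label-plus-number ids in the output above. The point it illustrates is that every node needs its own unique id, because two different nodes (like the two "kelly" nodes) can share a label.

```java
import java.util.ArrayList;
import java.util.List;

final class ParseTreeDot {

    /** Hypothetical parse node; NOT a Sphinx or JSAPI class. */
    static final class Node {
        final String label;
        final String color;
        final List<Node> children = new ArrayList<Node>();

        Node(String label, String color) {
            this.label = label;
            this.color = color;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    private int nextId = 0;
    private final StringBuilder out = new StringBuilder();

    /** Renders the tree rooted at the given node as a DOT digraph. */
    static String toDot(Node root) {
        ParseTreeDot writer = new ParseTreeDot();
        writer.out.append("digraph {\n");
        writer.visit(root);
        writer.out.append("}\n");
        return writer.out.toString();
    }

    /** Depth-first walk: declare the node, then each subtree and the edge to it. */
    private String visit(Node node) {
        String id = "n" + nextId++; // unique even when labels repeat
        out.append(id)
           .append(" [label=\"").append(escape(node.label))
           .append("\" color=").append(node.color).append("];\n");
        for (Node child : node.children) {
            String childId = visit(child);
            out.append(id).append(" -> ").append(childId).append(";\n");
        }
        return id;
    }

    /** Escapes backslashes and double quotes so labels stay valid DOT strings. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        // A tiny piece of the parse above: RuleParse -> RuleSequence -> Token
        Node person = new Node("person", "magenta")
                .add(new Node("kelly", "green")
                        .add(new Node("kelly", "cadetblue")));
        System.out.print(toDot(person));
    }
}
```

The output can be fed to GraphViz in the same way as the larger listing above.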

Friday, May 11, 2012

Getting WordNet Verb Frames with JAWS

I love using JAWS to access WordNet. It has a rather extensive API, runs quickly, and doesn't require too much configuration. All you have to do is download the JAWS binary jar and WordNet, and then tell JAWS where the WordNet files are (I will demonstrate this later).
One thing that did take a while to figure out was how to get verb frames from it. A verb frame is an indication of how the verb may be used. For example, the entry for the verb "fax" in WordNet contains the following frames:
  • Somebody ----s something to somebody
  • Somebody ----s somebody something
  • Somebody ----s somebody
  • Somebody ----s something
  • Somebody ----s
Let's look at the WordNet file to see how frames are specified. If you open data.verb, you will see that "sun" occurs twice. The frames are listed like so:

02112546 39 v 04 sun 0 insolate 0 solarize 0 solarise ... 01 + 08 00 | expose to the rays of the sun or affect by exposure to the sun


00104147 29 v 02 sun 0 sunbathe ... 03 + 02 00 + 22 00 + 09 01 | expose one's body to the sun 

The 01 and 03 indicate the number of verb frames, 08, 02, 22, and 09 are frame numbers. The 00's and 01 that follow the frame numbers indicate which words in the synset the numbers apply to. 00 means the frame is applicable to all members. The 01 in the second entry means that frame 9 is only for the word sun, and not for the second word, sunbathe.
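
To make the layout concrete, here is a rough sketch of how that field could be read by hand. It has nothing to do with how JAWS does it internally, and the class and method names are made up for illustration. It assumes the frame field has already been cut out of the line (everything from the frame count up to, but not including, the "|"), that the count and frame numbers are decimal, and that the word number is hexadecimal, which is how I understand the WordNet data file format documentation.

```java
import java.util.ArrayList;
import java.util.List;

final class VerbFrameFields {

    /** One "+ f_num w_num" entry from a data.verb line (hypothetical helper type). */
    static final class FrameRef {
        final int frameNumber; // which generic frame, e.g. 8 or 22
        final int wordNumber;  // 0 = every word in the synset, n = only the n-th word

        FrameRef(int frameNumber, int wordNumber) {
            this.frameNumber = frameNumber;
            this.wordNumber = wordNumber;
        }

        boolean appliesToAllWords() {
            return wordNumber == 0;
        }
    }

    /** Parses a frame field such as "03 + 02 00 + 22 00 + 09 01". */
    static List<FrameRef> parse(String frameField) {
        String[] tokens = frameField.trim().split("\\s+");
        int count = Integer.parseInt(tokens[0]); // number of frames that follow
        List<FrameRef> refs = new ArrayList<FrameRef>();
        int i = 1;
        for (int n = 0; n < count; n++) {
            // Each entry is three tokens: "+", frame number, word number
            if (i + 2 >= tokens.length || !"+".equals(tokens[i])) {
                throw new IllegalArgumentException("Malformed frame field: " + frameField);
            }
            int frame = Integer.parseInt(tokens[i + 1]);
            int word = Integer.parseInt(tokens[i + 2], 16); // assumed hexadecimal
            refs.add(new FrameRef(frame, word));
            i += 3;
        }
        return refs;
    }

    public static void main(String[] args) {
        // Frame field of the "sun, sunbathe" synset shown above
        for (FrameRef ref : parse("03 + 02 00 + 22 00 + 09 01")) {
            System.out.println("frame " + ref.frameNumber
                    + (ref.appliesToAllWords()
                            ? " applies to all words"
                            : " applies only to word #" + ref.wordNumber));
        }
    }
}
```

For the second entry this reports frames 2 and 22 as applying to all words and frame 9 as applying only to word #1, which is sun.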
There are two methods provided by JAWS to get frames. They are both contained in the VerbSynset class:

 /**
  * Returns the sentence frames (if any) associated with this verb meaning.
  * Sentence frames are examples of how the verb can be used / applied, and
  * all the frames returned by this method apply to all word forms in the
  * synset.
  * 
  * @return Sentence frames associated with all word forms in this synset.
  * @see    Format of Lexicographer Files ("Verb Frames")
  */
 public String[] getSentenceFrames();

 /**
  * Returns the sentence frames (if any) that are specific to a particular
  * word form within this synset, where sentence frames are examples of
  * how the word form can be used / applied.
  * 
  * @param  wordForm Word form for which to return sentence frames.
  * @return Sentence frames that are specific to the word form.
  * @see    Format of Lexicographer Files ("Verb Frames")
  */
 public String[] getSentenceFrames(String wordForm);

Keep in mind that the VerbSynset class is completely divorced from the actual orthographic representation of a word, since a synset may belong to several different words. The first method returns all of the frames that apply to every word in the synset, that is, all of the frames marked with a 00 in the data.verb file as shown above. The second method returns only the frames which are marked as being specific to a single orthographic representation, specified by the method's one argument. The return values are complementary, and each is incomplete by itself. However, given only the synset offset or only the word to look up, JAWS is returning as much information as is possible. If you know both the synset number and the orthographic representation of a word you need frames for (and I don't see why you wouldn't), then the getWordFramesComplete method in the program below demonstrates how to get all of the available frames:

package edu.byu.xnlsoar.test;

import java.util.ArrayList;
import java.util.List;

import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.VerbSynset;
import edu.smu.tspell.wordnet.WordNetDatabase;
import edu.smu.tspell.wordnet.impl.file.SampleFrameFactory;
import edu.smu.tspell.wordnet.impl.file.SynsetFactory;
import edu.smu.tspell.wordnet.impl.file.SynsetPointer;

public class DemoFrames {

 private static WordNetDatabase database;
 private static SynsetFactory synsetFactory;
 //initialize everything here
 static{
  System.setProperty("wordnet.database.dir", "./lib/3.0/dict");
  database = WordNetDatabase.getFileInstance();
  synsetFactory = SynsetFactory.getInstance();
 }
 
 /**
  * 
  * @param synsetOffset Synset number to look up frames for
  * @return List of frames for the synset; only returns frames
  * which apply to every word in the synset
  */
 public static List<String> getGeneralSynsetFrames(int synsetOffset){
  SynsetPointer sp = new SynsetPointer(SynsetType.VERB, synsetOffset);
  VerbSynset vSynset = (VerbSynset) synsetFactory.getSynset(sp);
  List<String> frames = new ArrayList<String>(); 
  for(String s : vSynset.getSentenceFrames())
   frames.add(s);
  return frames;
 }

 /**
  * 
  * @param lemma Base form of the word you want to look up
  * @return List of frames for the lemma; only returns those
  * that are specific to a particular word form within each synset.
  */
 public static List<String> getWordFramesSpecific(String lemma){
  List<String> frames = new ArrayList<String>();
  Synset[] synsets = database.getSynsets(lemma,SynsetType.VERB);
  for(Synset synset : synsets){
   for(String s : ((VerbSynset) synset).getSentenceFrames(lemma))
    frames.add(s);
  }
  return frames;
 }
 
 /**
  * This one is more difficult to understand...
  * @param lemma Base form of the word you want to look up
  * @return List of frames for the lemma; only returns those
  * that match every word in each of the synsets that contain this word.
  */
 public static List<String> getWordFramesGeneral(String lemma){
  List<String> frames = new ArrayList<String>(); 
  Synset[] synsets = database.getSynsets(lemma,SynsetType.VERB);
  for(Synset synset : synsets){
   for(String s : ((VerbSynset) synset).getSentenceFrames())
    frames.add(s);
  }
  return frames;
 }

 /**
  * This method is the best. It returns all possible frames
  * given a synset number and the accompanying word.
  * @param lemma Base form of the word you want to look up
  * @param synsetOffset Synset number to look up frames for
  * @return List of frames for the synset; returns all frames
  * for this word within this synset.
  */
 public static List<String> getWordFramesComplete(String lemma, int synsetOffset){
  SynsetPointer sp = new SynsetPointer(SynsetType.VERB, synsetOffset);
  VerbSynset vSynset = (VerbSynset) synsetFactory.getSynset(sp);
  List<String> frames = new ArrayList<String>(); 
  for(String s : vSynset.getSentenceFrames(lemma))
   frames.add(s);
  for(String s : vSynset.getSentenceFrames()) 
   frames.add(s);
  return frames;
 }
 
 /**
  * Prints out several different queries for the frames of "sun"
  */
 public static void main(String[] args) {
  int offset = 104147;//the synset meaning "expose one's body to the sun"
  System.out.println(getGeneralSynsetFrames(offset));//returns 2 frames
  System.out.println(getWordFramesSpecific("sun"));//returns 1 frame
  System.out.println(getWordFramesGeneral("sun"));//returns 3 frames
  System.out.println(getWordFramesComplete("sunbathe",offset));//returns 2 frames
  System.out.println(getWordFramesComplete("sun",offset));//returns 3 frames (different from before)
 }
}

getWordFramesComplete calls both of the available methods in JAWS, retrieving both frames that apply to all words in a synset and the frames that are specific to a single word in the synset.