Friday, December 30, 2011

Review: The Development of Language Processing Strategies: A Cross-linguistic Study Between Japanese and English

The Development of Language Processing Strategies: A Cross-linguistic Study Between Japanese and English by Reiko Mazuka
My rating: 4 of 5 stars

This is essentially an updated version of Mazuka's PhD thesis, and it is a significant work on human sentence processing involving data from a head-final and a head-initial language.
Mazuka presents data from sentence processing experiments with English-speaking adults and Japanese-speaking children and adults. She shows that sentence processing strategies are the same in children and adults (though their ability differs with age), and that sentence processing strategies differ cross-linguistically. Her experimental data include probe latency tasks (PLTs) for lexical and semantic information in English and Japanese sentences.
A probe latency task involves a subject listening to a sentence and responding to questions about its contents. In a lexical PLT, the subject is asked whether the sentence contained a specified lexical item. In a semantic PLT, the subject is asked whether the sentence contained a portion with a meaning similar to a specified word or phrase. The experimenter carefully designs sentences that test the subjects' ability to process different structure types. Mazuka's experiments measure both response time and the accuracy of the subjects' responses. Her findings for cross-linguistic processing differences are as follows:
• English speakers showed processing differences for main and subordinate clauses, while Japanese speakers did not.
• English speakers showed different effects for semantic and lexical tasks, while Japanese speakers did not.
• In English speakers, response time for semantic probe latency tasks involving sentence-initial subordinate clauses (an LB structure) increased, while in Japanese speakers it greatly decreased.
• Japanese speakers' response times to both lexical and semantic PLTs involving left-branching and coordinate structures were the same; English speakers showed much larger recency effects in LB than in coordinate sentences.
Hypotheses about the human language processing mechanism which assume a single processing strategy do not account for these data. Japanese speakers process LB structures efficiently, and English speakers process RB structures efficiently. This is impossible in a parser which assumes only one processing strategy, and a parser which could efficiently process both would be too powerful to account for real human data. For Japanese speakers to process LB structures as efficiently as English speakers do RB structures, processing must be done bottom-up instead of top-down. Mazuka therefore hypothesizes that Universal Grammar (UG) contains a parameter which determines whether a language is right- or left-branching (RB or LB), and that this parameter is linked with the processing strategy by specifying whether processing should be done top-down or bottom-up. She also hypothesizes that in English, main and subordinate clauses are processed to a different semantic level at some initial encoding stage, accounting for the differences in English main and subordinate clause PLT results. This needs to be tested further with PLTs involving two-clause sentences beginning with an explicit subordinator in Japanese.
She states that future research is required to determine the exact relationship between her experimental data and the operation of the human sentence parser as she has hypothesized.
Since some languages, such as German, actually branch in different directions for different types of clauses, her hypothesis needs to be revised to account for this. I'm hoping that her hypotheses can be tested in detail in some sort of cognitive modeling system.

View all my reviews

Sunday, November 6, 2011

List of Japanese NLP tools

I haven't tried out all of these so I don't have comments for everything, but hopefully this list will come in useful for someone.

Morphological analyzers/tokenizers



  • Itadaki: a Japanese processing module for OpenOffice. I've done a tiny bit of work and issue documentation on a fork here, and someone forked that to work with a Japanese/German dictionary here.
  • GoSen: Uses sen as a base, and is part of Itadaki; a pure Java version of ChaSen. See my previous post on where to download it from.
  • MeCab: This page also contains a comparison of MeCab, ChaSen, JUMAN, and Kakasi.
  • ChaSen
  • JUMAN
  • Cabocha: Uses support vector machines for morphological and dependency structure analysis.
  • Gomoku
  • Igo
  • Kuromoji: Donated to Apache and used in Solr. Looks nice.

Corpora



  • Hypermedia Corpus
  • TüBa-J/S: Japanese treebank from the University of Tübingen. Not as heavily annotated as I'd hoped. You have to send them an agreement to download it, but it's free.
  • GSK: Not free, but very cheap.
  • LDC: Expensive unless your institution is a member.

Other lexical resources



  • Kakasi: Gives readings for kanji compounds.
  • WordNet: Still under development by NICT. The sense numbers are cross-indexed with those in the English WordNet, so it could be useful for translation. Note, however, that there are no verb frames like there are in the English WordNet.
  • LCS Database: From Okayama University
  • FrameNet: Unfortunately, you can only do online browsing.
  • Chakoshi: Online collocation search engine.
Itadaki GoSen and IPADIC 2.7

    Update3: I've forked the Itadaki project on GitHub to keep track of it better.
    Update2: I made an executable JAR for GoSen that runs the ReadingProcessorDemo. It requires Java 6; just unzip the contents of this zip file to your computer and click on the jar file.
    Update1: The IPADIC dictionary is no longer available from its original location. It has been replaced by the NAIST dictionary. I have edited the following post to reflect the needed changes.

    Itadaki is a software suite for processing Japanese in OpenOffice. GoSen, part of the Itadaki project, is a pure Java morphological analysis tool for Japanese, and I have found it extremely useful in my research. Unfortunately, the page for this project went down recently, making the tools harder to find. Itadaki is still available through Google Code here, but I can't find a separate distribution of GoSen. The old GoSen website can still be accessed through the Wayback Machine here. The other problem is that GoSen hasn't been updated since 2007, and in its current release it cannot handle the latest release of IPADIC. I'll describe how to fix that in this post.
    Why does it matter that we can't use the latest version of IPADIC? Well, here's an example. I am using GoSen in my thesis work right now, and I put in a sentence which included a negative past-tense verb, such as 行かなかった. GoSen analyzed な as being used for negation and かった as the past tense of the verb かう. That is indeed a problem! Using the newer IPADIC fixed it for me, though. To do that, download this modified version of GoSen. The explanation for the fix is here. Basically, a change made in newer IPADIC versions for better compatibility with MeCab adds a bunch of commas that break GoSen.


    Edit: Once you've downloaded and unzipped GoSen, build it as follows:
    1. Run ant in the top directory to build an executable JAR file. (If you want javadoc, you'll have to change build.xml so that the javadoc command has 'encoding="utf-8"'.)
    2. Download the IPADIC dictionary from its legacy repository, here.
    3. Unpack the contents into testdata/dictionary.
    4. Change testdata/dictionary/build.xml so that the value of "ipadic.version" is "2.7.0" (the version that you downloaded).
    5. Run ant in this directory to build the dictionary. (If you get errors, you may have forgotten to run ant in the top-level directory first.)


    Then, to run a demo and see what amazing things GoSen can do, copy the dictionary.xml file from the testdata/dictionary directory to the dictionary/dictionary directory, go back to the root directory of GoSen, and then run java -cp bin examples.ReadingProcessorDemo testData/dictionary/dictionary.xml. The GoSen site says to run using the testdata folder, but that means you'll have to download the dictionary twice, which is dumb. When you run the above command, you'll get this GUI:

    Notice that it tokenizes the sentence, gives readings, and allows you to choose among alternative analyses. It also gives information on part of speech and inflection.
    To use GoSen in an Eclipse project, add gosen-1.0beta.jar to the project build path. You also need to have the dictionary directory somewhere, along with the dictionary.xml file. This code will get you started:

    package edu.byu.xnlsoar.jp.lexacc;
    
    import java.io.IOException;
    import java.util.List;
    
    import edu.byu.xnlsoar.utils.Constants;
    
    import net.java.sen.SenFactory;
    import net.java.sen.StringTagger;
    import net.java.sen.dictionary.Morpheme;
    import net.java.sen.dictionary.Token;
    
    public class GoSenInterface {
    
    	/**
    	 * Tokenizes a sentence using the dictionary configured in dictionary.xml.
    	 */
    	public List<Token> tokenize(String sentence){
    		StringTagger tagger = SenFactory.getStringTagger(Constants.getProperty("GOSEN_DICT_CONFIG"));
    		try {
    			return tagger.analyze(sentence);
    		} catch (IOException e) {
    			e.printStackTrace();
    			System.exit(-1);
    		}
    		return null;
    	}
    
    	public static void main(String[] args){
    		String sentence = "やっぱり日本語情報処理って簡単に出来ちゃうんだもんな。";
    		GoSenInterface dict = new GoSenInterface();
    		System.out.println("tokenizing " + sentence);
    		List<Token> tokens = dict.tokenize(sentence);
    		System.out.println(tokens);
    		Morpheme m;
    		System.out.println("surface, lemma, POS, conjugation");
    		for(Token t : tokens){
    			System.out.print(t + ", ");                    // surface form
    			m = t.getMorpheme();
    			System.out.print(m.getBasicForm() + ", ");     // lemma
    			System.out.print(m.getPartOfSpeech() + ", ");  // part of speech
    			System.out.println(m.getConjugationalType()); // conjugation type
    		}
    	}
    }
    

    If you run that you will get:
    tokenizing やっぱり日本語情報処理って簡単に出来ちゃうんだもんな。
    [やっぱり, 日本語, 情報処理, って, 簡単, に, 出来, ちゃう, ん, だ, もん, な, 。]
    surface, lemma, POS, conjugation
    やっぱり, やっぱり, 副詞-一般, *
    日本語, 日本語, 名詞-一般, *
    情報処理, 情報処理, 名詞-一般, *
    って, って, 助詞-格助詞-連語, *
    簡単, 簡単, 名詞-形容動詞語幹, *
    に, に, 助詞-副詞化, *
    出来, 出来る, 動詞-自立, 一段
    ちゃう, ちゃう, 動詞-非自立, 五段・ワ行促音便
    ん, ん, 名詞-非自立-一般, *
    だ, だ, 助動詞, 特殊・ダ
    もん, もん, 名詞-非自立-一般, *
    な, だ, 助動詞, 特殊・ダ
    。, 。, 記号-句点, *


    You have plenty of other options while processing, like grabbing alternate readings, etc. Notice that it got one analysis wrong here: ちゃう is a contraction of てしまう, not a verb whose lemma is ちゃう. It doesn't seem to work on contractions, because every token needs a surface form. So this might not work well on informal registers such as tweets or blogs unless some pre-processing is done; a rough sketch of that idea follows.
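    As a very rough illustration of that kind of pre-processing, here is a naive sketch that expands ちゃう-style contractions back into てしまう before tokenizing. This is just an illustration (not part of GoSen); a real solution would need context checks and would also have to handle voiced variants like じゃう.

    // Naive contraction expansion before tokenization (illustrative sketch only).
    public class ContractionExpander {
    	// The replacement pairs below are examples, not an exhaustive or safe list.
    	public static String expand(String sentence) {
    		return sentence
    				.replace("ちゃっ", "てしまっ")  // 行っちゃった -> 行ってしまった
    				.replace("ちゃう", "てしまう"); // 出来ちゃう -> 出来てしまう
    	}
    
    	public static void main(String[] args) {
    		System.out.println(expand("やっぱり日本語情報処理って簡単に出来ちゃうんだもんな。"));
    	}
    }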
    Feel free to leave any questions or comments.

    Thursday, November 3, 2011

    CS 240 Web Crawler at BYU

    I recently polished off the web crawler project for CS 240 at BYU. It's probably the most talked-about project in the CS major, and the cause of so many students retaking the class.
    The specification for the web crawler assignment can be found here. Basically, given a start URL, the crawler downloads the page, indexes each word on it (as long as it is not in a given stop-words file), finds every link on the page, follows those links, and repeats. All of the indexed information is printed out to XML files. The code also has to conform to proper style, and no memory leaks are allowed.
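    Just to make that flow concrete (and without giving away any of the actual C++ solution), here is a rough Java sketch of the crawl loop described above. The download, extractWords, and extractLinks methods are placeholders, since writing the real HTML and word parsing is the whole point of the assignment.

    import java.util.*;
    
    public class CrawlerSketch {
    	// Returns an index mapping each word to the set of pages containing it.
    	public static Map<String, Set<String>> crawl(String startUrl, Set<String> stopWords) {
    		Map<String, Set<String>> index = new HashMap<String, Set<String>>();
    		Set<String> visited = new HashSet<String>();
    		Deque<String> frontier = new ArrayDeque<String>();
    		frontier.add(startUrl);
    		while (!frontier.isEmpty()) {
    			String url = frontier.poll();
    			if (!visited.add(url))
    				continue;                              // already downloaded this page
    			String page = download(url);               // placeholder
    			for (String word : extractWords(page))     // index every non-stop word
    				if (!stopWords.contains(word)) {
    					if (!index.containsKey(word))
    						index.put(word, new HashSet<String>());
    					index.get(word).add(url);
    				}
    			frontier.addAll(extractLinks(page, url));  // follow links resolved against url
    		}
    		return index;
    	}
    
    	// Placeholders so the sketch compiles; the real parsing is left out on purpose.
    	static String download(String url) { return ""; }
    	static List<String> extractWords(String page) { return Collections.emptyList(); }
    	static List<String> extractLinks(String page, String base) { return Collections.emptyList(); }
    }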
    For those who still need to do the project or haven't taken the following exam yet, I thought I'd post a note or two of help.
    First off, check your constructors! In an initialization for a templatized BST node, I had been invoking the default copy constructor. A copy constructor looks like this:

    T(const T & other)


    In the contained object, I had only implemented the operator= overload. My class T had pointers in it, and those pointers were to objects allocated on the heap with the new keyword. The default copy constructor copied the pointers, and when the copy of the object of type T was deleted, so were the structures its pointers pointed to. Since the original object pointed to those same structures, it would then cause a segfault when destroyed, because it would try to delete structures that no longer existed. Ouch!

    That bug wasted a good 6 hours of my life. Needless to say, I was a little scared of the next assignment: a debugging exam. The class TAs put 4 bugs into our code (they didn't touch comments, asserts, or unit tests), and we had 3 hours to find them. Here's what the TAs did to my code:

    1. In my URL class, I call erase on a string representing a relative URL to get rid of the "../" at the beginning. The correct code is url.erase(0,3), but the TAs changed it to url.erase(0,2).
    2. In my BST Insert method, there is a control structure that determines whether to put a value on a node's left or right, and the TAs changed one of the lefts to a right, i.e. node->left = new BSTNode<T> (v); was changed to node->right = new BSTNode<T> (v);.
    3. I have several boolean flags in an HTMLparser class which keep track of whether processing is inside of a header, title, body, or html tag. They should all be false at the beginning of processing, but one of them was changed to true, e.g.
      constructor():titleFlag(false),bodyFlag(false),headerFlag(true){...
    4. The last bug was a memory leak. In my linked list Insert method, I declare a linked list node, use a control structure to determine the proper location of the new node, and then set the node with a call to new and insert it in that location. The TAs changed the declaration to a definition which used the new keyword, so I always allocated one extra node on the heap.
    The first three bugs I was able to find through unit testing. The last one I tried to pinpoint using valgrind and print statements; even though it was staring me in the face, I couldn't find it in time and only got 75% on the exam.

    In case somebody finds the code interesting/useful, I'll post it here (no cheating!). Make with make bin. Run with bin/crawler <start url> <stopwords file> <output file>.

    Thursday, September 1, 2011

    String Alignment with Edit Operations

    A common way to measure the distance between two strings is Levenshtein distance: the minimum number of deletions, insertions, and substitutions needed to transform one string into another. Finding the distance between two strings is useful in certain applications such as spell checking (a word processor will suggest dictionary words that are close to your misspelled word). See Wikipedia for more details and an example of a Levenshtein distance calculation.
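    For reference, the dynamic-programming recurrence behind Levenshtein distance (the same one implemented in getDistanceMatrix further down) is:

    \[
    \begin{aligned}
    D(i,0) &= i, \qquad D(0,j) = j,\\
    D(i,j) &= \min\bigl(D(i-1,j) + 1,\; D(i,j-1) + 1,\; D(i-1,j-1) + [\,s_1[i] \neq s_2[j]\,]\bigr),
    \end{aligned}
    \]

    where the bracket term is 1 if the two characters differ and 0 if they match.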
    Another related and also important operation is to find the minimum edit alignment; that is, once the minimum edit distance between the two strings is found, output the sequence of operations that can be used to change one string into the other. For example, if we let C mean "correct", S mean "substitution", D mean "deletion" and I mean "insertion", then the edit alignment between the characters in "construction" and "distortions" would be ISSCCISSCCCCD. Here is an explanation of the alignment:
    • I: Insert a "c" -> cdistortions
    • S: Replace "d" with "o" -> coistortions
    • S: Replace "i" with "n" -> constortions
    • CC: Leave the "st" alone
    • I: Insert "r" -> constrortions
    • S: Replace "o" with "u" -> construrtions
    • S: Replace "r" with "c" -> constructions
    • CCCC: Leave "tion" alone
    • D: Delete the final "s" -> construction
    This sort of alignment operation is commonly used in grading the output of machine translation and speech recognition systems, though the operations are done at the word level instead of the character level. When I needed a quick piece of code to do the alignment in my own application, though, there was none to be found on the internet! So I made my own and am posting it here. The logic for determining the proper string of edit operations is taken from a StackOverflow answer here. And now the code:
    	// These methods need the following imports: java.util.ArrayList, java.util.Collections, java.util.List.
    	/**
    	 * @return List of Operations representing the alignment between list1 and
    	 * list2. The alignment represents operations to change list2 into list1.
    	 */
    	public static List<Operation> levenshteinAllignment(Object[] list1, Object[] list2) {
    		int[][] distanceMatrix = getDistanceMatrix(list1, list2);
    		List<Operation> ops = new ArrayList<Operation>(
    				list1.length > list2.length ? list1.length : list2.length);
    		//think of distance chart as going from bottom left to top right;
    		//current position coordinates; start at top right.
    		int row = list1.length;
    		int col = list2.length;
    		//could have moved to current position from three others; store their scores here.
    		int diag;
    		int left;
    		int below;
    		int current;
    		
    		while (row != 0 || col != 0) {
    			diag = getVal(row-1,col-1,distanceMatrix);
    			left = getVal(row,col-1,distanceMatrix);
    			below = getVal(row-1,col,distanceMatrix);
    			current = distanceMatrix[row][col];
    
    //			   if the value in the diagonal cell (going up+left) is smaller or equal to the
    //			      values found in the other two cells
    				
    //			   AND 
    //			      if this is same or 1 minus the value of the current cell 
    			if(diag <= left && diag <= below && (diag == current || diag == current - 1)){
    //				   then  "take the diagonal cell"
    //		         if the value of the diagonal cell is one less than the current cell:
    				if(diag == current - 1)
    //		            Add a SUBSTITUTION operation (from the letters corresponding to
    //		            the _current_ cell)
    					ops.add(new Operation(Operation.Type.SUBSTITUTION,list1[row-1],list2[col-1]));
    				else
    //			        otherwise: do not add an operation this was a no-operation.
    					ops.add(new Operation(Operation.Type.CORRECT,list1[row-1]));
    				//move diagonally
    				row--;
    				col--;
    			}
    //
    //		    elseif the value in the cell to the left is smaller or equal to the value of
    //		  		the cell below current cell
    //		   	AND 
    //	       	if this value is same or 1 minus the value of the current cell 
    			else if(left < below && (left == current || left == current - 1)){
    //		        add an INSERTION of the cell to the left
    				ops.add(new Operation(Operation.Type.INSERTION,list2[col-1]));
    				//move left
    				col--;
    			}
    //		   	else
    			else{
    //		       take the cell below, add
    //		       Add a DELETION operation
    				ops.add(new Operation(Operation.Type.DELETION,list1[row-1]));
    				//move down
    				row--;
    			}
    		}
    		Collections.reverse(ops);
    		return ops;
    	}
    	private static int getVal(int row, int col, int[][] distanceMatrix){
    		// Out-of-range cells are treated as infinitely distant so the backtrace never chooses them.
    		if(row < 0 || row >= distanceMatrix.length)
    			return Integer.MAX_VALUE;
    		if(col < 0 || col >= distanceMatrix[0].length)
    			return Integer.MAX_VALUE;
    		return distanceMatrix[row][col];
    	}
    	public static class Operation{
    		private Type type;
    		private Object object1;
    		private Object object2;
    		public enum Type{
    			CORRECT,SUBSTITUTION,DELETION,INSERTION
    		}
    		public Operation(Type t, Object o1){
    			type = t;
    			object1 = o1;
    			object2 = null;
    		}
    		public Operation(Type t, Object o1, Object o2){
    			type = t;
    			object1 = o1;
    			object2 = o2;
    		}
    		@Override
    		public String toString(){
    			if(type == Type.SUBSTITUTION)
    				return "Substitution: " + object1.toString() + " -> " + object2.toString();
    			if(type == Type.CORRECT)
    				return "Correct:      " + object1.toString();
    			if(type == Type.DELETION)
    				return "Deletion:     " + object1.toString();
    			if(type == Type.INSERTION)
    				return "Insertion:    " + object1.toString();
    			return null;
    		}
    	}
    
    	/**
    	 * @param list1 array of objects to compare
    	 * @param list2 array of objects to compare
    	 * @return Levenshtein distance between the two arrays.
    	 * This method uses equals(Object o) to compare the objects in the two arrays.
    	 */
    	public static int levenshteinDistance(Object[] list1, Object[] list2) {
    
    		int[][] distance = getDistanceMatrix(list1, list2);
    		return distance[list1.length][list2.length];
    	}
    
    	/**
    	 * 
    	 * @param list1
    	 * @param list2
    	 * @return A completely filled distance matrix; movement from [i-1][j]
    	 *         represents insertion, from [i][j-1] represents deletion, and from
    	 *         [i-1][j-1] represents substitution or no operation.
    	 */
    	private static int[][] getDistanceMatrix(Object[] list1, Object[] list2) {
    		int[][] distanceMatrix = new int[list1.length + 1][list2.length + 1];
    
    		for (int i = 0; i <= list1.length; i++)
    			distanceMatrix[i][0] = i;
    		for (int j = 0; j <= list2.length; j++)
    			distanceMatrix[0][j] = j;
    
    		for (int i = 1; i <= list1.length; i++)
    			for (int j = 1; j <= list2.length; j++)
    				distanceMatrix[i][j] = minimum(distanceMatrix[i - 1][j] + 1,// insertion
    						distanceMatrix[i][j - 1] + 1,// deletion
    						distanceMatrix[i - 1][j - 1]// substitution or correct
    								+ ((list1[i - 1].equals(list2[j - 1])) ? 0 : 1));
    		return distanceMatrix;
    	}
    
    	/**
    	 * Same as Math.min, but returns the minimum of three arguments instead of
    	 * two.
    	 */
    	private static int minimum(int a, int b, int c) {
    		return Math.min(Math.min(a, b), c);
    	}
    
    
    For word level alignment, you can call it on Strings using the split function like so:
    for(Operation o : levenshteinAllignment(
    			"What My house gleams with the light of the moon and your face"
    					.split(" "),
    			"Your house with the light of the the moon and my face"
    					.split(" "))
    		)
    			System.out.println(o);
    
    And the output would be:
    Deletion: What
    Substitution: My -> Your
    Correct: house
    Deletion: gleams
    Correct: with
    Correct: the
    Correct: light
    Correct: of
    Correct: the
    Insertion: the
    Correct: moon
    Correct: and
    Substitution: your -> my
    Correct: face
    Feel free to use and edit this as you like. Many applications disregard the strings that are correct and only output the edit operations, and that should be an easy edit.
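    For instance, a minimal version of that filter could look like the sketch below (assuming it lives in the same class as the code above, so the private type field of Operation is accessible):

    	// Keep only the actual edits, dropping the CORRECT entries.
    	public static List<Operation> editsOnly(List<Operation> ops) {
    		List<Operation> edits = new ArrayList<Operation>();
    		for (Operation o : ops)
    			if (o.type != Operation.Type.CORRECT)
    				edits.add(o);
    		return edits;
    	}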

    Wednesday, August 17, 2011

    Distributional Statistics of Log Values in Linear Domain

    Quite a long title, sorry. Basically, I had a bunch of log values (and they needed to stay logarithmic to avoid underflow) and I wanted to compute distributional statistics on them, like mean, variance and kurtosis. I wasn't sure if it would be valid to compute these kinds of statistics on the numbers as is, so I created a class to do all of the calculations in linear space. I'll post the result here, though there may be bugs (tell me if you find some!).

    One thing to remember is that if a statistic corresponds to a negative number in linear space, its logarithm is undefined; such operations are invalid, and you have to keep this in mind before trying to retrieve any numbers from this class. I hope someone finds this useful!
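    The class below leans on a LogMath helper that I'm not including in this post. The key operation is the standard log-sum-exp trick, which adds two values "as if" they were in linear space while staying in the log domain; here is a minimal sketch of that idea (not necessarily identical to the actual LogMath.addAsLinear used below):

    // Minimal sketch of the log-sum-exp trick behind addAsLinear.
    public class LogSumExpSketch {
    	/** Returns log(exp(a) + exp(b)) without leaving the log domain. */
    	public static double addAsLinear(double a, double b) {
    		double max = Math.max(a, b);
    		if (max == Double.NEGATIVE_INFINITY)
    			return Double.NEGATIVE_INFINITY; // both terms are log(0)
    		// Factor out the larger term so the exponentials cannot overflow or fully underflow.
    		return max + Math.log(Math.exp(a - max) + Math.exp(b - max));
    	}
    
    	public static void main(String[] args) {
    		double logA = -1000.0, logB = -1001.0; // exp(-1000) underflows to 0 as a double
    		// Prints about -999.687, i.e. log(e^-1000 + e^-1001), which naive exponentiation cannot compute.
    		System.out.println(addAsLinear(logA, logB));
    	}
    }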

    package edu.jhu.clsp.ws11.rerank.utils;
    
    import java.util.Arrays;
    
    /**
     * This class returns distributional statistics given a list of numbers. The numbers are assumed to
     * be in logarithmic space, and all of the computation is done on numbers converted from log to linear
     * space; the results are returned again in log space.
     * @author Nate Glenn
     *
     */
    public class LogDistributionalStats {
    	private double[] numbers;
    	private int N;//number of numbers input
    	private double logN;//log(N)
    	private double min; 
    	private double median;
    	private double max;
    	private double mean;
    	private double avgAbsDeviation = 0;
    	private double standardDeviation = 0;
    	private double variance = 0;
    	private double skew = 0;
    	private double kurtosis = 0;
    	private double sum;
    	/**
    	 * Compute statistics on nums. If norm is true, then compute statistics after normalizing
    	 * the array, except for min, mean, and max.
    	 * 
    	 */
    	public LogDistributionalStats(double[] nums, boolean norm){
    		N = nums.length;
    		//must make new array so as to avoid overwriting the input.
    		numbers = new double[N];
    		for(int i = 0; i < numbers.length; i++)
    			numbers[i] = nums[i];
    		logN = Math.log(N);
    		//compute sum, mean, min, and max before normalization (if done at all)
    		sum = sumAsLinear();
    		mean = sum - logN;
    		Arrays.sort(numbers);
    		min = numbers[0];
    		max = numbers[N-1];
    		if(norm)
    			ArrayUtils.minusAll(numbers,sum);
    		
    		double deviation;
    		if(N > 1){
    			for(double d : numbers){
    				deviation = LogMath.linearDifference(mean, d);
    				avgAbsDeviation = LogMath.addAsLinear(avgAbsDeviation, deviation);
    				variance += deviation*2;
    				skew += deviation*3;
    				kurtosis += deviation*4;
    			}
    			variance -= Math.log(N-1);
    			standardDeviation = variance/2;
    			skew -= logN+variance+standardDeviation;
    			//don't do negative 3 calculation here.
    			kurtosis = kurtosis-(logN + 2*variance);
    		}
    		else{
    			for(double d : numbers){
    				deviation = LogMath.linearDifference(mean, d);
    				avgAbsDeviation = LogMath.addAsLinear(avgAbsDeviation, deviation);
    			}
    			variance = Double.NaN;
    			standardDeviation = Double.NaN;
    			skew = Double.NaN;
    			kurtosis = Double.NaN;
    		}
    		avgAbsDeviation -= logN;
    		int mid = N/2;
    		if(N % 2 == 0)
    			median = LogMath.addAsLinear(numbers[mid-1], numbers[mid]) - Math.log(2);
    		else
    			median = numbers[mid];
    	}
    	
    	/**
    	 * @return Log-space value representing the linear-space sum of all values in numbers
    	 */
    	private double sumAsLinear() {
    		double total = 0;
    		for(double d : numbers)
    			total = LogMath.addAsLinear(total, d);
    		return total;
    	}
    
    	public double getMin() {
    		return min;
    	}
    
    	public double getMax() {
    		return max;
    	}
    
    	public double getMean() {
    		return mean;
    	}
    
    	public double getStandardDeviation() {
    		return standardDeviation;
    	}
    	
    	public double getVariance() {
    		return variance;
    	}
    	
    	public double getSkew() {
    		return skew;
    	}
    
    	public double getSum() {
    		return sum;
    	}
    	
    	/**
    	 * Kurtosis is calculated without the usual "minus three" adjustment.
    	 * This is because it is often impossible to convert the result to log space, since
    	 * the final value is so often negative. If you want the minus three back, you can
    	 * subtract it yourself and handle any exceptions (use LogMath.minusAsLinear()).
    	 */
    	public double getKurtosis() {
    		return kurtosis;
    	}
    
    	public double getMedian() {
    		return median;
    	}
    	
    }
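    Here's a quick usage sketch (LogMath and ArrayUtils, which the class depends on, are not included in this post):

    // Usage sketch: feed in log-space values and read the results back out.
    // Assumes this sits in the same package as LogDistributionalStats (edu.jhu.clsp.ws11.rerank.utils).
    public class LogStatsDemo {
    	public static void main(String[] args) {
    		double[] probs = {1e-5, 3e-5, 7e-5, 2e-5};
    		double[] logProbs = new double[probs.length];
    		for (int i = 0; i < probs.length; i++)
    			logProbs[i] = Math.log(probs[i]); // move into log space
    
    		LogDistributionalStats stats = new LogDistributionalStats(logProbs, false);
    		// All results come back in log space; exponentiate to read them in linear space.
    		System.out.println("mean   = " + Math.exp(stats.getMean()));
    		System.out.println("median = " + Math.exp(stats.getMedian()));
    		System.out.println("sum    = " + Math.exp(stats.getSum()));
    	}
    }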
    
    
    

    Friday, July 29, 2011

    Parsing in a Cognitive Modeller

    A colleague of mine recently expressed interest in the research I am doing for an honors thesis. "Syntactic parsing in a cognitive modeling system" isn't exactly a crystal clear expression of my work, after all. I embarrassed myself by throwing out a few cursory statements before trailing off and ending with a "it's kind of hard to explain". Why can't I explain my own work? Partly because I don't explain it very often and therefore I'm bad at it. Another reason is that I have met with opposition to it in the past. Fellow linguists are immediately fascinated with the idea of modeling language use inside of a digital brain; computer scientists remain relatively unimpressed. "There are plenty of blazing fast parsers out there, so why build one just to act like a human? Besides, you don't even know if the computer is doing the same thing we are!" I'm going to use this and the next post to organize my thoughts and try to explain how parsing works in a cognitive modeling system, and why we would even try it in the first place.
    This post will address cognitive modeling, and the next will move on to parsing.
    Let's pretend that some crazy scientist manages to create an unstable time portal, like Will Robinson's, and that before he gets to use it himself a rather valuable computer gets sucked through instead, sending it back to sometime in the 1950s. There, a curious electrical engineer finds it and, realizing its novelty, shares his discovery with the scientists of his academic community. Though they cannot guess where the machine came from, they will try their darndest to figure out how it works.
    They go about this by studying two aspects of the machine: the hardware and the software. When they look inside, they observe the mass of wires, chips, resistors, capacitors, and a myriad of other tiny gadgets all somehow integrated into one perfect system. They measure voltage across different points in the circuitry and observe the flow of power from the battery into the other parts. They analyze samples of it to discover what material it is made of, and run a whole bunch of other tests that are not entirely clear to non-engineers.
    Observing the software, on the other hand, is much less complicated because all they have to do is turn it on. Let's just say Will Robinson's computer was a futuristic Mac of some sort. Then turning it on greets the scientists with a welcome screen and then Will's desktop:
    The scientists are intrigued by the fact that the computer can do all sorts of complicated things, like play music and videos, compress files, typeset documents and manage large spreadsheets, seemingly without doing any work. Putting it in perspective, their own computers look like this. They can surmise, basically, that there is one central system called OS X that runs everything else, and that each of the functions runs in a separate program. They learn that each of the programs runs in a window, that programs are made up of more basic functions such as file management, and that certain things are impossible, like creating files with "?" in the name.
    They also tinker around with the hardware to see how each piece affects the software. Through this they uncover the difference between the RAM and a hard drive, the basic purpose of the CPU, and also that the USB ports transmit information.
    After years of studying the Mac, they attempt to create machines which imitate its functions. One person creates a crude screen, another makes some memory with pathetic storage size, and others draft intricate blueprints explaining how the programs function within the machine. They don't all agree on the underlying mechanisms, so they split and pursue different theories. The end product is several schools of research attempting to build the machine by studying different aspects of it. Will they ever make an actual Macintosh themselves? Not likely, with 1950s technology. How can they even tell if they've gotten it right? They can experiment with their own machine and see if it acts somewhat like the Mac.

    The scientists study the hardware and software inside of the futuristic Mac.
    Now, what does all of this have to do with cognitive modeling?
    Wait, what is cognitive modeling? To explain that, we first need a few definitions:
    • Cognition: the exercise of human intelligence, including both deliberation and action, and performance of the wide variety of tasks that humans participate in.
    • Cognitive Science: the study of minds as information processors, including how information is processed, represented, and transformed.
    Cognitive science attempts to explain cognition, or human intelligence, by discovering (or theorizing about) the processes and structures underlying it at the lowest level, analyzing the brain as if it were a computer. We study the mind in the same way those scientists studied the Mac: either by observing the actual brain (EEG, fMRI) or observing behavior (eye tracking, reaction time), or in some cases even observing what happens when certain parts break or are damaged by sickness. They've found some interesting things. For example, look at the Wikipedia article to see what we know about different types of memory.
    Cognitive scientists study and imitate the physical and behavioral aspects of the human mind.
    Data from these types of experiments contribute to our understanding of the mind, but we still do not completely understand the complex processes that make humans what they are. We can't try our hand at building a human, like the scientists did the Mac, either. Besides technological concerns, there are also ethical ones. Instead, scientists create computer models to simulate human activity. Although there have been many models which simulate single aspects of cognition such as hand-eye coordination or reading, general cognitive frameworks, which model human behavior overall, have also been created.
    These computational cognitive models are extremely useful to researchers because they provide a universal testing ground in which mini theories about cognition can be tested. They form what Allen Newell called Unified Theories of Cognition (UTC). The name basically means that if a researcher has a theory of how one cognitive activity works, then it should fit within the larger, unified framework already tried and tested by the scientific community. Once the mini theory is implemented within the larger framework, experiments can be run using the resulting model, which is guaranteed to exhibit the properties of the larger framework. This has the benefit of constraining the variables in one's own model, making it both easier to design and scientifically more sound.
    There are several other reasons that these models are useful:
    • You don't have to pay the model to take your experiment, nor do you have to pay a technician to scan its brain while it carries out various tasks. Though initial programming and maintenance cost time and money, these models will carry out an infinite number of tasks for free. 
    • Because we can step through the execution of a program, we can see exactly what the model is "thinking" at all times (akin to "think aloud protocol"), allowing true introspection into the nature of the model and the theoretical consequences of its acceptance.
    • Even if a model is worked out meticulously by hand, humans are error prone. Running the program on a computer guarantees accurate evaluation of the model.
    • Models shared by the community provide baseline results, making work from different researchers comparable and making it easier to measure the advancement of the field.
    • Models can be shared with other researchers easily via the internet.
    There are several such frameworks available, including Soar, Allen Newell's creation, and ACT-R, which is more popular and seems to draw more government funding.
    Like the 1950s scientists, researchers in cognition have split into different schools which study different aspects of the mind. The main split is between symbolic and subsymbolic models.

    • Symbolic models focus on the abstract symbol-processing capabilities of humans; we can combine physical patterns into structures and can also produce new expressions by manipulating other structures, e.g. art, language, music.
    • Subsymbolic models focus on the neural properties of the brain; the most widely known is connectionism, which models complex behavior through connections between simple nodes.
    And research thrives and will continue to thrive for a long time, advancing artificial intelligence, medicine, economics, and other sciences related to human behavior. Have we made a working human? No. Do we think we understand them? Yes, but not perfectly. Cognitive models are based on years and years of research and empirical backing. It's safe to say that human behavior can be modeled and even predicted at some level.
    Here are some of the modeling projects that use a general cognitive framework:

    A bunch using the COGENT framework; medical diagnoses, mental rotation, towers of Hanoi.
    Learning to Play Mario using SOAR.
    Pilot modeling using SOAR. Modeling pilot behavior for better mission planning and design.
    Simulated students using ACT-R. The authors evaluate different instructional methods on simulated students.
    Eagle Eye. Oops! That's not real. Maybe some day.

    Language is an extremely complex phenomenon, but it too must be simulated in some way if we are to confirm the validity of our models. More on that next time.

    Sunday, July 24, 2011

    How to say "p.s. you suck" in Japanese

    This post will focus on two very interesting vocabulary items in Japanese; I call them "p.s. you suck" suffixes, though I'm sure there is a better word to describe them.
    Adding a め to a noun insults the noun it is suffixed to:

    • 馬鹿め!["baka-me", idiot!]
    • この小娘め!["kono komusume-me!", stupid girl!]
    • 太郎め! ["Tarou-me!", dang Tarou!]

    The verb ["yagaru"] attaches to the base II or connective form of verbs and adds the same feeling as the め suffix:

    • 覚えてやがれ!["oboeteyagare!", remember this you fool!]
    • 罠にかかりやがって... ["wana ni kakariyagatte...", getting caught in a trap, you're so stupid.]
    • 何を言いやがるんだ! ["nani o iiyagarunda!", what the heck are you saying?!]

    Both of these forms are used when yelling at or cursing some entity.
    I personally don't use them a lot, but they are very common in anime, probably because of the genre's dramatic nature. Here's an example of やがる. It's the last word the villain says (besides きら, a sparkly sound akin to "bwing" or "bling"). Note that the subtitle should be "Remember this!", not "I'll remember this!".

    Thursday, July 21, 2011

    Japanese Stackexchange

    Stackoverflow is one of those sites born of an idea so golden you can hardly believe nobody has ever done it before. Using a simple system of reputation, privileges and badges, users are encouraged to post answers to on-topic questions for the benefit of the asker and the internet community at large. I go there whenever I need help with a random programming problem. Questions are usually answered quickly because people want the points, and I don't have to feel bad about bothering anyone because asking a question gives someone the opportunity to build their reputation. Started in 2008, the site was immediately a phenomenal success and is still alive and well today. Though Stackoverflow is focused on computer programming, the launch of Stack Exchange allowed the growth of other sites on a variety of topics. Area 51 is sort of a sandbox for people to propose new sites and build communities to nourish and govern those sites.
    Recently I was excited to find that a Japanese Stack Exchange beta site was ready to open. On opening day I was hooked! I answered a large portion of the questions that were answered that day. In fact, I was so hooked that I had to stop posting... I think that part of the success of Stack Exchange is that the design creates motivation to answer questions in a manner similar to addictive online games; reputation points aren't worth anything in real life, but somehow racking 'em up is such a satisfying experience that one keeps going back for more. I consider harnessing the power of addictive point systems for the good of mankind to be a praiseworthy accomplishment, but I must keep myself from getting caught up in it. I will, however, not stop myself from asking questions, because however fun that may be, it is not as addictive.
    Anyway, I just wanted to put in a good word for the new Japanese stackexchange site. If it doesn't get enough traffic it will be shut down, so tell everyone you know who wants to learn Japanese to start asking questions! Or, if you know Japanese, start answering.  The best way to learn is to teach. Plus you get this cool banner:


    [Banner: profile for Nate Glenn at Japanese Language and Usage, Q&A for students, teachers, and linguists wanting to discuss the finer points of the Japanese language]

    Friday, June 10, 2011

    Some Quốc Ngữ History

    I had the privilege of studying Vietnamese for two semesters at my university. During the course of my studies, I had to learn the system of writing Vietnamese known as Quốc Ngữ. Seeing as how the older writing system (Chữ Nôm), which employed Chinese characters, was used up until the beginning of the 20th century, I could not understand why the spelling of such a new system was so irregular. Recently, however, I began reading An Introduction to Vietnamese Literature (Durand and Huan, 1985), which contains a section on the evolution of the Quốc Ngữ system.
    Quốc Ngữ was put together by Spanish, Italian and Portuguese missionaries, using spelling rules they were already familiar with. Take g and gh, for example. In Italian, g is pronounced hard unless it appears in gi or ge, in which case it makes a j sound; ghe and ghi are used to make a hard g in those cases. This carried over into Vietnamese spelling rules: gi now makes a z or y sound, depending on the dialect, while gh is used for a hard g sound before i and e, e.g. ghế, ghi.
    These and many other irregularities were noticed by Vietnamese and French linguists early in the adoption of Quốc Ngữ. Several times reforms were put forward for consideration, but all of them came to nothing. For example, in 1902, at the First International Congress of Far Eastern Studies in Hanoi, Vietnam's first Transliteration Board suggested the following improvements:

    • g should be kept for the series ga, ge, go, gó, gu, gú; gi should be replaced by j;
    • the k sound should always be spelled k, in order to avoid having ca and co on the one hand and ke and ki on the other.
    • â, the sound of which is close to o, should be replaced by hooked a;
    • y should be substituted for i in ly, ky, my, etc.;
    • ch should be written as č or c, as h should be reserved for aspiration in kh, th, etc. e.g. co for cho ("market"), khac for khach ("guest, foreigner");
    • d should be replaced by z;
    • n should be used in place of nh (ban for banh ["cake, bread"]);
    • w should be used to denote the semi-vowel (kwa rather than qua ["to pass"]);
    • s should signify a retroflex s;
    • x, the sound of which is close to that of the palatal sibilant, should be replaced by ç.


      All of the changes made sense phonemically (provided that the change for n occurred only word-finally), but none of them caught on. Supposedly, force of habit, along with the difficulty of altering fonts for printing, was the cause.
      I wonder how much Vietnamese has changed since then. I also wonder how the petrification of quoc ngu irregularities may have affected modern Vietnamese; spelling, when internalized by a native speaker, can affect pronunciation (sometimes by overcorrection). My teacher was adamant that I pronounce khach with a ch sound at the end.
      The book is on google here.