Sunday, November 6, 2011

List of Japanese NLP tools

I haven't tried out all of these so I don't have comments for everything, but hopefully this list will come in useful for someone.

Morphological analyzers/tokenizers

  • Itadaki: a Japanese processing module for OpenOffice. I've done a tiny bit of work and issue documentation on a fork here, and someone forked that to work with a Japanese/German dictionary here.
  • GoSen: Uses sen as a base, and is part of Itadaki; a pure Java version of ChaSen. See my previous post on where to download it from.
  • MeCab: This page also contains a comparison of MeCab, ChaSen, JUMAN, and Kakasi.
  • ChaSen
  • Cabocha: Uses support vector machines for morphological and dependency structure analysis.
  • Gomoku
  • Igo
  • Kuromoji: Donated to Apache and used in Solr. Looks nice.

Corpora

  • Hypermedia Corpus
  • TüBa-J/S: Japanese treebank from the University of Tübingen. Not as heavily annotated as I'd hoped. You have to send them an agreement to download it, but it's free.
  • GSK: Not free, but very cheap.
  • LDC: Expensive unless your institution is a member

Other lexical resources

  • Kakasi: Gives readings for kanji compounds.
  • WordNet: Still under development by NiCT. The sense numbers are cross-indexed with those in the English WordNet, so it could be useful for translation. Unlike the English WordNet, it has no verb frames.
  • LCS Database: From Okayama University
  • Framenet: Unfortunately you can only do online browsing.
  • Chakoshi: Online collocation search engine.

Itadaki GoSen and IPADIC 2.7

    Update3: I've forked the Itadaki project on GitHub to keep track of it better.
    Update2: I made an executable JAR for GoSen that runs the ReadingProcessorDemo. It requires Java 6; just unzip the contents of this zip file to your computer and click on the jar file.
    Update1: The IPADIC dictionary is no longer available from its original location. It has been replaced by the NAIST dictionary. I have edited the following post to reflect the needed changes.

    Itadaki is a software suite for processing Japanese in OpenOffice. GoSen, part of the Itadaki project, is a pure Java morphological analysis tool for Japanese, and I have found it extremely useful in my research. Unfortunately, the page for this project went down recently, making the tools harder to find. Itadaki is still available through Google Code here, but I can't find a separate release of GoSen. The old GoSen website can still be accessed through the Wayback Machine here. The other problem is that GoSen hasn't been updated since 2007, and in its current release it cannot handle the latest release of IPADIC. I'll describe how to fix it in this post.
    Why does it matter that we can't use the latest version of IPADIC? Well, here's an example. I am using GoSen in my thesis work right now, and I put in a sentence which included a negative, past tense verb, such as 行かなかった. It analyzed it as な being used for negation, and かった being the past tense of the verb かう. That is indeed a problem! Using the newer IPADIC fixed it for me, though. To do that, download this modified version of GoSen. The explanation for the fix is here. Basically, a change in the new IPADIC versions to work better with MeCab adds a bunch of commas that break GoSen.

    Edit: Once you've downloaded and unzipped GoSen, run ant in the top directory to build an executable JAR file. Note that if you want javadoc, you'll have to change build.xml so that the javadoc command has 'encoding="utf-8"'. Next, you must download the IPADIC dictionary from its legacy repository, here. Unpack the contents into testdata/dictionary. Change testdata/dictionary/build.xml so that the value of "ipadic.version" is "2.7.0" (the version that you downloaded). Now run ant in this directory to build the dictionary. [If you had errors, you may have forgotten to run ant in the top level directory first.]

    Then, to run a demo and see what amazing things GoSen can do, copy the dictionary.xml file from the testdata/dictionary directory to the dictionary/dictionary directory, go back to the root directory of GoSen, and then run java -cp bin examples.ReadingProcessorDemo testData/dictionary/dictionary.xml. The GoSen site says to run using the testdata folder, but that means you'll have to download the dictionary twice, which is dumb. When you run the above command, you'll get this GUI:

    Notice that it tokenizes the sentence, gives readings, and allows you to choose among alternative analyses. It also gives information on part of speech and inflection.
    To use GoSen in an Eclipse project, add gosen-1.0beta.jar to the project build path. You also need to have the dictionary directory somewhere, along with the dictionary.xml file. This code will get you started:

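    import java.io.IOException;
    import java.util.List;
    import net.java.sen.SenFactory;
    import net.java.sen.StringTagger;
    import net.java.sen.dictionary.Morpheme;
    import net.java.sen.dictionary.Token;
    import edu.byu.xnlsoar.utils.Constants; // my own properties helper; substitute the path to your dictionary.xml
    public class GoSenInterface {
     // Tokenizes a sentence; returns null if the dictionary cannot be read.
     public List<Token> tokenize(String sentence){
      StringTagger tagger = SenFactory.getStringTagger(Constants.getProperty("GOSEN_DICT_CONFIG"));
      try {
       return tagger.analyze(sentence);
      } catch (IOException e) {
       e.printStackTrace();
       return null;
      }
     }
     public static void main(String[] args){
      String sentence = "やっぱり日本語情報処理って簡単に出来ちゃうんだもんな。";
      GoSenInterface dict = new GoSenInterface();
      System.out.println("tokenizing " + sentence);
      List<Token> tokens = dict.tokenize(sentence);
      System.out.println(tokens);
      Morpheme m;
      System.out.println("surface, lemma, POS, conjugation");
      for(Token t : tokens){
       System.out.print(t + ", ");
       m = t.getMorpheme();
       System.out.print(m.getBasicForm() + ", ");
       System.out.print(m.getPartOfSpeech() + ", ");
       // The original listing was cut off here; this call is assumed from the conjugation column in the output.
       System.out.println(m.getConjugationalType());
      }
     }
    }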

    If you run that you will get:
    tokenizing やっぱり日本語情報処理って簡単に出来ちゃうんだもんな。
    [やっぱり, 日本語, 情報処理, って, 簡単, に, 出来, ちゃう, ん, だ, もん, な, 。]
    surface, lemma, POS, conjugation
    やっぱり, やっぱり, 副詞-一般, *
    日本語, 日本語, 名詞-一般, *
    情報処理, 情報処理, 名詞-一般, *
    って, って, 助詞-格助詞-連語, *
    簡単, 簡単, 名詞-形容動詞語幹, *
    に, に, 助詞-副詞化, *
    出来, 出来る, 動詞-自立, 一段
    ちゃう, ちゃう, 動詞-非自立, 五段・ワ行促音便
    ん, ん, 名詞-非自立-一般, *
    だ, だ, 助動詞, 特殊・ダ
    もん, もん, 名詞-非自立-一般, *
    な, だ, 助動詞, 特殊・ダ
    。, 。, 記号-句点, *

    You have plenty of other options while processing, like grabbing alternate readings, etc. Notice that it got one wrong here: ちゃう is a contraction of てしまう, not a verb whose lemma is ちゃう. It doesn't seem to work on contractions because every token needs a surface form. So this might not work well on informal registers such as tweets or blogs unless some preprocessing is done first.
    Feel free to leave any questions or comments.

    Thursday, November 3, 2011

    CS 240 Web Crawler at BYU

    I recently polished off the web crawler project for CS 240 at BYU. It's probably the most talked-about project in the CS major, and the cause of so many students retaking the class.
    The specification for the web crawler assignment can be found here. Basically, given a start URL, the crawler downloads the page, indexes every word on it that is not in a given stop words file, and collects every link on the page; it then follows those links and repeats the process on each linked page, and so on. All of the indexed information is printed out to XML files. The code also has to conform to proper style, and no memory leaks are allowed.
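To make the indexing step concrete, here is a minimal sketch of indexing one page against a stop word list. It is not the project's code: `WordIndex`, `index_page`, and the rule that a word is a run of letters and digits are all assumptions for illustration (the real spec defines words its own way), and downloading pages and following links are left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Hypothetical index type: maps each word to the set of URLs it appears on.
typedef std::map<std::string, std::set<std::string> > WordIndex;

// Adds every non-stop word in `text` to `index` under `url`.
// Assumption: a word is a run of letters/digits, compared in lower case.
void index_page(const std::string& url, const std::string& text,
                const std::set<std::string>& stop_words, WordIndex& index) {
    assert(!url.empty());
    std::string word;
    for (std::size_t i = 0; i <= text.size(); ++i) {
        // Treat the end of the text as a separator so the last word is flushed.
        const unsigned char c =
            i < text.size() ? static_cast<unsigned char>(text[i]) : ' ';
        if (std::isalnum(c)) {
            word += static_cast<char>(std::tolower(c));
        } else if (!word.empty()) {
            if (stop_words.count(word) == 0) {
                index[word].insert(url);
            }
            word.clear();
        }
    }
}
```

A crawler built on this would call `index_page` once per downloaded page and write the finished `WordIndex` out as XML at the end.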
    For those who still need to do the project or haven't taken the following exam yet, I thought I'd post a note or two of help.
    First off, check your constructors! In an initialization for a templatized BST node, I had been invoking the default copy constructor. A copy constructor looks like this:

    T(const T & other)

    In the contained object, I had implemented only operator=, not a copy constructor. My class T had pointer members that pointed to objects allocated on the heap with the new keyword. The default copy constructor copied the pointers themselves, so when the copy of the object of type T was destroyed, the structures its pointers pointed to were deleted too. Since the original object pointed to the same structures, it would then cause a segfault when destroyed, because it would try to delete structures that had already been freed. Ouch!
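For anyone who wants to see the fix spelled out, here is a minimal sketch of a class that owns heap memory and defines all three of copy constructor, operator=, and destructor. `WordEntry` is a hypothetical stand-in for my class T, not the actual project code.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the contained type T: it owns a heap-allocated string.
class WordEntry {
public:
    explicit WordEntry(const std::string& word) : word_(new std::string(word)) {}

    // Copy constructor: allocate a separate string instead of copying the pointer.
    WordEntry(const WordEntry& other) : word_(new std::string(*other.word_)) {}

    // Copy assignment: guard against self-assignment, then deep-copy.
    WordEntry& operator=(const WordEntry& other) {
        if (this != &other) {
            std::string* fresh = new std::string(*other.word_);
            delete word_;
            word_ = fresh;
        }
        return *this;
    }

    ~WordEntry() { delete word_; }

    const std::string& word() const { return *word_; }
    void set_word(const std::string& w) { *word_ = w; }

private:
    std::string* word_;
};

// Copies must not share storage with the original.
void check_copies_are_independent() {
    WordEntry original("crawler");
    WordEntry copy(original);        // invokes the copy constructor
    copy.set_word("changed");
    assert(original.word() == "crawler");

    WordEntry assigned("temp");
    assigned = original;             // invokes operator=
    assert(assigned.word() == "crawler");
}
```

If the copy constructor is removed from this sketch, the compiler-generated one copies `word_` as a raw pointer, and the two destructors then delete the same string, which is the double delete described above.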

    That bug wasted a good 6 hours of my life. Needless to say, I was a little scared of the next assignment: a debugging exam. The class TAs put 4 bugs into our code (they didn't touch comments, asserts, or unit tests), and we had 3 hours to find them. Here's what the TAs did to my code:

    1. In my URL class, I call erase on a string representing a relative URL to get rid of the "../" at the beginning. The correct code is url.erase(0,3), but the TAs changed it to url.erase(0,2).
    2. In my BST Insert method, there is a control structure that determines whether to put a value on a node's left or right, and the TAs changed one of the lefts to a right, i.e. node->left = new BSTNode<T> (v); was changed to node->right = new BSTNode<T> (v);.
    3. I have several boolean flags in an HTMLparser class which keep track of whether processing is inside of a header, title, body, or html tag. They should all be false at the beginning of processing, but one of them was changed to true, e.g.
    4. The last bug was a memory leak. In my linked list Insert method, I declare a linked list node pointer, use a control structure to determine the proper location of the new node, and then set the pointer with a call to new and insert the node in that location. The TAs changed the declaration into a definition that used the new keyword, so every insert allocated one extra node on the heap that was never freed.
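Bug 1 is the kind of off-by-one that a small unit test catches right away. A sketch, assuming a hypothetical helper `strip_parent_prefix` rather than my actual URL class:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: strips one leading "../" from a relative URL.
std::string strip_parent_prefix(std::string url) {
    const std::string prefix = "../";
    if (url.compare(0, prefix.size(), prefix) == 0) {
        url.erase(0, prefix.size());   // erase(0, 3) removes all three characters
    }
    return url;
}

// With erase(0, 2) the first assertion fails, because the slash is left behind.
void test_strip_parent_prefix() {
    assert(strip_parent_prefix("../images/logo.html") == "images/logo.html");
    assert(strip_parent_prefix("index.html") == "index.html");
}
```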
    The first three bugs I was able to find through unit testing. The last one I narrowed down using valgrind and print statements; however, even though it was staring me in the face, I couldn't find it and only got 75% on the exam.
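To show how small the difference in bug 4 is, here is a minimal sketch of a sorted linked list insert. `Node`, `SortedList`, and the `live_nodes` counter are hypothetical and only illustrate the pattern; my real list class looks different, and valgrind does the counting for you.

```cpp
#include <cassert>

// Hypothetical node type that counts how many nodes are currently allocated.
struct Node {
    int value;
    Node* next;
    static int live_nodes;
    explicit Node(int v) : value(v), next(nullptr) { ++live_nodes; }
    ~Node() { --live_nodes; }
};
int Node::live_nodes = 0;

class SortedList {
public:
    SortedList() : head_(nullptr) {}
    ~SortedList() {
        while (head_ != nullptr) {
            Node* next = head_->next;
            delete head_;
            head_ = next;
        }
    }
    SortedList(const SortedList&) = delete;             // copying left out of the sketch
    SortedList& operator=(const SortedList&) = delete;

    void Insert(int v) {
        Node* node;                 // declaration only: nothing allocated yet
        // Leaky variant: writing `Node* node = new Node(v);` here allocates a
        // node that is orphaned when `node` is reassigned below.
        Node** link = &head_;
        while (*link != nullptr && (*link)->value < v) {
            link = &(*link)->next;  // walk to the insertion point
        }
        node = new Node(v);         // the single allocation per insert
        node->next = *link;
        *link = node;
    }

private:
    Node* head_;
};

// After the list is destroyed, no nodes should remain allocated.
void check_insert_does_not_leak() {
    {
        SortedList list;
        list.Insert(3);
        list.Insert(1);
        list.Insert(2);
        assert(Node::live_nodes == 3);
    }
    assert(Node::live_nodes == 0);
}
```

With the leaky variant, the first assertion would see 6 live nodes instead of 3, and the extra three would survive the list's destructor.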

    In case somebody finds the code interesting/useful, I'll post it here (no cheating!). Make with make bin. Run with bin/crawler <start url> <stopwords file> <output file>.