Sunday, November 6, 2011

List of Japanese NLP tools

I haven't tried all of these, so I don't have comments on everything, but hopefully this list will be useful to someone.

Morphological analyzers/tokenizers



  • Itadaki: a Japanese processing module for OpenOffice. I've done a tiny bit of work and issue documentation on a fork here, and someone forked that to work with a Japanese/German dictionary here.
  • GoSen: Uses sen as a base, and is part of Itadaki; a pure Java version of ChaSen. See my previous post on where to download it from.
  • MeCab: This page also contains a comparison of MeCab, ChaSen, JUMAN, and Kakasi.
  • ChaSen
  • JUMAN
  • CaboCha: Uses support vector machines for dependency structure analysis; morphological analysis is handled by a separate analyzer such as MeCab.
  • Gomoku
  • Igo
  • Kuromoji: Donated to Apache and used in Solr. Looks nice.

Corpora



  • Hypermedia Corpus
  • TüBa-J/S: Japanese treebank from the University of Tübingen. Not as heavily annotated as I'd hoped. You have to send them an agreement to download it, but it's free.
  • GSK: Not free, but very cheap.
  • LDC: Expensive unless your institution is a member.

Other lexical resources



  • Kakasi: Gives readings for kanji compounds.
  • WordNet: Still under development by NiCT. The sense numbers are cross-indexed with those in the English WordNet, so it could be useful for translation. Unlike the English WordNet, though, it has no verb frames.
  • LCS Database: From Okayama University
  • FrameNet: Unfortunately, you can only browse it online.
  • Chakoshi: Online collocation search engine.
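
To make the WordNet point about cross-indexed sense numbers a bit more concrete, here is a minimal sketch of how shared sense IDs could be used to look up translation candidates. Everything in it is an assumption for illustration: the IDs (`syn-0001`, `syn-0002`), the tiny word lists, and the function name `translate_via_synsets` are made up, and none of it reflects the actual WordNet data format or API.

```python
# Toy illustration only: the synset IDs and word lists below are made up,
# not taken from the actual Japanese or English WordNet data.
JA_SYNSETS = {
    "syn-0001": ["犬", "イヌ"],
    "syn-0002": ["猫", "ネコ"],
}
EN_SYNSETS = {
    "syn-0001": ["dog", "domestic dog"],
    "syn-0002": ["cat"],
}


def translate_via_synsets(word, source, target):
    """Return target-language lemmas that share a synset ID with `word`."""
    results = []
    for synset_id, lemmas in source.items():
        if word in lemmas:
            # The same ID in the other language's table points at the same sense.
            for lemma in target.get(synset_id, []):
                if lemma not in results:
                    results.append(lemma)
    return results


if __name__ == "__main__":
    print(translate_via_synsets("犬", JA_SYNSETS, EN_SYNSETS))
```

Because the lookup goes through the shared ID rather than through a bilingual dictionary, the same function works in either direction by swapping the two tables.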