WASA:

A Simple Word-N-gram Enumeration Tool based on Word-based Suffix Array and LCP Array

Hiroki Arimura

Division of Computer Science, Graduate School of Information Science and Technology,
Hokkaido University
http://www-ikn.ist.hokudai.ac.jp/~arim/arim_en.html


Abstract

WASA is a text mining program for finding all word-N-grams (word phrases) in a given collection of texts using a word-based suffix array and LCP array. As an application, given a minimum frequency threshold S, the program can find all frequent word-N-grams appearing in no fewer than S sentences in the input documents. WASA can also find the best K word-N-grams that optimize a statistical evaluation function, such as Shannon entropy or a classification measure, based on their occurrences in a pair of positive and negative document sets.
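The frequent-N-gram case described above can be illustrated with a minimal sketch (this is not WASA's actual implementation). It builds a word-based suffix array by sorting suffixes of the word sequence; all occurrences of the same N-word prefix then form a contiguous run, whose length is the N-gram's frequency. For brevity the sketch compares prefixes directly instead of using an LCP array, and it counts total occurrences rather than distinct sentences:

```python
def frequent_ngrams(words, n, s):
    """Return each distinct word-n-gram occurring at least s times,
    with its frequency, in lexicographic order."""
    # Word-based suffix array: suffix start positions, sorted by the
    # word sequence starting at each position.
    sa = sorted(range(len(words)), key=lambda i: words[i:])
    result = []
    run_start = 0
    for k in range(1, len(sa) + 1):
        # A run ends when the next suffix differs in its first n words.
        if k == len(sa) or words[sa[k]:sa[k] + n] != words[sa[run_start]:sa[run_start] + n]:
            gram = words[sa[run_start]:sa[run_start] + n]
            # Skip suffixes shorter than n words; keep runs of size >= s.
            if len(gram) == n and k - run_start >= s:
                result.append((tuple(gram), k - run_start))
            run_start = k
    return result

text = "the cat sat on the cat mat the cat sat".split()
print(frequent_ngrams(text, 2, 2))
```

A production implementation would replace the O(n^2 log n) comparison sort with a linear-time suffix array construction and use the LCP array to enumerate all N simultaneously, but the run-based counting idea is the same.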


Documents


Implementation


Notes

Japanese multi-byte texts

WASA is 8-bit clean as a minimum requirement. Although no further support for handling Japanese texts has been implemented yet, you can process Japanese texts with WASA as follows.

Japanese token texts generated by a Japanese lexical analyzer

You can also use WASA to find token sequences in documents that have been preprocessed by a Japanese lexical analyzer such as ChaSen or MeCab, as follows:
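One possible pipeline (a sketch, not taken from the WASA documentation) is to segment the raw text into space-separated words with MeCab's wakati output mode, which makes the token boundaries explicit so that an 8-bit-clean, word-based tool can treat each token as a word. The `wasa` command name and its arguments below are placeholders, since the actual command-line interface is not documented here:

```shell
# Segment Japanese text into space-separated tokens, one sentence
# per line (MeCab's "wakati" output format).
mecab -O wakati input.txt > tokens.txt

# Hypothetical invocation: "wasa" and its arguments are placeholders
# for the tool's real command-line interface.
wasa tokens.txt
```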


Last updated: $Id: index.html,v 1.2 2006/09/07 16:08:05 arim Exp $
E-mail: webmaster @ www-ikn.ist.hokudai.ac.jp