[Go to Hiroki Arimura's home page]
A Simple Word-N-gram Enumeration Tool based on
Word-based Suffix Array and LCP Array
Division of Computer Science,
Graduate School of Information Science and Technology,
WASA is a text mining program for finding all word-N-grams
(word phrases) in a given collection of texts using suffix
arrays. As an application, given a minimum frequency
threshold S, the program can find all frequent
word-N-grams appearing in no fewer than S sentences in the
input documents. WASA can also find the best-K word-N-grams
that optimize a certain statistical evaluation function, such
as Shannon entropy or a classification measure, based on their
occurrences in a pair of positive and negative documents.
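The frequent word-N-gram search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WASA's implementation: it sorts word-level suffixes naively and groups adjacent entries instead of using an LCP array, and the function name frequent_word_ngrams and the parameters min_support and max_n are hypothetical.

```python
from itertools import groupby

def frequent_word_ngrams(sentences, min_support, max_n=3):
    """Map each word-N-gram (N <= max_n) to the number of sentences
    containing it, keeping only those found in >= min_support sentences."""
    # Assumption: words are separated by whitespace, one sentence per string.
    docs = [s.split() for s in sentences]
    # A word-level suffix is identified by (sentence index, word offset).
    suffixes = [(i, j) for i, words in enumerate(docs) for j in range(len(words))]
    # Naive word-based suffix array: sort suffixes by their first max_n words.
    suffixes.sort(key=lambda p: docs[p[0]][p[1]:p[1] + max_n])
    result = {}
    for n in range(1, max_n + 1):
        # Keep only suffixes long enough to contain an n-gram.
        valid = [(i, j) for (i, j) in suffixes if j + n <= len(docs[i])]
        # In sorted order, suffixes sharing the same first n words are adjacent.
        for gram, group in groupby(valid, key=lambda p: tuple(docs[p[0]][p[1]:p[1] + n])):
            support = len({i for i, _ in group})  # count distinct sentences
            if support >= min_support:
                result[gram] = support
    return result
```

For example, with the sentences "a b c", "a b d" and "b c" and min_support 2, the bigrams ("a", "b") and ("b", "c") are kept, while ("b", "d") is dropped because it occurs in only one sentence.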
Japanese multi-byte texts
WASA is 8-bit clean as a minimum requirement. Although I have
not yet implemented any further treatment for handling
Japanese texts, you can process Japanese texts with WASA as follows:
- Use euc-japan encoding for Japanese texts.
- Separate the Japanese characters (2 bytes each) from one
another with a single 1-byte space character.
- Run the WASA program on the transformed input file with no
letter-conversion (option '-u 0') and in
word-mode (option '-w', default).
- Change the encoding of the output file back from euc-japan to the original encoding.
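The first two preparation steps can be sketched in Python as below. This is a minimal sketch under stated assumptions: the source text is taken to be UTF-8, every non-whitespace character (not only 2-byte Japanese ones) is separated by a space, and the helper names space_separate_chars and to_wasa_input are hypothetical and not part of WASA.

```python
def space_separate_chars(text):
    """Put a single ASCII space between adjacent characters on each line.
    Existing whitespace inside a line is dropped; line breaks are kept."""
    return "\n".join(
        " ".join(ch for ch in line if not ch.isspace())
        for line in text.split("\n")
    )

def to_wasa_input(raw_bytes, source_encoding="utf-8"):
    """Decode the original bytes, separate the characters, and
    re-encode the result as EUC-JP (Python codec name 'euc_jp')."""
    text = raw_bytes.decode(source_encoding)
    return space_separate_chars(text).encode("euc_jp")
```

The output N-grams can be converted back the same way, by decoding the result file with "euc_jp" and encoding it in the original encoding.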
Japanese token texts generated by a Japanese lexical analyzer
You can also use the WASA program to find token sequences
in parsed documents preprocessed by a Japanese lexical
analyzer such as "Chasen" or "Mecab", as follows:
- Run Chasen on an input Japanese text. Use euc-japan
encoding for the output file.
- Select a target column (e.g., the first column, which holds
the name of each lexical term) in the output file of Chasen. Then
concatenate all terms in the column into one file,
separated by a single 1-byte space character.
- Run the WASA program on the transformed input file with no
letter-conversion (option '-u 0') and in word-mode
(option '-w', default) as well.
- Filter out useless or inappropriate N-grams from
the file of output N-grams by applying a filter.
- Convert the output file back from euc-japan to the original encoding.
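The column-selection step above might look like the following Python sketch. It rests on assumptions that are not stated on this page: the analyzer output is taken to have one token per line with tab-separated columns and a line consisting of "EOS" at the end of each sentence, and each sentence is written on its own line. The function name tokens_from_analyzer_output and its parameters are hypothetical.

```python
def tokens_from_analyzer_output(lines, column=0, eos_marker="EOS"):
    """Collect one column of the analyzer output and return a list of
    sentences, each a string of tokens joined by single ASCII spaces."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line == eos_marker:
            # Assumed end-of-sentence marker: flush the tokens collected so far.
            if current:
                sentences.append(" ".join(current))
            current = []
            continue
        if not line:
            continue
        fields = line.split("\t")  # assumed tab-separated columns
        if column < len(fields):
            current.append(fields[column])
    if current:  # input ended without a final marker
        sentences.append(" ".join(current))
    return sentences
```

Joining the returned sentences with newlines and encoding the result with Python's "euc_jp" codec gives a file in the form expected by the remaining steps.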
Last updated: $Id: index.html,v 1.2 2006/09/07 16:08:05 arim Exp $
E-mail: webmaster @ www-ikn.ist.hokudai.ac.jp