This package contains an English POS tagger described in 
"Maximum Entropy Tagger with Unsupervised Hidden Markov Models" 
Jun'ichi Kazama, Yusuke Miyao, and Jun'ichi Tsujii NLPRS2001.

COMPATIBLE AND TESTED WITH: GCC3.2 and Amis3(0.29, as of 2003/6/12)

The ./Data directory contains model parameters (HMM and ME model) trained on the WSJ corpus.

To run the tagger, use the script "tagger2.rb" in the ./Ruby directory
("tagger2genia.rb" for the GENIA-specialized version):
% ruby tagger2.rb
% [usage will be printed]
This driver requires the Ruby language (http://www.ruby-lang.org).

A typical usage is as follows:
% ruby tagger2.rb stdin stdout
This enables interactive tagging. After the program prints "Ready", type a sentence and press Return.

or

% ruby tagger2.rb <infile> <outfile>
This analyzes <infile> and writes the results to <outfile>.

This package also contains the code for training HMMs and taggers. However, training a tagger also requires the Amis package (used to train the ME models), which is not included in this package. If you really want to train your own model, please contact the author.


COMPILE (optional):
This package contains binaries for Linux/x86 in the ./bin directory, and the tagger uses them by default. If you would like to compile the programs on your machine (Linux and Solaris have been tested), go to the ./C++ directory and type:

% ./configure --prefix=<this_package>
% make
% make install
(you may have to run automake, autoconf, and autoheader to generate the correct Makefile)

This will install the binaries in the ./bin directory.

LEGAL ISSUES:
All rights reserved by Jun'ichi Kazama. The Artistic License (see the COPYING file) applies to my code. However, please note that this package contains materials from other packages. For these materials, their own copyright statements apply.

List of materials from others:

Data/GENIA/* -- From the GENIA corpus
Data/WSJ/*   -- 
Data/WSJ2/*  -- From the Penn Treebank
(I don't know who owns the rights to parameters estimated from a corpus)

C++/bfstream.{h,cpp} -- From the book by Tom Swan
C++/mt19937-1.cpp  -- From an excellent random number generator by Makoto Matsumoto and Takuji Nishimura.


====================================================================
[Training]

[Training of HMMs]
- The name: "bwm" (the Baum-Welch algorithm)
- Don't trust the help messages printed by "bwm --help"!
  (There are a number of "dead" options)
- We recommend using the Gibbs sampling version of the algorithm, enabled by
  the "-gibbs" option.

.... not yet tested ....
.... explanation continues ...

[Training of ME models]
- You need:
  - an HMM (generated by the above training)
  - the config file
  - raw sentence file (raw mseq format)
  - sentence file     (symbol mseq format)
  - correct tag file  (symbol mseq format)

(config file)
In the config file, you specify feature groups the model
uses, and the options for each feature group. For example,
==================================================
# Close to Ratnaparkhi's feature set + states
StateNGram  3:0 2:0 1:-1,0,1  %0.0,10,0.0,0.0
SymbolNGram 1:-2,-1,0,1,2 %0.0,10,0.0,0.0
Prefix 4:0 3:0 2:0 1:0    %0.0,10,0.0,0.0
Suffix 4:0 3:0 2:0 1:0    %0.0,10,0.0,0.0
Number 1:0                %0.0,10,0.0,0.0
UpperCase 1:0             %0.0,10,0.0,0.0
Hyphen    1:0             %0.0,10,0.0,0.0
PrevFuture                %0.0,10,0.0,0.0
PrevPrevFuture            %0.0,10,0.0,0.0
==================================================

The format is usually as follows:
<feature_group> <position info> %<?>,<cut-off threshold>,<?>,<?>

For example, the line:
----
StateNGram  3:0 2:0 1:-1,0,1  %0.0,10,0.0,0.0
----
means
 - use n-gram of the Viterbi sequence of the HMM 
 - trigram at position 0 (0 means current word, minus means preceding words)
                         (the last of the n-gram is specified as its position.
                           In this case, trigram of <-2,-1,0> is used.)
 - bigram at position 0
 - unigram at position -1, 0, and +1
 - %0.0,10,0.0,0.0 means that for this group
   we omit features occurring fewer than 10 times.
   (please ignore the values other than the second one for now,
    since they are neither tested nor used.)
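
The reading of a config line described above can be sketched in Python as follows. This is only an illustration, not the parser used by the package: the names FeatureGroup, parse_config_line, and windows are hypothetical, and the sketch assumes every line carries a "%" option field with at least two comma-separated values, as in the example config.

```python
from typing import Dict, List, NamedTuple, Tuple


class FeatureGroup(NamedTuple):
    name: str                      # feature group, e.g. "StateNGram"
    ngrams: Dict[int, List[int]]   # n -> positions of the last element of the n-gram
    cutoff: float                  # second value after '%' (the cut-off threshold)


def parse_config_line(line: str) -> FeatureGroup:
    """Parse one non-comment config line (hypothetical helper)."""
    body, _, options = line.partition("%")
    fields = body.split()
    name, specs = fields[0], fields[1:]
    ngrams: Dict[int, List[int]] = {}
    for spec in specs:                       # e.g. "1:-1,0,1"
        n, _, positions = spec.partition(":")
        ngrams[int(n)] = [int(p) for p in positions.split(",")]
    cutoff = float(options.split(",")[1])    # the other values are ignored
    return FeatureGroup(name, ngrams, cutoff)


def windows(group: FeatureGroup) -> List[Tuple[int, ...]]:
    """Expand each n:position pair into the relative word offsets it covers."""
    result = []
    for n, positions in group.ngrams.items():
        for last in positions:
            # The position names the LAST element, so a trigram at 0 is <-2,-1,0>.
            result.append(tuple(range(last - n + 1, last + 1)))
    return result


group = parse_config_line("StateNGram  3:0 2:0 1:-1,0,1  %0.0,10,0.0,0.0")
print(group.name, group.cutoff)
print(windows(group))
```

For the StateNGram line, windows() yields the trigram <-2,-1,0>, the bigram <-1,0>, and three unigrams at -1, 0, and +1. Groups without position info (such as PrevFuture) simply produce an empty list.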

(symbol mseq format)
This format is for expressing the text (a sequence of sentences)
and annotations (a sequence of tag sequences) as
a sequence of (integer) symbol sequences.

The format is:
----
T= <# total symbols> SeqNum= <# sentences> M= <kinds of symbols>\n
(<sym in [0 ... M-1]>+\n)+
----

For example,
==========================================================
T= 148845 SeqNum= 5747 M= 10001
10000 616 202 7 397 6 14 33 129 10000
5 3980 183 2 9 98 7382 5 27 518 105 4 14 51 142 5 1 157 0
18 167 4 69 5 357 1363 184 3
44 45 231 1 102 0 397 34 98 1703 12 149 226 358 893 10 7
5087 6 10000 496 30 518 165 105 3
.....
==========================================================
means this file contains 148845 symbols as a whole; the number of
sentences expressed is 5747, and the size of the vocabulary is 10001
(the last one may be for unknown words).

Note that the equals signs and spaces in the header line must be exactly
as shown above (i.e., don't insert spaces like "T = 1111" and don't omit
spaces like "T=1111").
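
A file can be checked against this format before training with a small Python sketch like the one below. It is not part of the package; parse_symbol_mseq is a hypothetical name, and the sketch assumes that T equals the sum of the lengths of all sequences and that blank lines carry no sentence.

```python
import re
from typing import List, Tuple

# The header must match exactly: "T= <n> SeqNum= <n> M= <n>".
HEADER = re.compile(r"^T= (\d+) SeqNum= (\d+) M= (\d+)$")


def parse_symbol_mseq(text: str) -> Tuple[int, List[List[int]]]:
    """Return (M, sequences) after checking them against the header counts."""
    lines = text.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        raise ValueError("header must look like 'T= 148845 SeqNum= 5747 M= 10001'")
    total, seq_num, m = (int(g) for g in match.groups())
    sequences = [[int(tok) for tok in line.split()]
                 for line in lines[1:] if line.strip()]
    if len(sequences) != seq_num:
        raise ValueError(f"SeqNum= {seq_num}, but {len(sequences)} sequences found")
    found = sum(len(seq) for seq in sequences)
    if found != total:
        raise ValueError(f"T= {total}, but {found} symbols found")
    if any(not 0 <= sym < m for seq in sequences for sym in seq):
        raise ValueError(f"symbols must lie in [0 ... {m - 1}]")
    return m, sequences


sample = "T= 7 SeqNum= 2 M= 5\n4 0 1 2\n3 1 0\n"
m, sequences = parse_symbol_mseq(sample)
print(m, sequences)
```

A header such as "T = 7 SeqNum= 2 M= 5" or "T=7 SeqNum= 2 M= 5" is rejected by the regular expression, which mirrors the spacing rule stated above.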

Users must record the word -> integer mapping in a file for future use.

The corresponding tags must be expressed in this format as a separate
file (.mqs).

(raw mseq format)
This format is for expressing the text as a sequence of string sequences
to be used as the source of information for such features as "suffix" and "prefix".

The format is the same, except that a symbol can be an arbitrary string,
and M= <n> is ignored.
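
As an illustration of how the two formats relate, the following Python sketch turns tokenized sentences into a raw mseq text and a symbol mseq text, and keeps the word -> integer mapping. It is not the procedure used to build the distributed models. The helper names (build_vocab, to_mseq) are hypothetical, and both the frequency-based choice of vocabulary and the use of the last symbol for unknown words are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List


def build_vocab(sentences: List[List[str]], max_words: int) -> Dict[str, int]:
    """Map the max_words most frequent words to 0 ... max_words-1 (assumption)."""
    counts = Counter(word for sent in sentences for word in sent)
    # Sort by descending frequency, then alphabetically, so the result is stable.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked[:max_words])}


def to_mseq(rows: List[List[str]], m: int) -> str:
    """Format rows of whitespace-free tokens with the exact header spacing."""
    total = sum(len(row) for row in rows)
    header = f"T= {total} SeqNum= {len(rows)} M= {m}"
    return "\n".join([header] + [" ".join(row) for row in rows]) + "\n"


sentences = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
vocab = build_vocab(sentences, max_words=3)
unknown = len(vocab)                       # last symbol stands for unknown words
m = len(vocab) + 1

symbols = [[str(vocab.get(word, unknown)) for word in sent] for sent in sentences]
raw_text = to_mseq(sentences, m)           # raw mseq: M= is ignored by the reader
symbol_text = to_mseq(symbols, m)          # symbol mseq: integers in [0 ... M-1]

# Keep the mapping for future use, one "word<TAB>integer" pair per line.
mapping_text = "".join(f"{word}\t{index}\n" for word, index in vocab.items())

print(raw_text)
print(symbol_text)
print(mapping_text)
```

Both texts share the same T= and SeqNum= values because each word in the raw file corresponds to exactly one symbol in the symbol file.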

(Training an ME tagging model)
Example:
============================================================
> makesmap_real -M 160.wo.t10k.r.g-20-20.im100.hmm -me_conf conf.mxpost2.st -raw tate1.train.raw.mseq tate1.train.mseq tate1.train.mqs 160.wo.t10k.r.g-20-20.itr100.memap.conf.mxpost2.st.itr250.smap &
============================================================

You will obtain an ME model (.smap file) which contains all the configuration and the parameters.

(Using the model as a tagger)
Example:
=============================================================
tagging -M 160.wo.t10k.r.g-20-20.im100.hmm -memap 160.wo.t10k.r.g-20-20.itr100.memap.conf.mxpost2.st.itr250.smap -raw tate1.dev.raw.mseq tate1.dev.mseq tate1.dev.vit.mqs
============================================================
assuming that tate1.dev.raw.mseq and tate1.dev.mseq are the raw and symbol mseq format of the text to be analyzed, ".smap" is the estimated ME model, and ".hmm" is the HMM used when the ME model was estimated.


(Training an HMM) --- training using the Baum-Welch algorithm

example:
==================================================================
bwm -r -N 160 -M 23571 -itr_max 100 -gibbs -gibbs_trial 1 -gibbs-pre 20 -gibbs-sample 20 train.mseq 160.r.g.100.hmm >&! 160.r.g-20-20.im100.hmm.err.log &
================================================================== 
 -r means random initialization
 -N is the number of HMM states
 -M is the number of symbols (words). See "symbol mseq format" above.
 -itr_max is the maximum number of training iterations
 -gibbs enables Gibbs sampling to speed up the training.
 -gibbs_pre, -gibbs_sample are parameters for Gibbs sampling
  (written as -gibbs-pre and -gibbs-sample in the example above;
   you might not need to change these from the above setting)
 the second last argument is the document in "symbol mseq format".
 the last argument is the name of the output HMM.

  (note: if you don't set -gibbs, the standard Baum-Welch algorithm will be used)


(Run the HMM) --- Find the most probable state sequences (the Viterbi path)

======================================
viterbim -M 160.r.g.100.hmm test.mseq test.vit.mqs
======================================

 -M specifies the HMM to be used
 the second last argument is the document in "symbol mseq format"
 (M= should be the same as when training the HMM)
 the last argument is the name of the output file for the Viterbi sequences
 (in mseq format, but here a symbol means a state of the HMM).
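
Since the output has one state per input symbol, the two files can be read side by side. The Python sketch below pairs each word symbol with its HMM state. It is an illustration only: read_sequences and align are hypothetical names, the sample contents are invented, and the header of the Viterbi file is simply skipped because its exact M= value is not specified here.

```python
from typing import List, Tuple


def read_sequences(text: str) -> List[List[int]]:
    """Skip the 'T= ... SeqNum= ... M= ...' header and read the integer rows."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines()[1:] if line.strip()]


def align(symbol_text: str, state_text: str) -> List[List[Tuple[int, int]]]:
    """Pair every word symbol with the HMM state on the Viterbi path."""
    symbols = read_sequences(symbol_text)
    states = read_sequences(state_text)
    if len(symbols) != len(states):
        raise ValueError("the two files contain different numbers of sequences")
    aligned = []
    for index, (sym_row, state_row) in enumerate(zip(symbols, states)):
        if len(sym_row) != len(state_row):
            raise ValueError(f"sequence {index}: lengths differ")
        aligned.append(list(zip(sym_row, state_row)))
    return aligned


# Invented sample contents standing in for test.mseq and test.vit.mqs.
test_mseq = "T= 5 SeqNum= 2 M= 6\n5 0 1\n2 3\n"
test_vit = "T= 5 SeqNum= 2 M= 160\n12 7 7\n159 0\n"
for sentence in align(test_mseq, test_vit):
    print(sentence)
```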

Jun'ichi Kazama
kazama@is.s.u-tokyo.ac.jp










