"Mining Sequential Patterns with Regular Expression Constraints"
Abstract
Discovering sequential patterns is an important problem in data mining with a 
host of application domains including medicine, telecommunications,  and the 
World Wide Web.  Conventional  sequential pattern mining systems provide users
with only a very restricted mechanism (based on minimum support) for specifying
patterns  of interest.  As  a  consequence,  the  pattern  mining  process  is
typically characterized by lack of  focus  and  users  often  end  up  paying 
inordinate computational costs just to be inundated with an overwhelming number
of useless results. 
In this paper,  we propose the use of Regular Expressions (REs) as a flexible
constraint  specification  tool that  enables  user-controlled  focus  to be 
incorporated into the pattern  mining  process.  We  develop  a family  of novel
algorithms (termed SPIRIT -- Sequential Pattern mIning with Regular expressIon
consTraints)  for  mining  frequent  sequential  patterns  that  also  satisfy
user-specified  RE  constraints.  The  main  distinguishing  factor  among the 
proposed schemes is the  degree  to which the RE constraints  are  enforced to
prune the search space of patterns during computation.  Our  solutions provide
valuable insights into the tradeoffs that  arise when constraints  that do not
subscribe to nice properties (like  anti-monotonicity) are integrated into the
mining process.  A  quantitative  exploration of these  tradeoffs is conducted 
through an extensive experimental study on synthetic and  real-life data sets. 
The experimental results clearly validate the effectiveness  of our  approach, 
showing that speedups of more than an order of magnitude are possible when RE 
constraints  are pushed deep inside the  mining  process.  Our experimentation 
with real-life data also illustrates the versatility of REs as a user-level tool
for focusing on interesting patterns.
Copyright © 2002, IEEE.
Personal use of this material is permitted. However, permission to 
reprint/republish this material for advertising or promotional purposes or for
creating new collective works for resale or redistribution to servers or lists,
or to reuse any copyrighted component of this work in other works must be 
obtained from the IEEE.