FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text

Hobbs, Appelt, Bear, Israel, Kameyama, Stickel, Tyson 1996

FASTUS is an information extraction system that processes text with a cascade of five finite-state automata. It was one of the first systems to show that information extraction could be performed <emphasis>without</emphasis> in-depth analysis of the syntax and semantics of the text. Instead, it performs only relatively superficial syntactic analysis and relies on pragmatically driven templates to put everything together. This makes it simpler, faster, and more robust than many competing systems.

The five finite-state automata used by FASTUS are:

  1. Named entity and complex word recognition.
  2. Chunking (noun phrases, verb phrases, and particles).
  3. Complex phrase construction. This phase groups each noun or verb group with its modifiers (PPs, modals, etc.). The result is an entity/event structure.
  4. Event/relation detection. This phase recognizes sequences of phrases and assembles them into entity/event structures. It is defined over (head, phrase-type) pairs, such as (formed, PassiveVerbGroup). Several of the patterns in this stage are created automatically by applying "transformations" to basic patterns (e.g., to derive passive variants).
  5. Merging and coreference. This phase merges different events that were detected in the same sentence, and resolves coreference within and between sentences.
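
To make stage 4 more concrete, here is a minimal Python sketch of a pattern defined over (head, phrase-type) pairs, plus a "transformation" that derives a passive variant from an active pattern. This is an illustration only, not the system's actual implementation: the names Phrase, Pattern, passive_variant, and match, the role labels, the "Preposition" and "NounGroup" phrase types, and the example sentence are all assumptions made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Phrase:
    head: str         # head word, e.g. "formed"
    phrase_type: str  # e.g. "NounGroup", "PassiveVerbGroup" (labels are assumed)
    text: str         # surface text of the phrase


# One pattern element is a (head, phrase_type) pair; a head of None matches any head.
Element = Tuple[Optional[str], str]


@dataclass(frozen=True)
class Pattern:
    event: str                        # event type produced on a match
    elements: Tuple[Element, ...]     # sequence of (head, phrase_type) pairs
    roles: Tuple[Optional[str], ...]  # role filled by each element, or None


def passive_variant(p: Pattern) -> Pattern:
    """Derive a passive pattern from an active subject-verb-object pattern.

    Hypothetical transformation: "X formed Y" becomes "Y was formed by X".
    """
    subj, (verb_head, _), obj = p.elements
    r_subj, _, r_obj = p.roles
    return Pattern(
        event=p.event,
        elements=(obj, (verb_head, "PassiveVerbGroup"), ("by", "Preposition"), subj),
        roles=(r_obj, None, None, r_subj),
    )


def match(p: Pattern, phrases: List[Phrase]) -> Optional[Dict[str, str]]:
    """Return an event structure for the first contiguous match, else None."""
    n = len(p.elements)
    for start in range(len(phrases) - n + 1):
        window = phrases[start:start + n]
        if all((head is None or head == ph.head) and ptype == ph.phrase_type
               for (head, ptype), ph in zip(p.elements, window)):
            event = {"type": p.event}
            for role, ph in zip(p.roles, window):
                if role is not None:
                    event[role] = ph.text
            return event
    return None


# A basic (active) pattern, written by hand.
ACTIVE = Pattern(
    event="JointVenture",
    elements=((None, "NounGroup"), ("formed", "ActiveVerbGroup"), (None, "NounGroup")),
    roles=("agent", None, "venture"),
)
# The passive pattern is generated rather than written by hand.
PASSIVE = passive_variant(ACTIVE)

# Output of the earlier stages for "a joint venture was formed by Acme Corp."
phrases = [
    Phrase("venture", "NounGroup", "a joint venture"),
    Phrase("formed", "PassiveVerbGroup", "was formed"),
    Phrase("by", "Preposition", "by"),
    Phrase("Corp.", "NounGroup", "Acme Corp."),
]
result = match(PASSIVE, phrases)
```

In this sketch only ACTIVE is hand-written; PASSIVE falls out of the transformation, which is the kind of saving the article seems to have in mind when it says several stage-4 patterns are created automatically.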

The earlier automata are meant to be relatively domain-independent, and the later ones encode more domain-dependent information. But it's not clear from the article exactly how much domain knowledge goes into each automaton. For example, presumably even the first stage needs to know which names are relevant for the current domain (company names, drug names, people, etc.).
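
One way to picture this split is a cascade whose stage logic is generic while the data fed to it is domain-specific. The sketch below is an assumption about how that could look, not something the article describes: make_name_stage, run_cascade, the gazetteer format, and the example names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]                       # (text, label)
Stage = Callable[[List[Token]], List[Token]]  # one automaton in the cascade


def make_name_stage(gazetteer: Dict[Tuple[str, ...], str]) -> Stage:
    """Build a stage-1 name recognizer from a domain-specific name list.

    The matching logic is domain-independent; only `gazetteer` changes.
    """
    max_len = max((len(key) for key in gazetteer), default=1)

    def stage(tokens: List[Token]) -> List[Token]:
        out: List[Token] = []
        i = 0
        while i < len(tokens):
            # Try the longest candidate name first (greedy longest match).
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                key = tuple(text for text, _ in tokens[i:i + n])
                if key in gazetteer:
                    out.append((" ".join(key), gazetteer[key]))
                    i += n
                    break
            else:
                # No known name starts here; pass the token through unchanged.
                out.append(tokens[i])
                i += 1
        return out

    return stage


def run_cascade(stages: List[Stage], tokens: List[Token]) -> List[Token]:
    """Feed the output of each stage into the next one."""
    for stage in stages:
        tokens = stage(tokens)
    return tokens


tokens = [(w, "Word") for w in "Acme Corp. formed a joint venture".split()]

# Same stage code, two different (made-up) domains.
business = make_name_stage({("Acme", "Corp."): "Company"})
pharma = make_name_stage({("aspirin",): "Drug"})

business_out = run_cascade([business], tokens)
pharma_out = run_cascade([pharma], tokens)
```

Under this reading, "domain independence" of the first stage would mean that the recognizer code is reused across domains while the name lists are swapped out, which is consistent with the point above that even stage 1 needs some domain knowledge.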


  author =       {Jerry R. Hobbs and Douglas Appelt and John Bear and David
                  Israel and Megumi Kameyama and Mark E. Stickel and Mabry Tyson},
  title =        {FASTUS: A cascaded finite-state transducer for extracting
                  information from natural-language text},
  booktitle =    {Finite-State Language Processing},
  pages =        {383-406},
  publisher =    {MIT Press},
  year =         1996,
  editor =       {Emmanuel Roche and Yves Schabes},
  url =          {citeseer.nj.nec.com/hobbs96fastus.html}