Phoenix is an information extraction engine developed by the University of Würzburg, Dept. It extracts structured information (e.g. addresses, medical cases, ontologies) from any kind of XML document (e.g. unstructured HTML documents or OpenOffice text documents).
Phoenix identifies blocks of information according to a grammar built from XPath expressions, regular expressions, and grouping expressions; the grouping expressions assemble blocks that span more than one sub-tree. Rules with user-defined actions are then applied to these blocks to gather the information they contain and build the result data structures.
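To make the block-and-rule idea concrete, here is a minimal sketch in Python using only the standard library. It does not use Phoenix or its grammar syntax: the sample document, the `ADDRESS` pattern, and the `extract_addresses` function are all hypothetical names chosen for illustration. An XPath expression selects candidate blocks, a regular expression plays the role of a rule, and the action builds a dictionary per match.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample input: well-formed markup with two address entries.
DOC = """<html><body>
<div class="entry"><h2>Alice Example</h2><p>Main Street 5, 97070 Wuerzburg</p></div>
<div class="entry"><h2>Bob Sample</h2><p>Ring Road 12, 10115 Berlin</p></div>
<div class="note"><p>No address here</p></div>
</body></html>"""

# Rule (assumed format): "<street> <number>, <5-digit zip> <city>".
ADDRESS = re.compile(
    r"(?P<street>.+?)\s+(?P<number>\d+),\s*(?P<zip>\d{5})\s+(?P<city>.+)"
)


def extract_addresses(xml_text):
    """Select blocks via XPath, apply a regex rule, and collect records."""
    root = ET.fromstring(xml_text)
    results = []
    # Block identification: each matching <div> groups an <h2> and a <p>
    # sub-tree into one unit of information.
    for block in root.findall(".//div[@class='entry']"):
        name = block.findtext("h2", default="").strip()
        text = block.findtext("p", default="").strip()
        match = ADDRESS.match(text)
        if match:
            # Action: build the result data structure for this block.
            record = {"name": name}
            record.update(match.groupdict())
            results.append(record)
    return results


if __name__ == "__main__":
    for record in extract_addresses(DOC):
        print(record)
```

The sketch keeps selection (XPath), matching (regular expression), and result construction (the action) as separate steps, which mirrors the division described above; how Phoenix itself expresses grouping across several sub-trees is not shown here.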