Sunday, January 22, 2012

As far as I got

What follows is the partial first draft of my PhD thesis.  I stopped working on it in late 2008 because events overtook me, I guess.  The company I was working for, and doing the research for, at once insisted I stop pursuing my degree (in terms of allowing me the flexibility of schedule to attend school), and also pulled their support for the research (meaning all the work I'd done was not usable as it was based on data from that company).  I'd gotten to ABD, done all my coursework, developed the project code, done the tests and everything, and had spent about two months finding references and starting to type up the dissertation when the plug got pulled, so to speak.  I didn't have any means to continue on my own, either financial, or in terms of reworking the research.

Anyway, I don't figure any good will ever come of this, but I felt like putting it somewhere for posterity.  It's probably not all that interesting reading.

A quick summary of what the project was: I'd developed an engine that would ingest large volumes of text documents (scholarly research articles destined for journal publication, mostly), and 'read' them.  It would sort them into what journal they belonged in, what type of article they were, and then it would do contextual markup on the contents of the documents, reading them for meaning and placing tags with keywords around parts of the document.  This allowed them to be placed in a database and then published automatically, with correct formatting, online or in print.  It was pretty cool.  It would have saved that company a lot of money.  I don't think they understood what it was or what it could do.

I haven't bothered to format it or anything, and I've pulled the references, just because.  It's also not complete, for reasons mentioned above.


Self-Calibrating Classification
Dave DeHetre
Department of Electrical Engineering and Computer Science
University of Kansas
Table of Contents
List of Figures
List of Tables
Chapter 1. Introduction
In the field of classification, there are many methods that are highly successful in a particular domain. However, in the larger picture, and to a certain degree in most real world applications, the goal of classification has to be the development of a general purpose classifier. Such a classifier needs to take the form of a 'what is this?' device: one that can be presented with any input, and produce an identification of what that input represents. In a sense, the goal is to produce a system that can augment or replace a human being in a decision making process.
In general, this type of classifier should be referred to as a self-calibrating classifier, in that it dynamically adapts the methods used and the features considered when classifying. The configuration of the self-calibration is an issue that will require some experimentation to perfect, and that is the area of consideration for my research and dissertation.
I propose to establish a set of baseline tests using a spectrum of input types and classification methodologies. This will include mixed domain data as well as single-type and hybrid classification schemes.
Once it has been established what the performance capabilities are of current classifiers, the research should progress to a second stage of building a best-case classifier for the mixed domain data. This will take the form of a compound classifier suite, using the results from the first stage of the research to determine what classification schemes worked best for the general/mixed domain data. This suite will probably use a sequence of classifying steps, doing refinement and branching of methods after each step. This will effectively give a current state of the art classifier, and should be useful in its own right, with any luck being useable as a crude self-calibrating classifier, at least for certain (possibly constrained) domains.
Using the compound system as a baseline, the next step will be to build a classification engine that can adapt and reconfigure on the fly. It will most likely build upon the previous system, but will further the capabilities by using a feedback loop to select and configure the optimum configuration to use with any particular sequence of inputs. This will work in two modes. The first is a setup time configuration, in which the engine is 'trained', in the same way that a neural network is, to perform in an optimized way over training data. The second mode is dynamic, run time configuration, adapting as live data comes in and potentially evolves in its format. This will require some sort of performance feedback system to be implemented.
Basically all of this is currently being done; it is just that much of the configuration and implementation is still domain dependent and is essentially fixed and hard coded. Usually the classification is also based around a single classification scheme applied in a single step.
So: the new art involved in this proposed system is the combination of various classification schemes into a multi-stage classifier, and the dynamic and automatic configuration of the classifier(s).
The motivation for developing a self-calibrating classifier is that, while there are many functional and useful classification schemes in existence, they generally share the attribute of being highly domain dependent. So much so that even noise in the data can completely scuttle these systems.
Current classification systems are of limited usefulness due to the fact that data sets are growing ever larger and more complex and more 'noisy'. There are vast amounts of 'data' available, however, in most cases it is not in a form that allows easy digestion. This seems to be an ideal application for a classifier; to find useful information in large bodies of data, but current methods are too fragile and domain specific to be of much use.
Real intelligences (biological) have very little difficulty adapting to domain changes in data. Imagine, for example: a child is given a long list of flash cards and learns to name which animal appears on each of them. When this child, in the normal testing sequence is presented instead with an ice cream cone, the child can recognize trivially that it is not a flash card but is instead food. There is very little delay before the child responds appropriately and takes the food and eats it. Although it has not been formally tested, it would be very surprising if the child spent any time at all trying to figure out which animal the ice cream cone was.
In this case, the child (biological intelligence) has adeptly switched domains between flash cards and foods. Further, the child quickly recognizes the ice cream as a desirable food and acts accordingly.
This is the impetus for this research, to aggregate various singular classification devices into an overall classification scheme that can serve an automaton in a way that advances the state of the art towards the ultimate goal of parity with biological intelligence and performance.
Specifically it is desirable to construct a classification system that not only recognizes atomic cases in a domain, but that can also do a creditable job of recognizing the domain.
Problem Statement
To construct a self-calibrating or general purpose classifier, certain obstacles must be overcome. To begin with, there is the problem of prior knowledge. How can a classifier detect a domain in a meaningful way when there are so many possible domains of interest? One obvious solution to this is to utilize the many available projects that seek to codify common knowledge, such as the Cyc system. Using such a system it would be possible to analyze a set of inputs and compare the detectable/differentiable features to the known features of a set of domains. In effect, the first stage of the classification would be purely to detect a domain. It would not be problematic if the domain was detected as being 'new' or unknown, as this happens frequently with real systems (people). The specifics of the domain can be treated as an abstraction, either to the end of processing or until a domain is recognized. It is useful to know that there are certain domains that can be excluded. Further, just knowing the attributes of a domain allows for meaningful feature selection for the instances of that domain, even without knowing what the domain is. For example: a person can easily classify objects based on features without knowing what domain the objects belong to. Using only touch, and without any prior knowledge, a person can separate objects into furry and smooth categories, even if the objects in question come from the sets of garments and pets. The fact that it is necessary to have further features to detect the dual domains does not reduce the usefulness of the first pass classification.
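As a rough illustration of the first-pass idea above, the following Python sketch compares observed features against known domain profiles and reports 'unknown' when nothing matches well. The domain table, the feature names, the Jaccard overlap measure, and the 0.5 threshold are all assumptions made for this example; in the proposal the profiles would come from a common knowledge source such as Cyc.

```python
# Hypothetical domain profiles; in the proposal these would come from a
# common knowledge source such as Cyc, not from a hand-written table.
DOMAIN_FEATURES = {
    "garments": {"furry", "smooth", "has_sleeves", "has_buttons"},
    "pets": {"furry", "smooth", "warm", "moves"},
}


def jaccard(a, b):
    """Overlap between two feature sets, from 0.0 (disjoint) to 1.0 (equal)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_domain(observed, profiles=DOMAIN_FEATURES, threshold=0.5):
    """Return the best-matching domain name, or 'unknown' below the threshold."""
    best_name, best_score = "unknown", 0.0
    for name, features in profiles.items():
        score = jaccard(observed, features)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else "unknown"


def first_pass(objects):
    """Group objects by one touch feature without knowing their domain.

    For simplicity, anything that is not furry is treated as smooth.
    """
    groups = {"furry": [], "smooth": []}
    for name, features in objects.items():
        key = "furry" if "furry" in features else "smooth"
        groups[key].append(name)
    return groups
```

Note that first_pass never consults the domain table at all, which is the point of the garments and pets example: the furry/smooth split is useful before any domain has been recognized.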
Broadly, previous work in this area has concentrated in two areas: refinement of the accuracy of the classifying scheme, and selection of features. For the purposes of this proposal, it is assumed that the classifiers themselves are of acceptable utility and that the area of interest is in the selection of correct feature sets.
In preliminary experimentation, the bottleneck for classifying or identifying examples was generally found to be in the features appraised. While not perfect, the state of the art for classifying algorithms works to a success rate that would be useful in practice, provided the algorithm could be fed with meaningful and distinguishing features.
There have been attempts to automate the selection of features, but these typically are still domain specific. They tend to try to select particular features from a known/predetermined set of features. Generally these use some form of trial and error and statistical ranking of outcomes, but for the most part, there is no effort to discern and remember what the features themselves should be. While this constrained/predetermined method is acceptable and useful for many applications, it represents a source of overhead (in terms of setup and selection of features) as well as error or bias.
The contemplation and establishment of significant features and how to use them in classification is something that can possibly be automated quite well. In addition there is potential for increased performance and utility of application by having the feature determination process integral with the rest of the system.
Research Hypothesis
It is proposed to amalgamate existing classification schemes in such a way that the system can respond in a meaningful and useful way to any sequence of inputs. It will do this based on a layered progressive refinement of classification, both of domain and of example.
It will use a variety of resources/inputs including the input data stream, common knowledge databases, factual databases, classification algorithms, bespoke heuristics, bespoke gathering and discerning tools to detect meaningful features of the domain, and subsequently of examples.
Once a library of domain features was available, the refinement of domain feature knowledge could be systematized or even automated. The proposed construction of a classifier would be such that the refinement or clarification process is a distinct module and so could be refined on its own. It could start from manual adjustment, and later include a knowledge database, or perhaps even an outward query system. This, much like a child, would have a last ditch mode where it merely said it was confused and asked for help. This query itself could be multi-stage and automated, ranging from a halt and query to the operator up to, in more complex scenarios, queries to search engines, or data resources, or other modules in the automaton (for example, a task history list to establish a likely scenario). This is a very powerful aspect of the design as it allows the system to function even in very sub-optimal situations.
Even in a worst case scenario where the engine can not recognize anything reliably, and has to halt and query the operator for each example, it is doing two valuable things. One is that it is learning by each failure, and so presumably will not have to query so often going forward. The second is that it is presenting the operator with a classification request that is already digested. The digestion is somewhat situational, but generally the operator is being queried not just with a raw example, but rather with a list of features of that example. At least in some cases, this pre-digestion of the features will allow for a more speedy and reliable classification by the human operator. And again, this is in a worst case scenario where the classifier is handing off all of its work to the human operator.
In addition, this functional degradation means that the system is reliable. It can be counted on to work, even from the moment it is turned on, because it will always default to the operator query. Since the application domain for such a classifier is to replace a human operator, it can be turned on and 'taught' by the human until such point as it is working autonomously, or until it is determined that it needs the cooperation of a human operator. Either case is meritorious and of value.
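The degradation behavior described above can be sketched in Python as a classifier that always has the operator query to fall back on. This is only a toy under stated assumptions: the class name, the set-overlap confidence measure, the 0.6 threshold, and the ask_operator callback are all hypothetical and stand in for whatever the real engine would use.

```python
class FallbackClassifier:
    """Toy classifier that defers to an operator when it is not confident."""

    def __init__(self, ask_operator, min_confidence=0.6):
        self.ask_operator = ask_operator      # callback: feature list -> label
        self.min_confidence = min_confidence  # assumed threshold
        self.memory = {}                      # frozenset of features -> label
        self.queries = 0                      # how often the operator was asked

    def _guess(self, features):
        """Return (label, confidence) from the most similar remembered example."""
        best_label, best_score = None, 0.0
        for known, label in self.memory.items():
            union = known | features
            score = len(known & features) / len(union) if union else 0.0
            if score > best_score:
                best_label, best_score = label, score
        return best_label, best_score

    def classify(self, features):
        features = frozenset(features)
        label, confidence = self._guess(features)
        if label is not None and confidence >= self.min_confidence:
            return label
        # Worst case: hand the operator the digested feature list, then
        # remember the answer so the same query is not needed again.
        self.queries += 1
        label = self.ask_operator(sorted(features))
        self.memory[features] = label
        return label
```

With an empty memory every example goes to the operator, so the system works from the moment it is switched on, and the queries counter gives a simple measure of how far it has progressed toward running autonomously.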
Once established, it will be possible to refine or upgrade any of the modules involved as a distinct component, allowing for piecewise refinement and improvement as new features and technologies and resources become available.
Chapter 2. Background and Related Work
Background:
The background for this topic is broad, encompassing primarily the whole field of classification, whether statistical or procedural.
In addition, the work and investigation that is the subject of this proposal builds upon many other areas of research:
-Recognition
-Data Mining
-Modeling and Representation
-Autonomous Agents
-World Building
-Common Sense
-Human/Machine interactions
-Document Handling and Processing
-and others...
Because of the overarching scope of the background work, no specific items are cited. However: Please see the paper attached which documents an appraisal of the state of the art for document classification schemes (the paper and the investigation were a co-operative endeavor with Martin Kuehnhausen in the Spring of 2008).
Related Work:
The topic methodology proposed is expressly incorporational and high level. Because of this, most of the related works cited are representative of schemes to auto-select features, rather than of the incorporational approach proposed.
The following list of citations represents a cross section of research that is fairly closely related to the proposed research, and is also reasonably current. These papers were selected based on the distinction that the authors stated an intent to automatically determine features as a key attribute of the research.
They are presented here with their abstracts to give a feel for what is being done in the field of automatic feature selection, but again, there is no readily available literature on a multi-method approach as proposed here. Further, because it is the intent of the proposed work to explore this method, it is hard to concretely define what final form or arrangement this will take.

Chapter 3. Research Methodology
Design Overview
The proposed system is in fact an extension of an existing production system used for processing high volume document workflows at a mid-sized publishing company. Because of the complexity of the workflows and the competitive nature of the publishing industry, there is a motivation to have available document processing tools that can alleviate manual work wherever possible. Currently, the production system is limited in capability because it can only discern a very limited amount of information about a document and so has many breakout points where workflows have to be managed by human operators. These breakout points are very severe bottlenecks.
The topology of the document system consists of eight major conceptual data stores, over a dozen working centers each with several sub-units doing various types of processing, several dozen intertwined work and/or data flows, and over 500 unique computers, over 50 of which are process/control servers of various types and a similar number of which are devoted to storing and managing the stores of data.
Typical live production data at any given time is approximately ten terabytes across several hundred thousand files. Daily throughput can range from 100 GB to over a TB per day. Input data presents itself at any of about two dozen input routes ranging from auto-submission portals (web pages) that fully manage the process, all the way down to handwritten manuscripts that have to be manually keyed. Further, any given 'job' or article, as they are referred to generically in house, can consist of anywhere from 1 to over 100 separate files (text, tables, images, fonts, and so on) and typically these files arrive at separate times through separate input routes.
Work done ranges from organization, storage, archiving, and manipulation of the files themselves, content modification (proofreading, editing, typesetting, format conversion and so on), delivery in various formats (print, on-line, ebooks, reprints, galleys, author proofs, revisions, and so on), artwork preparation and creation, design and layout, issue makeup and other minor services. -- In addition, there is a push to provide more and more value added services.
Every one of the items listed in the previous three paragraphs involves multiple human bottleneck stages, and every one of those bottlenecks that can be sped up or eliminated represents a major savings to the company. There has been an initiative for almost ten years to automate processes, but most of the avenues involved have stagnated or plateaued due to a 'content barrier'. A point is reached where existing automations can not discern between cases when dealing with an object (file).
Example situations: Which article does this table go with? Which issue does this article belong with? Where are all the files that need to go to the pressroom for this issue? What is the author's name in this article? What language is this article in? Are all the references in this article valid and correctly tagged and formatted?
Because of this similarity of cause for many of the barriers to progress, a need for a system that can provide automated understanding of the content of documents (files) has been established. There have been preliminary prototypes developed that use various techniques to provide this content understanding in limited domains, but of greater use would be a generic system that pre-processed all files both in terms of content and in terms of placement and belonging in the overall data structure of the company.
To this end, a specification has been laid out for a hypothetical file ingest process. This process would take all files incoming to the organization and immediately normalize and XML tag each text component, and then link together documents based on linking features that are available in the content of the documents (URLs, names, dates, captions, citations, references, headings, and so on). Most of the automation for this is in place and in production; however, key steps are still done manually, and because of this, the process is slow, inconsistent and error prone.
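To make the linking step concrete, here is a minimal Python sketch that pulls a few kinds of linking features out of normalized text and pairs up documents that share one. The three regular expressions are simplified, hypothetical patterns written for this example (they are not the production rules), and the file names in any usage are made up.

```python
import re
from collections import defaultdict
from itertools import combinations

# Simplified, hypothetical patterns; production rules would be far stricter.
PATTERNS = {
    "url": re.compile(r"https?://[^\s)>\]]+"),
    "doi": re.compile(r"\b10\.\d{4,9}/[^\s)>\]]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def linking_features(text):
    """Return the set of (kind, value) linking features found in text."""
    found = set()
    for kind, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            # Drop trailing sentence punctuation picked up by the pattern.
            found.add((kind, match.rstrip(".,;")))
    return found


def link_documents(documents):
    """Map each pair of document names to the linking features they share."""
    index = defaultdict(set)              # feature -> names of documents
    for name, text in documents.items():
        for feature in linking_features(text):
            index[feature].add(name)
    links = defaultdict(set)
    for feature, names in index.items():
        for pair in combinations(sorted(names), 2):
            links[pair].add(feature)
    return dict(links)
```

Questions such as 'which article does this table go with?' then reduce to looking up the pairs that include the table's file name, at least for files that carry a shared identifier.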
The target of this proposed research is to develop and test a working document comprehension component that will slot into the automation.
This component, as described above, will consist of one or more production machines dedicated to 'reading' incoming documents, tagging them as needed, and labeling/identifying each in terms of its identity (journal, issue, etc.). To do this, the component needs to be able to consistently parse an arbitrary digital file to a highly granular level.
Initial discrimination is fairly easy: segregate out image files, artwork files, and other media files, leaving only text. There are existing tools that perform this function very reliably.
The second discrimination acts upon only the text based files. Although these can come in many formats, again, there are already tools in place to normalize any input text document (down to bare text).
The third discrimination requires understanding of content, and is the focus of this research. Given an input set or stream of text documents, how to differentiate between tables, text, math formulae, and garbage? However, this stage actually turns out to be made much easier by pushing it later in the process, after the content analysis.
The fourth stage is the content analysis. Here, the detection is done stage-wise at increasing levels of granularity: first broken apart by major component (paragraph), then by subcomponent (sentence), and then by atoms (words and punctuation). At each of these stages, there is a need for a classification and an identification. However, the arbitrary nature of the inputs requires the robustness, generality, and adaptability that was described in the introduction.
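The three levels of granularity can be sketched with a few regular expressions in Python. This is a deliberately naive illustration, assuming the text has already been normalized to bare text with blank lines between paragraphs; the function names are invented for this example, and the sentence rule will break on abbreviations such as 'et al.' that a real component would have to handle.

```python
import re


def split_paragraphs(text):
    """Split normalized text into paragraphs on one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence split: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


def split_atoms(sentence):
    """Split a sentence into words and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def segment(text):
    """Return the text as nested lists: paragraphs > sentences > atoms."""
    return [
        [split_atoms(s) for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]
```

Each level of the nested result is a natural place to attach a classification or identification, which is where the component classifiers would plug in.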
Ultimately, the real test for this component is simple. Does it work? Working in this context would mean that it can process, sort, and identify documents at a rate that surpasses the group of human operators by a wide enough margin to justify its development cost. However, there are subtests needed during the development to determine optimum configuration, as well as for monitoring once the system is live. We have the luxury of being able to implement alongside the existing workflow. What this means is that we can have the output of the component parallel the manual component, and they can be compared by the review/proofreading/checking system that already functions for the manual process. This means that the apparently trite beginning to this section is in fact a fairly robust test.
Chapter 4. Research Plan
This section describes both historical and proposed sections of the system. It will hopefully provide a detailed description of the existing tools and structures, as well as how the proposed new component interacts with the existing components. In addition, it will provide a description of the research and evaluation that has already been done, along with details of what is still left to be done.
Tasks and Subtasks
<prior components> Here is a listing of existing software tools and components of the current production system with a brief description of each, and also how it relates to the proposed work:
Ingest portals:
-Web page drop boxes
-Ftp uploads
-MassTransit
-Mail (disc, flash-drive, etc.)
-Manuscript, OCR, keying
Normalization filters:
-task automations (simple formatting changes, file formats, etc.)
-supervised automations (copyediting macros, document markup macros)
-manual copyediting (copyeditors)
Conversion filters:
-format conversions (input to text, text to xml, xml to html, etc.)
-formatting automations (auto-markup, supervised markup, etc.)
Data management automations:
-dispatching
-archiving
-transferring
-identification routines
-automatic document merging
-garbage collection and file expiration
-automatic document linking (DOI's, reference, hyperlinks)
-automatic linkage checking and verification and buffering
Document creation tools:
-Semi automatic formatting devices
-General purpose page layout, editing, and formatting applications (i.e. word processor, page layout, etc.)
-Constrained tools (i.e. only allow certain modifications to documents to prevent/reduce errors)
<research and evaluation> New work to be done can be broken into three broad categories:
First: acquire, develop, evaluate and implement a complementary set of detection, classification, and identification tools. Second: integrate these tools together using an overall feedback system that continually provides feedback to each component so that it can run optimally. Third: integrate the component into the existing systems.
The design of this system is largely imposed, somewhat organically, by the inherent constraints of the environment it is being developed and will be used in (see prior chapter coverage of the design). As a brief review: In general, this will be a set of component 'engines' each of which does a focused classification or identification. Each will be presented with a digested and appropriate (set of) input(s) and its result will be incorporated into a controlling system that integrates all the component inputs. Probably this will take the form of a weighted voting system, or perhaps in some situations there will be single source tools used. In all cases, the overall management of the system will be driven by manually configured and adjustable heuristics.
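A weighted voting controller of the kind mentioned above might look something like the following Python sketch. The component names, the starting weights, and the simple add-or-subtract update rule are all assumptions for illustration; the proposal leaves the actual controlling heuristics to manual configuration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine component labels into one label by summing component weights.

    predictions: component name -> predicted label
    weights:     component name -> non-negative weight
    """
    totals = defaultdict(float)
    for component, label in predictions.items():
        totals[label] += weights.get(component, 0.0)
    # Ties are broken alphabetically so that the result is deterministic.
    return max(sorted(totals), key=lambda label: totals[label])


def update_weights(weights, predictions, truth, rate=0.1):
    """Feedback step: reward components that were right, penalize the rest."""
    updated = dict(weights)
    for component, label in predictions.items():
        current = updated.get(component, 0.0)
        if label == truth:
            updated[component] = current + rate
        else:
            updated[component] = max(0.0, current - rate)
    return updated
```

A single source tool is the special case where one component carries all the weight, so both arrangements mentioned in the text fit the same interface.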
There is not much of a concrete nature to put here. The system will be a set of classifiers and control mechanisms working on a copy of production data. It can all live on a single machine, with an internal store of data. The only external connections will be with one or more human trainer/operators, and possibly with external reference data (this could be external common sense data stores, or internet resources such as for URL verification). In essence, the system will be self contained in much the same way, and to a similar degree, as a human counterpart (a proofreader, for example).
Stage-wise, much of the utility and convenience programming and set up is already in place. Additionally, most of the component classifiers have been studied and need only to be installed on the development environment. The bulk of the remaining implementation then consists of new development. This is uncertain in detail, but will in general take the form of refining interaction between the components, refining the process of sub-evaluation and feedback to the components, and providing meaningful and useable interaction with the trainer/operator(s).
In practical terms, this development will be done on a straightforward LAMP system, using Perl, PHP, and Python in various applications (for expediency, as there is an extant set of tools using these environments). There exists a disparate set of data stores, but for convenience this is/will be ported to a unified data store, most likely a MySQL database (again for expedient reasons). Because of cultural motivations, interfacing will be handled via a web-server/browser interface. This is the system familiar to the workforce in place, and there is also an existing set of tools to be drawn from as time saving resources.
Given that the work is being done in a context and with a target, it makes sense to use examples and sets culled from this domain. As a practical matter, testing and development will be done using a simplified and/or reduced sample size. Further during testing and evaluation it will be necessary to establish a fixed set of training and testing data sets. This is feasible through a simple 'snapshot' of the production data copied onto an offline data repository. Given this resource (already in place) it is a matter of configuration only to generate various sizes and makeups of train/test data. There are already scripts in place to randomly or heuristically generate groups of files to meet certain requirements for development and test purposes.
For evaluation of component parts, as one of the interim steps described above, a standard test methodology will be used, taking input and test and validation data from a common pool and dividing examples at random into the respective groups. Since this is not the ultimate test of this system, and this stage of testing is merely a part of the development process, the rigor and completeness of these test sets will not be elaborated further here other than to say it will be, of necessity, valid and fair.
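The random division of a common pool into groups could be done along the lines of this Python sketch. The 60/20/20 fractions, the fixed seed, and the function name are assumptions for the example and are not taken from the existing scripts.

```python
import random


def split_pool(examples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle a pool and divide it into train, test, and validation lists.

    The fractions and the fixed seed are illustrative assumptions; a fixed
    seed keeps the split reproducible across runs on the same snapshot.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    pool = list(examples)
    random.Random(seed).shuffle(pool)
    n_train = int(len(pool) * fractions[0])
    n_test = int(len(pool) * fractions[1])
    train = pool[:n_train]
    test = pool[n_train:n_train + n_test]
    validation = pool[n_train + n_test:]   # remainder absorbs rounding
    return train, test, validation
```

Because the snapshot of production data is fixed, running the same split with the same seed always yields the same groups, which keeps results comparable from one development round to the next.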
Once the system is performing to an apparently satisfactory and complete degree, the overall evaluation will take place. Broadly, this will be structured in the following manner:
The system will be test run, using live operators as necessary for the training and guidance, until it has reached an observably stable state. At this point, the live operators will be replaced or constrained by an answer list. This is to control variability of inputs and so remove any ambiguity or potential bias or invalidity of the results. What this means is that the system will not receive further training or hints or biases during the evaluation stage. Because there is a component of external query to the system, this component will be kept active, but rather than getting potentially useful and variable learning inputs, it will instead receive most likely data. What this means is context dependent and is unknown at this time, but generally it will take the form of a most average value, or most common value, or other best neutral value for the situation at hand. For illustration, a typical case would be a misspelled word causing the engine to become uncertain. In a live situation, an operator might read the surrounding context, infer the problem, and 'explain' it to the system. During evaluation, however, the input provided might be the result of a spellcheck on the word in question, or the word might be substituted by the most frequent word in the document, or by the most similar word in the document. In some cases the system could be denied feedback altogether, by having the operator signify that the operator did not have an answer either. Which of these scenarios is most proper will require evaluation of the system in its complete state.
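The neutral substitutes mentioned in the misspelled word illustration can be sketched in Python using only the standard library. The strategy names and the function names are invented for this example, and difflib's similarity ratio is just one possible stand-in for 'most similar word'; a spellcheck strategy is omitted because it would need an external dictionary.

```python
import difflib
from collections import Counter


def most_frequent_word(words):
    """Return the most common word in the document (first seen wins ties)."""
    return Counter(words).most_common(1)[0][0]


def most_similar_word(unknown, words):
    """Return the document word closest to `unknown`, or None if none is close."""
    candidates = sorted(set(words) - {unknown})
    matches = difflib.get_close_matches(unknown, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


def neutral_answer(unknown, words, strategy="similar"):
    """Stand in for the operator during evaluation with a fixed policy."""
    if strategy == "similar":
        return most_similar_word(unknown, words)
    if strategy == "frequent":
        return most_frequent_word(words)
    if strategy == "deny":
        return None   # operator signals that no answer is available
    raise ValueError(f"unknown strategy: {strategy}")
```

Since every strategy is deterministic for a given document, the answer list stays fixed across evaluation runs, which is what removes the operator as a source of variability.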
Once properly configured, the evaluation will have four broad questions to answer:
1) How does the system perform compared to selected standard systems? This will probably be somewhat elaborate, as it will make sense to compare to a number of possible alternatives, i.e. commercial products, statistical methods, simple methods. To this end, it will probably be best to merely decouple each component classifier and use its results as a comparison, thereby identifying whether or not the aggregation has value, or whether it is merely one component that is in fact working well and overwhelming the others.
2) How does the system perform compared to the existing system, or comparable systems? This can be easily answered by recording the elapsed time for the system to process a known set of inputs, and compare that to the known timetable for those inputs as processed through the current production environment. What this method lacks in impartiality, it makes up for in its very well understood and reliable values. In both the original, and evaluated examples the metrics are unambiguous stopwatch figures.
3) How does the system perform in an absolute sense? How does it compare to a skilled human in terms of speed, accuracy, and so on? This question will be difficult to answer in a general way, but there are some metrics of value. One would be a set of evaluations used to score humans. Examples might be performance evaluation criteria for proofreaders or copy editors. The performance on these criteria for the human population (of proofreaders and copyeditors) is known and available. It would only remain to have the system process these same evaluation data sets and to score the results. Further, depending on the perceived value of such an evaluation, it is within the realm of reasonableness to conduct a Turing test style judged evaluation. It would be a simple matter to have a person, or set of persons, perform a set of tasks and at the same time have the system do likewise. The results could be re-transcribed by third parties so that the resulting work would be indistinguishable (in terms of format). These results could then be evaluated by experts for their completeness, skill and so on.
4) How does the system perform in a practical sense? What are the costs? How well does it perform relative to the capital resources it requires, especially compared to the alternatives available?
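For the first question, the decoupled comparison could be scored with something like the Python sketch below, which reports the accuracy of each component on its own next to the accuracy of the combined result. Simple majority voting is used here only as a stand-in for whatever aggregation the finished system uses, and the component names and labels in any usage are made up.

```python
from collections import Counter


def accuracy(predicted, truth):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def majority_vote(component_outputs):
    """Combine per-component label lists by simple per-example majority."""
    combined = []
    for labels in zip(*component_outputs.values()):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


def compare_to_components(component_outputs, truth):
    """Score every component alone and the aggregate, for side-by-side review."""
    scores = {name: accuracy(out, truth) for name, out in component_outputs.items()}
    scores["aggregate"] = accuracy(majority_vote(component_outputs), truth)
    return scores
```

If the aggregate score is no better than the best single component, that would suggest one component is doing the work and overwhelming the others, which is exactly the case the first question is meant to detect.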
It is expected that this system will actually perform quite badly relative to most of these metrics, and so final assessment will take a general form of XX% of 'real' systems (humans and tools). This number will then have to be considered in light of the potential gains in speed, repeatability, reliability, consistency, training costs, expandability, accountability and so on.
Documentation and write-up of this research will consist mainly of two parts: a detailed description of the working components and their configuration and interaction, and a description of the performance capabilities of the working system. In addition there will be sections devoted to describing the component parts selected, and why each component was chosen as compared to approaches not selected. Further, there will be a 'user manual' component that provides description and instruction on the setup, training, usage and maintenance of the system. This will provide a practical resource to the users and testers, but will also provide insight into the system in terms of evaluation.
In terms of progress and logging, the project will carry with it a periodic update log. This log will contain periodic scheduled status reports (weekly) and will also have major milestones, changes, disruptions, and design breakthroughs included in it. This will provide an ongoing feedback amongst development personnel, as well as to oversight entities.
Major milestones:
-Initial setup of test environment
-Initial setup of development environment and tools
-Initial selection of classifiers and discriminators (components)
-Evaluation and relative performance metrics on various combinations of components
-Design and evaluation of control mechanisms (scripting)
-Design and evaluation of controlling heuristics
-Design and evaluation of metrics and feedback systems used to modify behavior of components
-Initial live training phase
-Initial live evaluation for stability
-Controlled evaluation phase
Preliminary Work <TBD>
Schedule <TBD>
Chapter 6. Resources <TBD>
Chapter 7. Trials <TBD>
Chapter 8. Conclusions <TBD>
Summary
Contributions
Limitations
Future Work
References <TBD>

1 comment:

  1. I suppose it bears mentioning that since this, I've been chronically unemployed and unemployable.