KET: A Non-formal Essay on Motivation and Ideas
(Revised on April 13, 2006)
In every good detective story there is a subtle moment when the investigator notices an almost invisible discrepancy in the evidence, which later gives a clue to the puzzle. If you can spot it and unwind it to the solution, you are Hercule Poirot or Miss Marple. If you design a program to do that, you are probably creating something like KET.
1. Mystery of Data Analysis
The seventies were fruitful years for data analysis and AI. I was lucky to participate in many interesting projects, in the USSR and, since 1990, in the US. Some of them, like the "Bronchial Asthma" research under the WHO, were really big, with rich data sets involved and challenging goals.
The most intriguing to me were cases where I joined the team at a later stage and the project was in bad shape for non-obvious reasons. The team was seemingly well prepared, the chosen approach looked reasonable and the available tools adequate, but the results of analysis were discouraging, although the field experts' opinion of the data was unanimously high.
It was interesting to participate in discussions about possible rescue approaches. Having a mathematical background and an interest in "real-life" applications, I was not satisfied with ad hoc fixes and tried to find typical reasons for the difficulties. In essence, it was all about understanding new challenges not captured by the formal schemas we had been using so far. Initially, I started collecting approaches for troublesome cases, but over the years they grew into a point of view on what analysis means.
It was the time when the major paradigm shifts visible today in programming, data analysis and AI were emerging.
2. The Crime Scene, Relevance and Assumptions
Paradoxically, the more confident the analysts were about the way to go, the worse the final results. I am talking, of course, about the prevailing assumptions that exist in each field of application. In really tricky troubleshooting cases, you quickly realize that the right way is probably far from the standard one.
Another thing I quickly noticed was the very small portion of relevant data in typical sets. For example, in the "Bronchial Asthma" project we had access to about 1200 variables, and only 12 of them (i.e., 1%) were found relevant and worked successfully. Generally, having only 5% of relevant material is not unusual.
I saw two puzzles here. One was how to identify the necessary material early. The other was how to get along with overconfident field experts, who could not solve the problem but had strong opinions about relevance.
The first question led me to approaches that handle relevance issues at the very beginning, long before any model is suggested. The second radically changed my perception of what the analyst should analyze. Over time I became more and more convinced that data and expert opinions should be considered on an equal footing, and that both should be suspects. From then on, I discussed with clients "the object of investigation", which included all collected evidence regardless of its source.
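To make the idea of early, model-free relevance handling concrete, here is a minimal sketch. It is not KET itself: the use of a histogram mutual-information score, the variable names and the threshold are all illustrative assumptions, but the spirit is the same: rank the evidence by its dependence on the target before committing to any model.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of mutual information between two variables."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def screen_relevant(data, target, threshold=0.2):
    """Rank candidate variables by dependence on the target,
    before any particular model is chosen. Threshold is illustrative."""
    scores = {name: mutual_information(x, target) for name, x in data.items()}
    return sorted((n for n, s in scores.items() if s > threshold),
                  key=lambda n: -scores[n])

# Toy data: one informative variable hidden among noise.
rng = np.random.default_rng(0)
target = rng.normal(size=500)
data = {"noise_a": rng.normal(size=500),
        "signal": target + 0.3 * rng.normal(size=500),
        "noise_b": rng.normal(size=500)}
print(screen_relevant(data, target))  # "signal" should come out on top
```

With a 1200-variable set like the one above, a screen of this kind shrinks the circle of suspects long before anyone argues about which model to fit.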
The picture now looked like a true detective story. We had a mystery and a lot of contradictory evidence, but no clue as to which part of it would lead to the solution. And, as in many detective novels, the obvious answers were misleading.
The major pitfall I tried to avoid was the strong temptation to resort very early to popular data descriptions, like frequently used analytical models, neural nets, etc. They may do a wonderful job, but they may also put you on the wrong track. Instead, I built a battery of tests to check general properties of models and, step by step, narrowed the circle around the right one. After making this "model round-up" a habit, I started using the term "model-free approach" (or agenda) as a slogan for a cautious, well-tested attitude to model selection. This relates to an interesting special seminar organized in St. Petersburg, which is worth mentioning here.
I used to spend considerable time consulting engineers. For a staff member in the math departments of the Electrical Engineering Institute, then the Polytechnic Institute, and later the Institute of Informatics (Academy of Sciences), it was quite common. My strong overall impression was that a big portion of the problems brought to me had been triggered by my fellow mathematicians during their own consulting. It is easy to convince your clients to accept a model convenient for the analyst if they do not clearly see the difference between the options.
After talking with many of my colleagues, I saw a lot of interesting projects badly distorted by "quick-and-easy" fixes. At some point I convinced friends to organize a special free seminar for engineers and mathematicians. There we could hear about the needs of our potential clients and collectively discuss "non-violent" approaches. Needless to say, the potential consumers of analysis were enthusiastic, and to our satisfaction some mathematicians were also very interested in hunting for "essential" models.
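The "model round-up" mentioned earlier can be sketched in a few lines. The particular candidates and the cross-validation scheme below are my illustrative assumptions, not KET's actual battery of tests; the point is only that several model families with different assumptions are compared on held-out data before any of them is declared the right one.

```python
import numpy as np

def cv_error(model_fit, x, y, folds=5):
    """Cross-validated squared error of a candidate model family."""
    idx = np.arange(len(x))
    err = 0.0
    for k in range(folds):
        test = idx % folds == k
        predict = model_fit(x[~test], y[~test])  # fit on the training fold
        err += np.mean((predict(x[test]) - y[test]) ** 2)
    return err / folds

# A small battery of families, from strongest to weakest assumptions.
def fit_constant(x, y):
    m = y.mean()
    return lambda xs: np.full(len(xs), m)

def fit_linear(x, y):
    a, b = np.polyfit(x, y, 1)
    return lambda xs: a * xs + b

def fit_nearest(x, y):
    xs_, ys_ = x.copy(), y.copy()
    return lambda xs: ys_[np.abs(xs[:, None] - xs_[None, :]).argmin(axis=1)]

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 300)
y = x ** 2 + 0.1 * rng.normal(size=300)  # the ground truth is non-linear
battery = {"constant": fit_constant, "linear": fit_linear, "nearest": fit_nearest}
scores = {name: cv_error(f, x, y) for name, f in battery.items()}
print(min(scores, key=scores.get))  # the round-up points away from the linear model
```

A convenient model imposed early (here, the linear one) would lose quietly; the round-up makes it lose visibly, which is exactly what a "non-violent" consultation needs.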
3. Witnesses Changing Their Minds. Understanding Framework
The intention to treat relational data and expert knowledge on an equal footing led to a unified representation of these two types of information. I wanted to go further than knowledge acquisition specialists usually do when they squeeze expert input into a questionnaire-style relational schema. At least in the beginning, knowledge should be represented as experts originally prefer to deliver it: as a set of statements.
The whole conceptual framework of analysis changed noticeably. Instead of the data set, the "object of investigation" (OOI) played the central role in the picture. The notion of a variable was understood as a type of request, or measurement, as a physicist would see it.
In this context, such objects as "virtual databases" seemed very natural. My favorite situations were those in industrial environments where the user had the complete illusion of working with a relational database, while in reality the queries searched for non-existing values. The queries were commands to measure something, so that data values were collected upon request.
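The mechanism is easy to sketch. The class and callback names below are hypothetical, invented for illustration (this is not KET's actual interface); they only show how a query against a value that does not exist yet can itself become the order to measure it.

```python
class VirtualTable:
    """Looks like a table of stored values, but a cache miss triggers
    a measurement. `measure` is a hypothetical callback, e.g. a command
    sent to an instrument on the factory floor."""

    def __init__(self, measure):
        self._measure = measure
        self._cache = {}

    def query(self, unit, variable):
        key = (unit, variable)
        if key not in self._cache:
            # The value does not exist yet: the query *is* the measurement order.
            self._cache[key] = self._measure(unit, variable)
        return self._cache[key]

# Toy "instrument" that pretends to measure on demand and logs each call.
log = []
def fake_sensor(unit, variable):
    log.append((unit, variable))
    return 21.5

db = VirtualTable(fake_sensor)
print(db.query("reactor-3", "temperature"))  # measured now, then cached
print(db.query("reactor-3", "temperature"))  # second query hits the cache
print(len(log))                              # the sensor was called only once
```

From the user's side both queries look identical; only the first one actually disturbed the physical world, which is precisely the illusion described above.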
The unified approach to data and expert knowledge naturally forced me to treat critically not only parts of the database but the expert evidence as well. Knowledge had to be corrected, and even the original task could be criticized. Experts may be surprised by strong evidence disproving their opinions. The set of initial goals might be found contradictory or impossible. And this was certainly a very interesting part.
In one big medical diagnostic project, the whole idea was to transfer skill from a highly qualified team of specialists. Obviously, this could help in automation support and would make good-quality healthcare available to a broader circle of patients.
An interesting moment in the analysis came when the differences between automatically derived patterns and expert descriptions were under scrutiny. It was decided to submit the patient histories where the differences were significant for additional consideration by the specialists. The experts were not aware of the reason those cases were given to them; we did not want to introduce any bias into their judgment. The result was quite instructive: the shift of opinions in favor of the discovered patterns was considerably higher than purely statistical reasons would explain. The sheer focus on a certain subset led to a new understanding.
Many times afterwards I witnessed such an "understanding gain" and was always fascinated by the conceptual shifts triggered by feedback from analysis. Actually, with proper attention to such phenomena, we can view conceptual refinement as one of the essential goals. In a well-organized "clarification loop", experts gradually become more open to alternatives.
Partially for fun, I started calling the collection of corresponding techniques "the UFO approach" (which stands for Understanding FOrmation).
4. The Truth is Out There
The approaches discussed here focus the analyst on structural issues. Not rarely, this included the discovery of hidden structures in the data, which sometimes surprised the customer. In the most intriguing scenarios, it pushed the analysis beyond the initially given information.
In one example of that sort, the analysis was done for an environmental inspection agency and concerned pollution in a big river basin. The goal was to discover which industrial and agricultural units (among the many along the river) were guilty of violating environmental regulations.
As you can guess, the enterprises on the list of suspects were not very cooperative in my investigation. I had to judge on the basis of measurements performed not where I wanted them, but only at certain available locations. And, of course, the river flows and everything gets mixed.
After the analysis, when I presented a schema of interdependent events showing what causes what, my clients were puzzled. It looked like events down the river caused changes in pollutant concentration up the river. "You are mistaken," they said. "It's absolutely impossible; you have to redo the analysis."
What could I say? I agreed, but suggested broadening the investigation and clarifying the context of each questionable event. It turned out that, by the logistics of the local environmental agency, every time the measurements of pollutants down the river were out of the normal range, inspectors called the agency, and the latter routinely ordered cleaning measures up the river. The puzzle was solved. It was not a physical force mysteriously acting against the flow. It was dependence by phone.
It is not an exotic case. Viewing the analyzed object as an open system is quite useful. Maybe it is even a characteristic of our time. We observe paradigm shifts in many fields now: in programming, the transition from object-oriented to agent-based systems; in data management technology, the change from the storage-centered approach to a communication and content focus (virtual and semantic bases); in analysis, the shift from closed to open systems. The latter means that the result can be not a pattern but a query. (It is like human interaction: when you ask a question, you may get a counter-question back.)
Once in a while, in the true spirit of Agatha Christie's novels, it goes like this. You have an industrial quality problem. You find a predictive pattern. It is close to the informational limits of the data set, and you are satisfied as an analyst. However, the clients still want it better. To please them, you study the parts of the set where your predictor is especially wrong. When you come up with a formal descriptor, a pattern emerges in which certain days of the week are represented in abundance. You request all available information about those days beyond the initially given set. You find that in most of the unfortunate cases only two particular people were on duty for maintenance, and very soon you detect negligence in data collection. And, as always in life, some people are happy and some are not.
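The first half of that investigation, studying where the predictor is especially wrong and noticing which days of the week are over-represented there, can be sketched on synthetic data. Everything below (the planted "sloppy days", the 90th-percentile cutoff, the 1.5 lift threshold) is an illustrative assumption, not the actual industrial case.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
n = 1000
day = rng.integers(0, 7, n)          # day of the week for each production run
x = rng.normal(size=n)
y = 2 * x + 0.2 * rng.normal(size=n)
bad_shift = np.isin(day, [5, 6])     # hypothetical: sloppy data collection on two days
y[bad_shift] += rng.normal(0, 2.0, bad_shift.sum())

# Step 1: fit the predictor and look only where it is especially wrong.
a, b = np.polyfit(x, y, 1)
residual = np.abs(y - (a * x + b))
worst = residual > np.quantile(residual, 0.9)

# Step 2: compare day-of-week frequencies among the worst cases vs. overall.
overall = Counter(day)
bad = Counter(day[worst])
lift = {d: (bad[d] / worst.sum()) / (overall[d] / n) for d in range(7)}
suspects = [d for d, r in sorted(lift.items(), key=lambda kv: -kv[1]) if r > 1.5]
print(suspects)  # the two planted days should stand out
```

The pattern itself explains nothing; it is the resulting query ("what else happened on those days?") that leads beyond the given data set, in this story to the two people on duty.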
5. The Chase is not over
Not surprisingly, a big portion of my effort went into the design and implementation of a concrete software framework and tools to support model-invariant assessment of relevance, the search for hidden structures, and the open system paradigm. Interesting consequences of that were a clearer understanding of self-properties (like the ability of software to describe, evaluate and modify itself) and their easier implementation. And, of course, the agent technology paradigm, currently recognized as a good conceptual ground for unifying AI, was in very good correspondence with all the approaches mentioned here.
On the application side, I found that tools for structural discoveries also play well in conjunction with highly specialized third-party tools. For example, in image analysis you may initially have only a vague idea of a successful analysis architecture and the relevant tools. After some structural clarification of your task and the available data, you sometimes reduce the whole job to a series of low-level operations covered by well-known tools. Needless to say, the best way to go after that is to use your favorite utilities. In this schema, KET plays the role of data preprocessor, selector of tools and advisor.
It would not be fair to my previous experience if I mentioned only these serious matters. A lot of quite entertaining things constantly come out of the experiments. In multi-media systems, where relational data, images and natural language pieces are mixed, agents behave unexpectedly, like intelligent but unusual creatures. They may vary in sophistication from game-style primitive personages to "personalities" in the style of Commander Data. They may interact among themselves or even criticize their creator. But I should reserve that topic for another article, or I will never finish this one.
(C) 2006, Knowledge Extraction Tools, LLC