This special issue of Fundamenta Informaticae contains extended versions of 10 articles selected from 64 papers accepted for presentation at the Twelfth International Symposium on Methodologies for Intelligent Systems (ISMIS) in Charlotte, North Carolina, USA, October 11-14, 2000. The ISMIS symposia cover a broad scope of intelligent systems such as evolutionary computation, intelligent information retrieval, knowledge representation and integration, learning and knowledge discovery, logic for AI, and soft computing.
The chosen articles fall into the category of intelligent information systems but their contributions are made from quite diverse perspectives: theory refinement, machine learning, knowledge discovery and data mining, multidimensional data analysis, integration of classifiers, information retrieval, heterogeneous database querying, and multimedia digital libraries.
The first article by S. Kramer, G. Widmer, B. Pfahringer, and M. De Groeve is devoted to the problem of learning to predict ordinal classes using classification and regression trees. They start with S-CART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the trade-off between optimal categorical classification accuracy and minimum distance-based error.
In the second article, F. Esposito, N. Fanizzi, S. Ferilli and G. Semeraro described a framework for theory refinement operators pursuing the efficiency and effectiveness of learning regarded as a search process for theory refinement. A refinement operator satisfying these requirements is formally defined as ideal. Past results have demonstrated the impossibility of specifying ideal operators in search spaces where standard generalization models, like logical implication or the -subsumption, are adopted. By assuming the object identity bias over a space defined by a clausal language ordered by logical implication, a novel generalization model, named OI-implication, is derived. The authors prove that ideal operators can be defined for the resulting search space.
In the third article, T. Elomaa and J. Rousu probe the inherent complexity of the multisplitting problem, a potential time-consumption bottleneck in machine learning and data mining. They utilize results obtained for similar problems in computational geometry and string matching to the multisplitting task. They show by counterexamples that the widely used evaluation functions Training Set Error and Average Class Entropy do not satisfy the kind of monotonicity that facilitates subquadratic-time optimization. They also show that the Training Set Error function can be decomposed into monotonic subproblems, one per class, which explains its linear time optimization. Finally, they review recently developed techniques for speeding up optimal multisplitting.
The fourth article by M.S. Hacid and F. Toumani considers how constraint-based technology can be used to query semistructured data. They present a formalism based on feature logics for querying semistructured data. The formalism is a hybrid one in the sense that it combines clauses with path constraints. The resulting language has a clear declarative and operational semantics.
In the fifth article, S. Park and W. W. Chu present techniques for discovering and matching rules with elastic patterns. Elastic patterns are useful for discovering rules from data sequences with different sampling rates. For fast discovery of rules whose heads and bodies are elastic patterns, they construct a trimmed suffix tree from succinct forms of data sequences and keep the tree as a compact representation of rules. The trimmed suffix tree is also used as an index structure for finding rules matched to a target head sequence. When matched rules cannot be found, the concept of rule relaxation is introduced. Using a cluster hierarchy and relaxation error as a new distance function, they find the least relaxed rules that provide the most specific information on a target head sequence.
The sixth article by S. Puuronen and A.Tsymbal addressed the problem of feature-space heterogeneity in multidimensional data. They introduce a technique that provides a strategic splitting of the instance space to identify the best subset of features for each instance to be classified. Their technique applies the wrapper approach where a classification algorithm is used as an evaluation function to differentiate between different feature subsets. In order to make the feature selection local, they apply the dynamic integration of classifiers to determine what classifier and which feature subset should be used for each new instance. In order to restrict the number of feature combinations being analyzed, they propose the use of decision trees. For each new instance, they only consider those feature combinations that include the features present in the path taken by the new instance in the decision tree built on the whole feature set. In their experiments, they use the C4.5 algorithm as the learning algorithm for base classifiers and for the decision trees that guide the local feature selection.
In the seventh article, M. Kim, J. S. Deogun, and V. V. Raghavan address problems in Rule Based Information Retrieval by Computer (RUBRIC). RUBRIC involves the use of production rules to capture user-query concepts. A set of related production rules is represented as an AND/OR tree, or alternatively by a disjunction of Minimal Term Sets (MTSs) .The retrieval output is determined by the evaluation of the weighted Boolean expressions of the AND/OR tree, and processing efficiency can be enhanced by employing MTSs. However, since the weighted Boolean expression ignores the term-term association unless it is explicitly represented in the tree, the terminological gap between users' queries and their information needs may still remain. To solve this problem, they adopt the generalized vector space model (GVSM) and the p-norm based extended Boolean model.
In the eighth article, C. Fernandes and L. Henschen propose a system whereby subtle semantic ambiguity found in queries of distributed heterogeneous database systems can be resolved by considering the user's intentions. Through the use of domain- specific knowledge embedded within a mediator-based architecture, subtleties in meaning can be explicitly modeled. Through the use of dynamic profiles and active dialogue, their system can discover user intent, providing more satisfying query answers.
The ninth article by E. Bertino, B. Catania, and G. P. Zarri propose a two-steps annotation approach by which conceptual annotations, represented in Narrative Knowledge Representation Language (NKRL), are associated with multimedia documents and used during retrieval operations. Metadata represent the vehicle by which digital documents can be efficiently indexed and retrieved in multimedia digital libraries. A relevant metadata function is created by superimposing some sort of conceptual organization over the unstructured information space proper to these digital repositories, in order to facilitate the intelligent retrieval of the original documents. The authors present how documents and metadata can be stored and managed on persistent storage.
The last article by A. Wieczorkowska describes the application of wavelet transform to the automatic classification of musical instrument sounds. Eleven instruments are described by a feature vector consisting of 152 attributes (parameters). Two classifiers are constructed using C4.5 algorithm and DataLogic/R. Finally, the correctness of the two classifiers using cross validation method is discussed .
E. El-Kwae, Guest Editor | Z. Raś, Guest Editor |
University of North Carolina at Charlotte, | University of North Carolina at Charlotte, |
Department of Computer Science 9201 | Department of Computer Science 9201 |
University City Boulevard, | University City Boulevard, |
Charlotte, NC 28223 | Charlotte, NC 28223 |
and | |
Polish Academy of Sciences, | |
Institute of Computer Science | |
01-237 Warsaw, Poland |
This paper is devoted to the problem of learning to predict ordinal (i.e., ordered discrete) classes using classification and regression trees. We start with S-CART, a tree induction algorithm, and study various ways of transforming it into a learner for ordinal classification tasks. These algorithm variants are compared on a number of benchmark data sets to verify the relative strengths and weaknesses of the strategies and to study the trade-off between optimal categorical classification accuracy (hit rate) and minimum distance-based error. Preliminary results indicate that this is a promising avenue towards algorithms that combine aspects of classification and regression.
A framework for theory refinement is presented pursuing the efficiency and effectiveness of learning regarded as a search process. A refinement operator satisfying these requirements is formally defined as ideal. Past results have demonstrated the impossibility of specifying ideal operators in search spaces where standard generalization models, like logical implication or q-subsumption, are adopted. By assuming the object identity bias over a space defined by a clausal language ordered by logical implication, a novel generalization model, named OI-implication, is derived and we prove that ideal operators can be defined for the resulting search space.
The need to partition or discretize numeric value ranges arises in machine learning and data mining algorithms. This subtask is a potential time-consumption bottleneck, since the number of candidate partitions is exponential in the number of possible cut points in the range. Thus, many heuristic algorithms have been proposed for this task.
Recently, the efficiency of optimal multisplitting has improved dramatically, due to the introduction of linear-time algorithms for training error minimization and quadratic-time generic algorithms. Whether these efficient algorithms are the best obtainable, is not yet known. In this paper, we probe the inherent complexity of the multisplitting problem.
We reflect results obtained for similar problems in computational geometry and string matching to the multisplitting task. Subquadratic optimization algorithms in computational geometry rely on the monotonicity of the optimized function. We show by counterexamples that the widely used evaluation functions Training Set Error and Average Class Entropy do not satisfy the kind of monotonicity that facilitates subquadratic-time optimization. However, we also show that the Training Set Error function can be decomposed into monotonic subproblems, one per class, which explains its linear time optimization. Finally, we review recently developed techniques for speeding up optimal multisplitting.
In this paper we consider how constraint-based technology can be used to query semistructured data. As many concerns in semistructured data (e.g., representing and retrieving) are also found in computational linguistics, this last area could provide an interesting angle to attack some of the problems regarding semistructured data. We present a formalism based on feature logics^{1} for querying semistructured data. The formalism is a hybrid one in the sense that it combines clauses with path constraints. The resulting language has a clear declarative and operational semantics based on the notion of extended active domain.
This paper presents techniques for discovering and matching rules with elastic patterns. Elastic patterns are ordered lists of elements that can be stretched along the time axis. Elastic patterns are useful for discovering rules from data sequences with different sampling rates. For fast discovery of rules whose heads (left-hand sides) and bodies (right-hand sides) are elastic patterns, we construct a trimmed suffix tree from succinct forms of data sequences and keep the tree as a compact representation of rules. The trimmed suffix tree is also used as an index structure for finding rules matched to a target head sequence. When matched rules cannot be found, the concept of rule relaxation is introduced. Using a cluster hierarchy and relaxation error as a new distance function, we find the least relaxed rules that provide the most specific information on a target head sequence. Experiments on synthetic data sequences reveal the effectiveness of our proposed approach.
Multidimensional data is often feature space heterogeneous so that individual features have unequal importance in different sub areas of the feature space. This motivates to search for a technique that provides a strategic splitting of the instance space being able to identify the best subset of features for each instance to be classified. Our technique applies the wrapper approach where a classification algorithm is used as an evaluation function to differentiate between different feature subsets. In order to make the feature selection local, we apply the recent technique for dynamic integration of classifiers. This allows to determine which classifier and which feature subset should be used for each new instance. Decision trees are used to help to restrict the number of feature combinations analyzed. For each new instance we consider only those feature combinations that include the features present in the path taken by the new instance in the decision tree built on the whole feature set. We evaluate our technique on data sets from the UCI machine learning repository. In our experiments, we use the C4.5 algorithm as the learning algorithm for base classifiers and for the decision trees that guide the local feature selection. The experiments show some advantages of the local feature selection with dynamic integration of classifiers in comparison with the selection of one feature subset for the whole space.
One of the essential goals in information retrieval is to bridge the gap between the way users would prefer to specify their information needs and the way queries are required to be expressed. Rule Based Information Retrieval by Computer (RUBRIC) is one of the approaches proposed to achieve this goal. This approach involves the use of production rules to capture user-query concepts (or topics). In RUBRIC, a set of related production rules is represented as an AND/OR tree, or alternatively by a disjunction of Minimal Term Sets (MTSs). The retrieval output is determined by the evaluation of the weighted Boolean expressions of the AND/OR tree, and processing efficiency can be enhanced by employing MTSs. However, since the weighted Boolean expression ignores the term-term association unless it is explicitly represented in the tree, the terminological gap between users' queries and their information needs may still remain. To solve this problem, we adopt the generalized vector space model (GVSM) and the p-norm based extended Boolean model. Experiments are performed for two variations of the RUBRIC model, extended with GVSM, as well as for the integrated use of RUBRIC with the p-norm based extended Boolean model. The results are compared to the original RUBRIC model based on recall-precision.
We propose a system whereby subtle semantic ambiguity found in queries of distributed heterogeneous database systems can be resolved by considering the user's intentions. Through the use of domain-specific knowledge embedded within a mediator-based architecture, subtleties in meaning can be explicitly modeled. Through the use of dynamic profiles and active dialogue, the system can discover user intent, providing more satisfying query answers.
Metadata represent the vehicle by which digital documents can be efficiently indexed and retrieved. The need for such kind of information is particularly evident in multimedia digital libraries, which store documents dealing with different types of media (text, images, sound, video). In this context, a relevant metadata function consists in superimposing some sort of conceptual organization over the unstructured information space proper to these digital repositories, in order to facilitate the intelligent retrieval of the original documents. To this purpose, the usage of conceptual annotations seems quite promising. In this paper, we propose a two-steps annotation approach by which conceptual annotations, represented in NKRL (Narrative Knowledge Representation Language) [20,21], are associated with multimedia documents and used during retrieval operations. We then present how documents and metadata can be stored and managed on persistent storage.
Contents-based searching through audio data is basically restricted to metadata, which are attached manually to the file. Otherwise, users have to look for the specific musical information alone. Nevertheles, when classifiers based on descriptors extracted from sounds analytically are used, automatic classification can be in some cases possible. For instance, wavelet analysis can be used as a basis for automatic classification of audio data. In this paper, classification of musical instrument sounds based on wavelet parameterization is described. Decision trees and rough set based algorithms are used as classification tools. The parameterization is very simple, but the efficiency of classification proves that automatic classification of these sounds is possible.
This is the third special issue of Fundamenta Informaticae devoted to the CONCURRENCY SPECIFICATION AND PROGRAMMING (CS&P) workshop, in succession to the second one published in August 2000. The volume contains a selection of papers presented at the meeting that took place in Berlin from 9 to 11 of October 2000. The papers have been chosen from 35 contributions accepted for presentation at CS&P'2000, following their additional evaluation and editorial treatment. A complete collection of the contributions has been published as Proceedings of this workshop by the Humboldt Universität zu Berlin, Informatik Bericht Nr.140. That continued the tradition of the former CS&P workshops, whose participants had been supplied with proceedings in the form of technical reports during the meetings. The contributions encompass various topics, like concurrency issues, timed automata, rough set theory, formal linguistics, or case based reasoning.
The CS&P workshops, being held in even years in Germany and in odd years in Poland, are supported by Humboldt and Warsaw Universities on the basis of an exchange programme. Initiated by computer science interest groups affiliated to both institutions in the mid-seventies, the workshops were suspended for some years in the eighties and resumed in 1992 in the extended form of participation: they evolved from purely bilateral meetings to ones gathering researchers from countries other than Germany and Poland as well. The scope of subjects has also been broadened from linguistic and logical issues initially, to semantic and logical concepts in concurrency, specification and, in general, programming at present, though with complexity analysis almost entirely left aside. The name CS&P, assigned to this annual event after its reactivating, reflected the actual subject of interest of the German and Polish initiative groups that, in the meantime, developed into the Institutes of Informatics of Humboldt and Warsaw Universities.
The organisers would like to thank all the authors and referees for their contribution as well as those colleagues from Humboldt and Warsaw Universities, who helped in making the CS&P'2000 meeting possible.
Editors
Hans-Dieter Burkhard and Peter Starke | Ludwik Czaja |
Humboldt Universität zu Berlin | Warsaw University |
Institut für Informatik, | Institute of Informatics |
Unter den Linden 6, D-10099 Berlin, Germany | Banacha 2, 02-097 Warsaw, Poland |
In this paper we propose a model, timed automata with non-instantaneous actions, which allows representing in a suitable way real-time systems. Timed automata with non-instantaneous actions extend the timed automata model by dropping the assumption that actions are instantaneous: in our model an action can take some time to be completed. We investigate the expressiveness of the new model, comparing it with classical timed automata. In particular, we study the set of timed languages which can be accepted by timed automata with non-instantaneous actions. We prove that timed automata with non-instantaneous actions are more expressive than timed automata and less expressive than timed automata with e edges. Moreover we define the parallel composition of timed automata with non-instantaneous actions. We point out how the specification by means of a parallel timed automaton with non-instantaneous actions is, in some cases, more convenient to represent reality.
Notions of similarity and distance play an important role in informatics. Different disciplines have developed their own treatment of related measures. We consider this problem under the viewpoint of Case Based Reasoning. While distance and similarity can be considered to be formally equivalent, there exist some differences concerning their intuitive use which have impact to the composition of global measures from local ones.
The definition of process admitted here follows the line developed for elementary (1-safe) Petri nets and published in [Cza 99], [Cza 2000a], [Cza-Kud 2000]. It pertains not to any particular net, thus allows for collecting processes into arbitrary sets, i.e. process languages, and for asking questions like: for a given process language decide if there exists a Place/Transition net and if yes, contruct it (synthesis). The collection of all process languages is a semantic domain for Place/Transition nets. w-process languages contain both finite and infinite processes. The main problems pursued are analysis, synthesis and iteration lemmata for w-process languages. Surprisingly, the problems enjoy much simpler solutions for processes generated by P/T nets than generated by elementary nets.
We study four solutions to the reachability problem for 1-safe Petri nets, all of them based on the unfolding technique. We define the problem as follows: given a set of places of the net, determine if some reachable marking puts a token in all of them. Three of the solutions to the problem are taken from the literature [McM92,Mel98,Hel99], while the fourth one is first introduced here. The new solution shows that the problem can be solved in time O(n^{k}), where n is the size of the prefix of the unfolding containing all reachable states, and k is the number of places which should hold a token. We compare all four solutions on a set of examples, and extract a recommendation on which algorithms should be used and which ones not.
In recent years many authors have come up with definitions of object Petri nets. They can be divided into two main classes: those that try to model object-orientation within a framework of Petri nets, and those modelling the token objects of an environment net by Petri nets but do not follow the paradigm of object-orientation. We compare some aspects of the latter kind with the foundations of Linear Logic and Linear Logic Petri nets, establishing some simulations between the formalisms.
We study independence of events in the unfoldings of Petri nets, in particular indirect influences between concurrent events: Confusion in the sense of Smith [11] and weak interference. The role of structural (conflict) clusters is investigated, and a new unfolding semantics based on clusters motivated and introduced.
The paper pursues the investigation of Timed Cooperating Automata (TCAs) by studying transformations which are suggested as means for stepwise TCA construction.
Nested Petri nets (NP-nets) is a formalism for modeling hierarchical multi-agent systems. Tokens in nested Petri nets are elements represented by nets themselves. The paper continues investigating semantic properties of NP-nets, started in [10], where two-level NP-nets were studied. Here we consider multi-level and recursive NP-nets, for which decidability of termination and some other properties are proved. A comparison of NP-nets with other Petri net models is given.
This paper considers the application of Petri nets with timestamps to the problem of disassembly scheduling with parts commonality, which is the problem of determining time and quantity of ordering the used product to fulfill the demands of individual disassembled parts. Unlike the simple case, parts commonality creates dependencies of components and makes them difficult to solve. A comparison of methods using an example problem from a previous research shows that the method suggested in this paper is much simpler and more extendable than the previous one. Also, we show that the solutions obtained from the previous algorithm are not optimal in general.
This paper considers models of sensors, filters, and sensor fusion with Petri nets defined in the context of rough sets. Sensors and filters are fundamental computational units in the design of systems. The intent of this work is to construct Petri nets to simulate conditional computation in approximate reasoning systems, which are dependent on filtered input from selected sensors considered relevant in problem solving. In this paper, coloured Petri nets provide a computational framework for the definition of a family of Petri nets based on rough set theory. Sensors are modeled with what are known as receptor processes in rough Petri nets. Filters are modeled as ukasiewicz guards on some transitions in rough Petri nets. A ukasiewicz guard is defined in the context of multivalued logic. ukasiewicz guards are useful in culling from a collection of active sensors those sensors with the greatest relevance in a problem-solving effort such as classification of a "perceived" phenomenon in the environment of an agent. The relevance of a sensor is computed using a discrete rough integral. The form of sensor fusion considered in this paper consists in selecting only those sensors considered relevant in solving a problem. The contribution of this paper is the modeling of sensors, filters, and fusion in the context of receptor processes, ukasiewicz guards, and rough integration, respectively.
Given a (possibly partially defined) state, all count vectors of transition sequences reaching that state are solutions to a corresponding Petri net state equation. We propose a search strategy where sequences corresponding to a minimal solution of the state equation are explored first. Then step by step the search space is relaxed to arbitrary count vectors. This heuristics relies on the observation that in many (though provably not in all) cases, minimal solutions of the state equation can be realized as a firing sequence. If no target state is reachable, either the state equation does not have solutions, or our search method would yield the full state space.
We study the impact of the state equation on reachability, present an algorithm that exploits information from the state equation and discuss its application in stateless search as well as its combination with stubborn set reduction.
Information sources provide us with granules of information that must be transformed, analyzed and built into structures that support problem solving. One of the main goals of information granule calculi is to develop algorithmic methods for construction of complex information granules from elementary ones by means of available operations and inclusion (closeness) measures. These constructed complex granules represent a form of information fusion. Such granules should satisfy some constraints like quality criteria or/and degrees of granule inclusion in (closeness to) a given information granule. Information granule decomposition methods are important components of those methods. We discuss some information granule decomposition methods.
The problem of improving rough set based expert systems by modifying a notion of reduct is discussed. The notion of approximate reduct is introduced, as well as some proposals of quality measure for such a reduct. The complete classifying system based on approximate reducts is presented and discussed. It is proved that the problem of finding optimal set of classifying agents based on approximate reducts is NP-hard; the genetic algorithm is applied to find the suboptimal set. Experimental results show that the classifying system is effective and relatively fast.