98 (1) 2010
- Special Issue on Intelligent Data Analysis in Granular Computing. Preface i-ii
- Analysis of the Effectiveness of the Genetic Algorithms based on Extraction of Association Rules
Jesus Alcala-Fdez, Nicolo Flugy-Pape, Andrea Bonarini, Francisco Herrera 1-14
Data mining is most commonly used to induce association rules from transaction data, which can help decision-makers easily analyze the data and make good decisions regarding the domains concerned. Most conventional studies focus on binary or discrete-valued transaction data; however, the data in real-world applications usually consist of quantitative values. In recent years, many researchers have proposed genetic algorithms for mining interesting association rules from quantitative data. In this paper, we present a study of three genetic association rule extraction methods to show their effectiveness for mining quantitative association rules. Experimental results over two real-world databases are shown.
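As a minimal, hedged illustration of the rule-quality measures such miners optimize — support and confidence of an interval-based quantitative rule over hypothetical records (the surveyed genetic methods use richer fitness functions than this sketch):

```python
# Hypothetical transaction data; attribute names and bounds are illustrative.
records = [
    {"age": 28, "income": 42000},
    {"age": 35, "income": 55000},
    {"age": 30, "income": 25000},
    {"age": 52, "income": 70000},
    {"age": 26, "income": 31000},
]

def in_interval(value, low, high):
    return low <= value <= high

def rule_metrics(records, antecedent, consequent):
    """Support/confidence of a rule "attr in [lo, hi] => attr in [lo, hi]".

    antecedent, consequent: (attribute, low, high) interval conditions.
    """
    a_attr, a_lo, a_hi = antecedent
    c_attr, c_lo, c_hi = consequent
    n_ante = sum(1 for r in records if in_interval(r[a_attr], a_lo, a_hi))
    n_both = sum(1 for r in records
                 if in_interval(r[a_attr], a_lo, a_hi)
                 and in_interval(r[c_attr], c_lo, c_hi))
    support = n_both / len(records)          # fraction matching both sides
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

# Rule: age in [25, 40] => income in [30000, 60000]
sup, conf = rule_metrics(records, ("age", 25, 40), ("income", 30000, 60000))
```

A genetic miner would evolve the interval bounds themselves, scoring each candidate rule with measures like these.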
- Scalable Clustering for Mining Local-Correlated Clusters in High
Dimensions and Large Datasets
Kun-Che Lu, Don-Lin Yang 15-32
Clustering is useful for mining the underlying structure of a dataset in
order to support decision making, since target or high-risk groups can be
identified. However, for high-dimensional datasets, the result of
traditional clustering methods can be meaningless, as clusters may be
depicted only with respect to a small subset of features. Taking customer
datasets as an example, certain customers may correlate in their salary
and education, while others may correlate in their job and house location.
If one uses all the features of a customer for clustering, these
local-correlated clusters may not be revealed. In addition, processing
high-dimensional and large datasets is a challenging problem in decision
making: searching all the combinations of every feature with every record
to extract local-correlated clusters is infeasible, being exponential in
the data dimensionality and cardinality. In this paper, we propose a
scalable 2-Leveled Approximated Hyper-Image-based Clustering framework,
referred to as 2L-HIC-A, for mining local-correlated clusters, where each
level's clustering process requires only one scan of the original dataset.
Moreover, the data-processing time of 2L-HIC-A can be independent of the
input data size. In 2L-HIC-A, various well-developed image processing
techniques can be exploited for mining clusters. Instead of proposing a
new clustering algorithm, our framework can accommodate other clustering
methods for mining local-correlated clusters, and shed new light on
existing clustering techniques.
- Discovering the Radio Signal Coverage Hole and Weak Coverage Area
in Mobile Network by Spatiotemporal Data Mining on Location-Based Services
Wei-Hsun Lee, Shian-Shyong Tseng, Ching-Hung Wang, Shen-Lung Tung 33-47
Locating the radio signal coverage holes (SCH) and signal weak coverage areas (SWCA) in the mobile network and taking appropriate actions to improve network quality are major tasks for mobile network operators (MNO). A novel approach, the Signal Coverage Hole and weak coverage Area DIscovering Model (SCHADIM), is proposed to achieve this goal, based on spatiotemporal data mining of raw data collected from location-based services (LBS). It reuses the communication raw data of LBS-based applications and transforms it into mobile network communication quality monitoring information by integrating a GIS (geographical information system), a road network database and a mobile network base station database. In this way, the vehicles in the LBS-based applications can be regarded as mobile network signal probing vehicles, providing plentiful information for discovering the SCH/SWCA. Compared to the traditional mobile network signal probing vehicle method, known as the radio site verification (RSV) method, the proposed SCHADIM has spatiotemporal coverage as well as cost advantages. A mobile network monitoring system, the cellular network quality safeguard (CQS), has been implemented based on the proposed model; it combines with domain expert heuristics to provide decision support information for optimizing the mobile network.
- Predicting Website Audience Demographics for Web Advertising Targeting Using
Multi-Website Clickstream Data
Koen W. De Bock, Dirk Van den Poel 49-67
Several recent studies have explored the virtues of behavioral targeting and personalization for online advertising. In this paper, we add to this literature by proposing a cost-effective methodology for the prediction of demographic website visitor profiles that can be used for web advertising targeting purposes. The methodology involves the transformation of website visitors' clickstream patterns to a set of features and the training of Random Forest classifiers that generate predictions for gender, age, level of education and occupation category. These demographic predictions can support online advertisement targeting (i) as an additional input in personalized advertising or behavioral targeting, or (ii) as an input for aggregated demographic website visitor profiles that support marketing managers in selecting websites and achieving an optimal correspondence between target groups and website audience composition. The proposed methodology is validated using data from a Belgian web metrics company. The results reveal that Random Forests demonstrate superior classification performance over a set of benchmark algorithms. Further, the ability of the model set to generate representative demographic website audience profiles is assessed. The stability of the models over time is demonstrated using out-of-period data.
- Mining Outliers in Correlated Subspaces for High Dimensional
Data Sets
Jinsong Leng, Tzung-Pei Hong 71-86
Outlier detection in high-dimensional data sets is a challenging data mining task. Mining outliers in subspaces seems to be a promising solution, because outliers may be embedded in some interesting subspaces. Searching all possible subspaces, however, leads to the problem known as "the curse of dimensionality". Due to the existence of many irrelevant dimensions in high-dimensional data sets, it is of paramount importance to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. Dimension-growth subspace clustering techniques can find interesting subspaces in relatively low-dimensional spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper investigates the possibility of detecting outliers in correlated subspaces. We present a novel approach that identifies outliers in the correlated subspaces, where the degree of correlation among dimensions is measured in terms of the mean squared residue. We employ a dimension-reduction method to find the correlated subspaces and, based on the correlated subspaces obtained, introduce another criterion called the "shape factor" to rank the most important of the projected subspaces. Finally, outliers are detected in the most important subspaces using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high-dimensional data sets.
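The mean squared residue named in the abstract can be sketched as follows — the formula is the classical one from biclustering; this is an illustration, not the authors' exact implementation:

```python
# Mean squared residue (MSR) of a submatrix: the average squared deviation
# of each entry from its row mean + column mean - overall mean. A block of
# strongly correlated rows/columns has MSR close to zero.
def mean_squared_residue(block):
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(block[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    # Residue of entry (i, j): a_ij - rowmean_i - colmean_j + overall mean.
    return sum((block[i][j] - row_means[i] - col_means[j] + overall) ** 2
               for i in range(n_rows)
               for j in range(n_cols)) / (n_rows * n_cols)

# Rows that differ only by additive shifts are perfectly correlated: MSR == 0.
perfect = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Subspaces whose projected submatrix has low MSR would then be the "correlated subspaces" in which classical outlier detectors are run.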
- Covariance Matrix Self-Adaptation and Kernel Regression - Perspectives
of Evolutionary Optimization in Kernel Machines
Oliver Kramer 87-106
Kernel-based techniques have shown outstanding success in data mining and machine learning in the recent past. Many optimization problems of kernel-based methods suffer from multiple local optima. Evolution strategies have grown into successful methods for non-convex optimization. This work shows how both areas can profit from each other. We investigate the application of evolution strategies to Nadaraya-Watson-based kernel regression and vice versa. The Nadaraya-Watson estimator is used as a meta-model during optimization with the covariance matrix self-adaptation evolution strategy. An experimental analysis evaluates the meta-model-assisted optimization process on a set of test functions and investigates model sizes and the balance between objective function evaluations on the real function and on the surrogate. In turn, evolution strategies can be used to optimize the embedded optimization problem of unsupervised kernel regression. The latter is fairly parameter-dependent, and minimization of the data space reconstruction error is an optimization problem with numerous local optima. We propose an evolution strategy based unsupervised kernel regression method to solve this embedded learning problem. Furthermore, we tune the novel method by means of the parameter tuning technique sequential parameter optimization.
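As a hedged sketch of the Nadaraya-Watson estimator used as a surrogate model above — a locally weighted average of observed responses (Gaussian kernel and bandwidth chosen here for illustration; the paper's settings may differ):

```python
import math

def nadaraya_watson(x, xs, ys, bandwidth=1.0):
    """Estimate f(x) as a kernel-weighted average of samples (xs, ys)."""
    # Gaussian kernel weight for each sample point.
    weights = [math.exp(-((x - xi) / bandwidth) ** 2 / 2) for xi in xs]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, ys)) / total

# Samples of f(x) = x^2; the estimator interpolates between nearby responses.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 4.0, 9.0]
estimate = nadaraya_watson(1.5, xs, ys, bandwidth=0.5)
```

In a meta-model-assisted evolution strategy, cheap estimates like this one stand in for some of the expensive objective function evaluations.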
- Risk Mining in Medicine: Application of Data Mining to Medical Risk Management
Shusaku Tsumoto, Shoji Hirano 107-121
This paper proposes an application of data mining to
medical risk management, where data mining
techniques are applied to detection, analysis and evaluation of
risks potentially existing in clinical environments.
We applied this technique to the following two medical domains:
risk aversion of nurse incidents and infection control.
The results show that the data mining methods were effective
for detecting and averting risk factors.
- Communication Error Determination System for Multi-layered or Chained Situations
Akinori Abe, Yukio Ohsawa, Hiromi Itoh Ozaku, Kaoru Sagara, Noriaki Kuwahara, Kiyoshi Kogure 123-142
Many medical accidents and incidents occur due to communication errors. To avoid such incidents, in this paper we propose a system for determining communication errors.
In particular, we propose a model that can be applied to multi-layered
or chained situations.
First, we provide an overview of communication errors in nursing activities. Then we
describe the warp and woof model for nursing tasks, proposed by Harada, which considers multi-layered or chained situations.
Next, we describe a system for determining communication errors based on the warp and woof model. The system is capable of generating nursing activity diagrams semi-automatically and compiles the necessary nursing activities.
We also propose a prototype tagging of the nursing corpus for
effective generation of the diagrams.
Finally, we combine the diagram generation with the Kamishibai KeyGraph
to determine possible locations of hidden or potential
factors of communication errors.
98 (2-3) 2010
- Deterministic and Unambiguous Families within Recognizable Two-dimensional
Languages
Marcella Anselmo, Dora Giammarresi, Maria Madonia 143-166
Recognizable two-dimensional languages (REC) are defined by tiling
systems that generalize to two dimensions non-deterministic finite
automata for strings. We introduce the notion of deterministic
tiling system and the corresponding family of languages (DREC) and
study its structural and closure properties. Furthermore we show
that, in contrast with the
one-dimensional case, there exist other classes between
deterministic and non-deterministic families that we separate by
means of examples and decidability properties.
- Feature Selection via Maximizing Fuzzy Dependency
Qinghua Hu, Pengfei Zhu, Jinfu Liu, Yongbin Yang and Daren Yu 167-181
Feature selection is an important preprocessing step in pattern
analysis and machine learning. The key issue in feature selection is
to evaluate the quality of candidate features. In this work, we
introduce a weighted distance learning algorithm for feature
selection via maximizing fuzzy dependency. We maximize the fuzzy dependency between features and
the decision by distance learning and then evaluate the quality of
features with the learned weight vector. Features receiving large
weights are considered useful for classification learning. We
compare the proposed technique with some classical methods, and the
experimental results show that the proposed algorithm is effective.
- Learning Behaviors of Functions
Bala Kalyanasundaram, Mahe Velauthapillai 183-198
We consider the inductive inference model of Gold [15].
Suppose we are given a set of functions that are
learnable with a certain number of mind changes and errors.
What properties of these functions are learnable
if we allow fewer mind changes or errors?
In order to answer this question this paper extends the
Inductive Inference model
introduced by Gold [15]. Another motivation for this
extension is to understand and characterize
properties that are learnable for a given set of functions.
Our extension considers a wide range of properties
of functions based on their input-output relationship.
Two specific properties of functions are studied in this paper.
The first property, which we call modality, explores
how the output of a function fluctuates. For example,
consider a function that predicts the price of a stock.
A brokerage company buys and sells stocks very often in a day
for its clients with the intent of maximizing their profit.
If the company is able to predict the
trend of the stock market "reasonably" accurately, then it is
bound to be very successful.
The identification criterion for this property of a function f is called PREX, which
predicts whether f(x) is equal to, less than, or greater than f(x+1) for each x.
Next, as opposed to a constant tracking by a brokerage company,
an individual investor does not often track dynamic changes in stock values.
Instead, the investor would like to move the investment to a less risky option
when the investment exceeds or falls below certain threshold.
We capture this notion using an
identification criterion called TREX that
essentially predicts if a function value is at, above,
or below a threshold value.
Conceptually, modality prediction (i.e., PREX) and threshold prediction
(i.e., TREX) are "easier" than EX learning. We show that
neither the number of errors nor the number of mind-changes
can be reduced when we ease the learning criterion from exact
learning to learning modality or threshold. We also prove
that PREX and TREX are totally different properties
to predict.
That is, the strategy for a brokerage company may not be a
good strategy for an individual investor, and vice versa.
- Adaptive Rough Entropy Clustering Algorithms in Image Segmentation
Dariusz Małyszko, Jarosław Stepaniuk 199-231
High-quality performance of image segmentation methods is a leading priority in the design and implementation of image analysis systems. Incorporating the most important image data information into the segmentation process has resulted in the development of innovative frameworks such as fuzzy systems, rough systems and, recently, rough-fuzzy systems. Data analysis based on rough and fuzzy systems is designed to apprehend internal data structure in
the case of incomplete or uncertain information. The rough entropy framework proposed in [12,13] has been dedicated to application in clustering systems, especially image segmentation systems. We extend that framework into eight distinct rough entropy measures and related clustering algorithms. The introduced solutions are capable of adaptively incorporating the most important factors that contribute to the relation between data objects, and make possible a better understanding of the image structure. In order to prove the relevance of the proposed rough entropy measures, an evaluation of rough entropy segmentations based on comparison with human segmentations from the Berkeley and Weizmann image databases is presented. At the same time, rough entropy based measures applied in the domain of image segmentation quality evaluation are compared with standard image segmentation indices. Additionally, the rough entropy measures seem to properly capture the properties validated by different image segmentation quality indices.
- Design of a Hybrid Quantizer with Variable Length Code
Zoran H. Perić, Milan R. Dinčić, Marko D. Petković 233-256
In this paper a new model for compression of a Laplacian source is given. The model consists of a hybrid quantizer whose output levels are coded with a Golomb-Rice code. The hybrid quantizer is a combination of a uniform and a nonuniform quantizer, and it can be considered a generalized quantizer whose special cases are the uniform and nonuniform quantizers. We propose a new generalized optimal compression function for companding quantizers. The hybrid quantizer has better performance (smaller bit rate and complexity for the same quality) than both uniform and nonuniform quantizers, because it joins their good characteristics. The hybrid quantizer also allows great flexibility, because there are many combinations of the number of levels in the uniform part and in the nonuniform part that give similar quality. Each of these combinations has a different bit rate and complexity, so we are free to choose the combination that is most appropriate for our application with regard to quality, bit rate and complexity. We do not have such freedom of choice when using uniform or nonuniform quantizers. Until now, it has been thought that the uniform quantizer is the most appropriate to use with a lossless code, but in this paper we show that the combination of a hybrid quantizer and a lossless code gives better performance. As the lossless code we use the Golomb-Rice code, because it is especially suitable for a Laplacian source: it gives an average bit rate very close to the entropy and is easier to implement than the Huffman code. The Golomb-Rice code is used in many modern compression standards. Our model can be used for compression of all signals with a Laplacian distribution.
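As a hedged illustration of the lossless stage described above — a generic Rice code (a Golomb code with divisor 2^k) for non-negative integers, not the authors' exact parameterization:

```python
def rice_encode(n, k):
    """Rice-encode non-negative integer n with parameter k (divisor 2**k)."""
    q, r = n >> k, n & ((1 << k) - 1)
    # Quotient in unary (q ones, then a terminating zero),
    # remainder in exactly k binary bits.
    remainder = format(r, "b").zfill(k) if k else ""
    return "1" * q + "0" + remainder

def rice_decode(bits, k):
    """Decode one Rice codeword back to the original integer."""
    q = bits.index("0")                       # length of the unary prefix
    r = int(bits[q + 1:q + 1 + k], 2) if k else 0
    return (q << k) | r
```

Small quantizer indices get short codewords, which is why the code approaches the entropy of a Laplacian (geometrically distributed) source for a well-chosen k.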
- Tree Structure Based Data Hiding for Progressive Transmission Images
Piyu Tsai 257-275
Progressive image transmission (PIT) is supported by several
encoders such as SPIHT, JPEG2000 and so on. However, few data
hiding schemes for progressive transmission images have been designed. In
this paper, tree-structure-based data hiding for set partitioning
in hierarchical trees (SPIHT) images is proposed. The multi-stage
SPIHT-encoded bit stream is structured as trees, and the secret
image can be progressively embedded into the SPIHT bit trees.
Experimental results show that progressive image data hiding is
achieved: the secret image is progressively embedded into and
extracted from SPIHT-encoded images. Furthermore, a higher hiding
capacity is provided in the earlier encoding stages, which helps
the secret image to be identified earlier. An adaptive hiding
capacity can also be obtained by using different tree structures.
In comparison with Tsai et al.'s scheme, the proposed scheme has
a higher hiding capacity and a better secret image quality.
- On the Insecurity of an Identity Based Proxy Re-encryption Scheme
Xu An Wang, Xiaoyuan Yang 277-281
At Pairing'07, Matsuo proposed two proxy re-encryption schemes:
proxy re-encryption from CBE to IBE and IBE to IBE. Now both schemes
have been standardized by P1363.3 workgroup. In this paper, we show
that their identity based proxy re-encryption scheme is insecure. We
give two attacks on this scheme. The first attack shows that the
proxy can re-encrypt any IBE user's ciphertext to be the delegatee's
ciphertext. The second attack implies that, if the proxy colludes
with any delegatee, the proxy and this delegatee can derive any
other IBE user's secret key.
- Free-Choice Petri Nets without Frozen Tokens, and Bipolar
Synchronization Systems
Joachim Wehler 283-320
Bipolar synchronization systems (BP-systems) constitute a class of coloured Petri nets, well suited for modelling the
control flow of discrete dynamical systems. Every BP-system has an underlying ordinary Petri net, a T-system. It further has a second ordinary net attached, a free-choice system. We prove that a BP-system is safe and live if the T-system and the free-choice system are safe and live and the free-choice system in addition has no frozen tokens. This result is the converse of a theorem of Genrich and Thiagarajan and proves an old conjecture. As a consequence we obtain two results
about the existence of safe and live BP-systems with prescribed ordinary Petri nets. For the proof of these theorems we
introduce the concept of a morphism between Petri nets as a means of comparing different Petri nets. We then apply the
classical theory of free-choice systems.
- Capacity-Raising Steganography Using Multi-Pixel Differencing
and Pixel-Value Shifting Operations
Cheng-Hsing Yang, Shiuh-Jeng Wang, Chi-Yao Weng 321-336
In order to provide a large embedding capacity and to minimize distortion of the stego-image, a steganographic method using multi-pixel differencing is presented in this paper. It takes into consideration the four pixels of a block, and the differences between the lowest gray-value pixel and its surrounding pixels are used to determine the degree of smoothness and sharpness for embedding the secret data. If the difference values in a block are large, the block is located in a sharp area and more data can be embedded; on the other hand, if the block is located in a smooth area, less data can be embedded. The multi-pixel differencing method is employed in our scheme, and we also propose a pixel-value shifting method to increase the image quality. The experimental results show that our scheme has a large embedding capacity without creating noticeable distortion.
98 (4) 2010
- Bridging Logic and Computer Science:
Dedicated to Johann A. Makowsky for his 60th birthday. Preface i-i
- On the Borel Complexity of MSO Definable Sets of Branches
Mikołaj Bojańczyk, Damian Niwiński, Alexander Rabinovich, Adam Radziwończyk-Syta, Michał Skrzypczak 337-349
An infinite binary word can be identified with a branch in the full
binary tree. We consider sets of branches definable in monadic
second-order logic over the tree, where we allow some extra
monadic predicates on the nodes. We show that this class
equals the class of Boolean combinations of sets in the
Borel class Σ^0_2 over the
Cantor discontinuum. Note that the latter coincides with the
Borel complexity of ω-regular languages.
- Properties of Almost All Graphs and Generalized Quantifiers
Anuj Dawar, Erich Grädel 351-372
We study 0-1 laws for extensions of
first-order logic by Lindström quantifiers. We state
sufficient conditions on a quantifier Q expressing
a graph property, for the logic FO[Q]
- the extension of first-order logic by means of the quantifier Q -
to have a 0-1 law.
We use these conditions to show, in particular, that FO[Rig], where Rig is the quantifier
expressing rigidity, has a 0-1 law. We also show that extensions of
first-order logic with quantifiers for Hamiltonicity, regularity and
self-complementarity of graphs do not have a 0-1 law.
Blass and Harary pose the question whether there is
a logic which is powerful enough to express Hamiltonicity
or rigidity and which has a 0-1 law. It is
a consequence of our
results that there is no such regular logic (in the sense of abstract
model theory) in the case of Hamiltonicity, but there is
one in the case of rigidity.
We also consider sequences of vectorized quantifiers, and show
that the extensions of first-order logic obtained by adding such
sequences generated by quantifiers that are closed under substructures
have 0-1 laws. The positive results also extend to the infinitary logic with
finitely many variables.
- A Most General Edge Elimination Polynomial - Thickening of Edges
Christian Hoffmann 373-378
We consider a graph polynomial ξ(G;x,y,z) introduced by Ilia
Averbouch, Benny Godlin, and Johann A. Makowsky (2008). This graph
polynomial simultaneously generalizes the Tutte polynomial as well
as a bivariate chromatic polynomial defined by Klaus Dohmen, André
Pönitz, and Peter Tittmann (2003). We derive an identity which
relates the graph polynomial ξ of a thickened graph (i.e. a
graph with each edge replaced by k copies of it) to ξ of the
original graph. As a consequence, we observe that at every point
(x,y,z), except for points lying within some set of dimension 2,
evaluating ξ is #P-hard. Thus, ξ supports Johann
A. Makowsky's difficult point conjecture for graph polynomials
(2008).
- A Note on Two-pebble Automata Over Infinite Alphabets
Michael Kaminski, Tony Tan 379-390
It is shown that the emptiness problem for
two-pebble automata languages is undecidable and
that two-pebble automata are weaker than three-pebble automata.
- Tree-width in Algebraic Complexity
Klaus Meer 391-409
The paper surveys some of the author's work studying
the algorithmic importance of the tree-width notion in
algebraic frameworks. Two approaches are described. The first
gives an algorithmic meta-theorem for certain logically
characterized properties within the Blum-Shub-Smale
(BSS) model of computation over the reals. The second
reports on recent joint
work with P. Koiran relating Boolean complexity
and Valiant's approach
to study families of polynomial systems over
infinite fields and their complexity.
We define particular families of polynomials via bounding the
tree-width of suitably attached graphs and study the expressive power
of the resulting families.
The work described here is partially co-authored with and partially very much
influenced by previous work of Janos A. Makowsky.