Hanaa PhD presentation, 14-4-2017
TRANSCRIPT
By Hanaa Ismail Elshazly
PhD Student Faculty of Computers and Information
Cairo University
Intelligent Visualization of Multidimensional Data Sets
Faculty of Computers and Information, Department of Computer Sciences, Cairo University
Supervisors
Prof. Aboul Ella Hassanien & Prof. Abeer Mohamed El Korany
Big picture: Multidimensional data → Reduction → Visualize
Intelligent Visualization of Multidimensional Data Sets
Dimensions: A dimension is a key descriptor, an index, by which you can access facts according to the value (or values) you want
Information visualization is the study of (interactive) visual representations of abstract data to reinforce human cognition. The abstract data include both numerical and non-numerical data, such as text and geographic information
Contents
1. Introduction
2. Proposed Framework
3. Experimental Results
4. Conclusion
5. Future Work
Introduction: General
• Massive and complex data are generated every day in many fields due to the advance of hardware and software technology.
• Curse of dimensionality is a major obstacle in machine learning and data mining.
• Clinical data referring to patients’ investigations contain irrelevant attributes that degrade the classification performance.
• Visualization is important when analyzing multidimensional datasets, since it can help humans discover and understand complex relationships in data.
Introduction: Data Problems
• Data quality
• Integrating redundant data from different sources
• Mining information from heterogeneous databases
• Difficulty in building the training set
• Dynamic databases
• Dimensionality
Introduction: Dimensionality Reduction
• In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration via obtaining a set of principal variables. It can be divided into feature selection and feature extraction.
• The most popular search methods, which are manageable in low-dimensional spaces, can be totally unmanageable in high-dimensional spaces
• The curse of dimensionality is a major obstacle in machine learning and data mining
• Reducing the dimensionality of the feature space leads to successful classification; selecting the optimal feature subset can substantially improve classification performance
FS Techniques: Filter, Wrapper, Embedded

Benefits of feature selection:
• Improves the comprehensibility of the induced concepts
• Decreases dataset complexity
• Improves classification performance
• Saves resources
• Enables visualization
• Gives a better understanding of the extracted knowledge
• Reduces computation requirements
• Reduces the effect of the curse of dimensionality

Massive data reduced to manageable data in domains such as microarray gene expression, medical images, huge databases, finance data, sensor arrays, and web documents.
Introduction: The Curse of Dimensionality
Damaging factors:
• Computational complexity
• Limits the applicability of ML techniques to real-world problems
• Slow learning process
• Difficulty of inducing concepts
• Decreased predictive performance
• Extra difficulty in finding potentially useful knowledge
• Difficulty in adding visualization ability
• Limited human capability: human inspection and interpretation of the data is not feasible
• Intractable behavior of search methods
• Conventional database management and data-analysis tools are insufficient
• Storage requirements
Proposed General Framework
• Pre-processing phase
• Feature selection phase
• Classification phase
• Rule refinement phase
• Visualization phase
Proposed General Framework
• Preprocessing phase: Discretization (Equal Binning), Simplification
• Feature selection phase: Reduction (PCA, Rough Sets)
• Classification phase: Rough Set rules generation via the Rough Set discernibility matrix
• Rules refinement phase: Reduction (Entropy, GA)
• Visualization phase: Nodes, Edges, Charts, Grids
Experimental Data Sets

Data Set                              Classes  Instances  Features  Source
Wisconsin Breast Cancer - Diagnosis   2        569        32        UCI Machine Learning Repository
Wisconsin Breast Cancer - Prognosis   2        198        32        UCI Machine Learning Repository
SPECTF Heart                          2        267        45        UCI Machine Learning Repository
Lymphography                          4        148        18        University Medical Centre, Institute of Oncology, Ljubljana, Yugoslavia
Indian Liver Patient                  2        583        11        UCI Machine Learning Repository
Prostate                              2        102        12600     UCI Machine Learning Repository
Pre-processing Phase
Aim: Reduce the number of values of a given continuous attribute by dividing the attribute's range into intervals and replacing low-level concepts with higher-level concepts.
Techniques:
• Equal Binning: transforms numerical variables into categorical counterparts.
• Simplification: rescales data into the range [1, 3].
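The simplification step can be sketched as a min-max rescale into [1, 3]. This is a minimal illustration; the function name and constant-feature handling are assumptions, not taken from the thesis:

```python
def simplify(values, lo=1.0, hi=3.0):
    """Min-max rescale a list of numeric values into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                       # constant feature: map everything to lo
        return [lo] * len(values)
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]

print(simplify([10, 20, 30]))  # -> [1.0, 2.0, 3.0]
```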
PREPROCESS flow: Multidimensional data → Discretization → Discretized data → Simplification → Simplified data
Pre-processing Phase: Equal Binning Algorithm
For each feature V in data D:
    divide the domain of V into k intervals of equal size, with interval width
        w = (max(V) - min(V)) / k
    and interval boundaries
        min(V) + w, min(V) + 2w, ..., min(V) + (k-1)w
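The equal-width binning step above can be sketched in a few lines of Python (a minimal sketch; names and the clamping of max(V) into the last bin are illustrative choices):

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins over [min(V), max(V)]."""
    vmin, vmax = min(values), max(values)
    w = (vmax - vmin) / k                  # interval width w = (max - min) / k
    labels = []
    for v in values:
        b = int((v - vmin) / w) if w > 0 else 0
        labels.append(min(b, k - 1))       # clamp max(V) into the last bin
    return labels

print(equal_width_bins([1, 2, 3, 4, 5, 6], 3))  # -> [0, 0, 1, 1, 2, 2]
```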
Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
How discretization techniques influence the classification of breast cancer data (accuracy %):

Classifier       Boolean Reasoning   Binning   Entropy
Naïve Bayes      91                  92.9      77.2
Decision Rules   95.3                95.3      91.4
KNN              94                  94.7      76.1
Feature Selection Phase
Components: PCA; Rough Sets (positive regions extraction, discernibility matrix)
Flow: Simplified data → PCA → positive regions extraction → positive regions → discernibility matrix → final reducts → reduced data
Aim: Determine a minimal feature subset that best contributes to accuracy, retaining high efficiency in representing the original features while neglecting features with little contribution to the prediction process.
• PCA (Principal Component Analysis): a statistical technique useful in data compression and reduction.
• Rough Sets: the main goal of rough set analysis is the induction of (learning) approximations of concepts.
Principal component Analysis Algorithm
Feature Selection Phase
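The PCA algorithm referenced on this slide can be sketched with NumPy as follows. This is a minimal eigendecomposition-based sketch, not the exact implementation used in the thesis; the example matrix is hypothetical:

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    Xc = X - X.mean(axis=0)                  # 1. centre each feature
    cov = np.cov(Xc, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # 3. eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]        # 4. sort components by explained variance
    W = eigvecs[:, order[:n_components]]     # 5. keep the top components
    return Xc @ W                            # 6. project the centred data

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = pca(X, 1)
print(Z.shape)  # (5, 1)
```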
Feature Selection Phase: PCA performance as a transformation method in Rotation Forest for chronic eye disease diagnosis
• Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Ahmad Taher Azar, “Ensemble classifiers for biomedical data : performance evaluation”, 8th International Conference on Computer Engineering & Systems (ICCES), ISBN: 978-1-4799-0078-7, pp 184-189, 2013.
• Hanaa Ismail Elshazly, Abeer Mohamed El Korany, Aboul Ella Hassanien, Mohamed Waly, “Chronic Eye Disease diagnosis using ensemble-based classifier ”, Second International Conference on Engineering and Technology(ICET), German University – Cairo-Egypt, 2014.
Many transformation methods have been applied in the literature, such as Principal Component Analysis (PCA), nonparametric discriminant analysis (NDA), random projections (RP), and independent component analysis (ICA).
• PCA gave the best results due to the diversity it provides.
• PCA preserves the discriminatory features.
• PCA provided better results than those extracted through nonparametric discriminant analysis (NDA) or random projections.
• PCA was therefore chosen as the transformation method in the two research papers cited above.
Feature Selection Phase
Hanaa Ismail Elshazly, Ahmad Taher Azar, Abeer Mohamed El Korany, Aboul Ella Hassanien, “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.
Rough Sets for Reduct Generation
Let T = (U, C, D) be a decision table, with U = {u1, u2, ..., un}. By M(T) we mean the n × n discernibility matrix [m_ij] defined as:

    m_ij = λ                                  if d(u_i) = d(u_j)
    m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}          if d(u_i) ≠ d(u_j)        (1)

For any u_i ∈ U, the discernibility function is:

    f_T(u_i) = ∧ { t(m_ij) : j ∈ {1, 2, ..., n} }                       (2)

where t(m_ij) is the disjunction of all variables a such that a ∈ m_ij  (3);
t(m_ij) = true if m_ij = λ, and t(m_ij) = false if m_ij = ∅.
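The discernibility matrix defined above can be computed with a short sketch. The tiny decision table below is hypothetical, chosen only to illustrate the definition:

```python
def discernibility_matrix(U, C, d):
    """m[i][j] = set of condition attributes in C that distinguish objects
    u_i and u_j when their decisions differ; empty set otherwise."""
    n = len(U)
    m = [[set() for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if d[i] != d[j]:  # only pairs with different decisions discern
                m[i][j] = {c for c in C if U[i][c] != U[j][c]}
    return m

# Hypothetical decision table: two condition attributes 'a', 'b'; decisions 0/1.
U = [{'a': 1, 'b': 0}, {'a': 1, 'b': 1}, {'a': 0, 'b': 1}]
d = [0, 1, 1]
m = discernibility_matrix(U, ['a', 'b'], d)
print(m[0][1], m[0][2])  # m[0][1] is {'b'}; m[0][2] is {'a', 'b'}
```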
Classification Phase
Flow: Multidimensional data + final reducts → rule generation → generated rules → classification with decision rules → classified instances; testing → tested instances
Aim: The learning algorithm, called a classifier, returns a set of decision rules together with a procedure that makes it possible to classify objects not found in the original decision table. Rough set rules are generated using the discernibility matrix.
Rough Set Rules Generation Algorithm
Let T = (U, C, D) be a decision table, with U = {u1, u2, ..., un}. By M(T) we mean the n × n matrix [m_ij] defined as:

    m_ij = λ                                  if d(u_i) = d(u_j)
    m_ij = {c ∈ C : c(u_i) ≠ c(u_j)}          if d(u_i) ≠ d(u_j)        (1)

Here m_ij is the set of all the condition attributes that classify objects u_i and u_j into different classes. For any u_i ∈ U:

    f_T(u_i) = ∧ { t(m_ij) : j ∈ {1, 2, ..., n} }                       (2)

where t(m_ij) is the disjunction of all variables a such that a ∈ m_ij  (3);
t(m_ij) = true if m_ij = λ, and t(m_ij) = false if m_ij = ∅.
Comparison of different classifiers against different data Sets
Hanaa Ismail Elshazly et al., “Rough Sets and Genetic Algorithms: A hybrid approach to breast cancer classification”, Proceedings of the Information and Communication Technologies, (WICT), ISBN: 978-1-4673-4806-5, World Congress, IEEE, pp 260-265, 2012.
Hanaa Ismail Elshazly et al., “Hybrid System based on Rough Sets and Genetic Algorithms for Medical Data Classifications”, International Journal of Fuzzy System Applications (IJFSA), doi: 10.4018/ijfsa.2013100103, 3(4), 31-46, 2013.
Rules Refinement Phase
RULE REFINEMENT flow: Multidimensional data → generated reducts → reducts evaluation → informative reduct → all rules generated → GA rules allocation → selected rules → testing on test multidimensional data → termination criteria → classified instances → refined decision rules
Aim: Reduce the number of rules so they can be easily visualized and presented to an expert without decreasing accuracy. Reducts are evaluated using entropy; a GA uses support and confidence as its fitness function.
Reduct Evaluation
Decision tree algorithms depend on information gain to find the expected amount of information needed to classify correctly.

Calculate the entropy of the target:

    Entropy(T) = − Σ_{i=1}^{c} p_i log2(p_i)

where c is the number of possible values of the target.

For each reduct R_i in Reducts:
    For each attribute X in R_i:
        Entropy(T, X) = Σ_{x ∈ values(X)} P(x) · Entropy(T_x)
        Gain(T, X) = Entropy(T) − Entropy(T, X)
Choose the reduct R_i with the largest information gain.
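The entropy and information-gain computations above can be sketched as follows (a minimal sketch; the toy rows and attribute names are hypothetical):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(T) = -sum p_i * log2(p_i) over the target's value distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Gain(T, X) = Entropy(T) - sum over values x of P(x) * Entropy(T_x)."""
    n = len(rows)
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

rows = [{'x': 0}, {'x': 0}, {'x': 1}, {'x': 1}]
labels = ['yes', 'yes', 'no', 'no']
print(info_gain(rows, labels, 'x'))  # -> 1.0 (x perfectly splits the target)
```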
Genetic Algorithm Using Support and Confidence as Fitness Function
Body ==> Consequent [Support, Confidence]
• Consequent: represents a discovered property of the examined data.
• Support: the percentage of records satisfying both the body and the consequent.
• Confidence: the percentage of records satisfying both the body and the consequent among those satisfying the body.
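Support and confidence for a Body ==> Consequent rule can be computed as below. This is a minimal sketch with hypothetical records; how the GA combines the two measures into a single fitness value is not specified here:

```python
def support_confidence(records, body, consequent):
    """Support: fraction of records satisfying body AND consequent.
    Confidence: among records satisfying the body, the fraction that
    also satisfy the consequent."""
    n = len(records)
    body_hits = [r for r in records if body(r)]
    both = [r for r in body_hits if consequent(r)]
    support = len(both) / n
    confidence = len(both) / len(body_hits) if body_hits else 0.0
    return support, confidence

# Hypothetical records: rule "f == 1 ==> cls == 'M'".
records = [{'f': 1, 'cls': 'M'}, {'f': 1, 'cls': 'M'},
           {'f': 1, 'cls': 'B'}, {'f': 0, 'cls': 'B'}]
s, c = support_confidence(records, lambda r: r['f'] == 1,
                          lambda r: r['cls'] == 'M')
print(s, c)  # support 0.5, confidence 2/3
```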
Visualization Phase
The expert can manage the induced rules through trust levels that enable fast trust decisions. Visual elements:
• Graph nodes
• Edges
• Charts
• Grids
VISUALIZATION flow: Rules & reducts → measurement calculation for rule support → refined rules with trust levels → rendering → refined decision rules
Visualization of Breast Cancer Reducts: visualization of the features of the breast data set ordered by their occurrence over all extracted reducts.
Experimental Results
Visualization of Breast Cancer Rules: visualization of global and detailed nodes representing refined classification rules of the breast data (rule counts: 86 R, 400 R, 87000 R).
Visualization of Breast Cancer Rules: visualization of refined breast cancer decision rules according to trust levels.
Visualization of Breast Cancer Rules: navigation through refined breast cancer decision rules according to trust levels.
Visualization of Prostate Cancer Reducts: visualization of all reducts of the prostate cancer data set and all features ordered by their occurrence in all extracted reducts.
Visualization of Prostate Cancer Rules: navigation through refined prostate cancer decision rules according to trust levels (rule counts: 26 R, 117 R, 22000 R).
Visualization of Prostate Cancer Rules: visualization of refined prostate cancer decision rules according to trust levels.
Visualization of Prostate Cancer Rules: navigation through refined prostate cancer decision rules according to trust levels.
Performance analysis
Accuracy of different classifiers on the Breast Diagnosis and Prostate data sets (reconstructed from the bar chart):

Classifier   Breast Diag   Prostate
DT           0.93          0.62
KNN          0.96          0.66
NB           0.92          0.64
RF           0.96          0.67
DR           0.98          0.62
ROT          0.97          0.72
RSGA         1             1
Hanaa Ismail Elshazly et al., ”Weighted Reduct Selection Metaheuristic Based Approach for Rules Reduction and Visualization” , International Conference on Computing Communication and Automation (ICCCA2016), IEEE, Buddh Nagar Uttar Pradesh, India, 2016
Experimental Results
Conclusions
• We have presented an approach for knowledge-based classification and visualization of decision rules which enhances the classification process and improves insight into the rules' knowledge.
• A physician can detect a minimum number of rules with trust levels to reach an efficient diagnosis of diseases.
Future Work
• The promising results of the proposed approach encourage applying it to other multidimensional data sets.
• Other dynamic visualization techniques can be applied to meet the different requirements of physicians.