Mathematical optimization for the visualization of complex datasets

Guerrero Lozano, Vanesa

Mathematical optimization for the visualization of complex datasets

Guerrero Lozano, Vanesa

Supervised by:

Emilio Carrizosa Priego Director
Dolores Romero Morales Director

Defence university: Universidad de Sevilla

Fecha de defensa: 26 June 2017

Committee:

Albert Satorra Brucart Chair
Josefa Ramírez Cobo Secretary
Alfredo Marín Pérez Committee member
Martine Labbé Committee member
Belén Martín Barragán Committee member

Type: Thesis

Teseo: 467834 DIALNET Idus editor

Abstract

This PhD dissertation focuses on developing new Mathematical Optimization models and solution approaches which help to gain insight into complex data structures arising in Information Visualization. The approaches developed in this thesis merge concepts from Multivariate Data Analysis and Mathematical Optimization, bridging theoretical mathematics with real life problems. The usefulness of Information Visualization lies with its power to improve interpretability and decision making from the unknown phenomena described by raw data, as fully discussed in Chapter 1. In particular, datasets involving frequency distributions and proximity relations, which even might vary over the time, are the ones studied in this thesis. Frameworks to visualize such enclosed information, which make use of Mixed Integer (Non)linear Programming and Difference of Convex tools, are formally proposed. Algorithmic approaches such as Large Neighborhood Search or Difference of Convex Algorithm enable us to develop matheuristics to handle such models. More specifically, Chapter 2 addresses the problem of visualizing a frequency distribution and an adjacency relation attached to a set of individuals. This information is represented using a rectangular map, i.e., a subdivision of a rectangle into rectangular portions so that their areas reflect the frequencies, and the adjacencies between portions represent the adjacencies between the individuals. The visualization problem is formulated as a Mixed Integer Linear Programming model, and a matheuristic that has this model at its heart is proposed. Chapter 3 generalizes the model presented in the previous chapter by developing a visualization framework which handles simultaneously the representation of a frequency distribution and a dissimilarity relation. This framework consists of a partition of a given rectangle into piecewise rectangular portions so that the areas of the regions represent the frequencies and the distances between them represent the dissimilarities. This visualization problem is formally stated as a Mixed Integer Nonlinear Programming model, which is solved by means of a matheuristic based on Large Neighborhood Search. Contrary to previous chapters in which a partition of the visualization region is sought, Chapter 4 addresses the problem of visualizing a set of individuals, which has attached a dissimilarity measure and a frequency distribution, without necessarily cov-ering the visualization region. In this visualization problem individuals are depicted as convex bodies whose areas are proportional to the given frequencies. The aim is to determine the location of the convex bodies in the visualization region. In order to solve this problem, which generalizes the standard Multidimensional Scaling, Difference of Convex tools are used. In Chapter 5, the model stated in the previous chapter is extended to the dynamic case, namely considering that frequencies and dissimilarities are observed along a set of time periods. The solution approach combines Difference of Convex techniques with Nonconvex Quadratic Binary Optimization. All the approaches presented are tested in real datasets. Finally, Chapter 6 closes this thesis with general conclusions and future lines of research