Interesting Data Representations for Humans and Machines
Representation learning is the practice of automatically discovering useful representations from raw data. These representations capture the properties of each object in data with complex structure, such as social networks, images, and spatial data, in a simple format, such as a vector of numbers. Once all objects share a simple, unified format, a wide range of machine learning methods can be applied to them, which makes those methods more versatile. This versatility gives representation learning an important role in exploratory data analysis as well as in data-driven decision making.
My PhD research introduces a novel, mathematically principled framework for representation learning that can account for prior knowledge about the data. The research hypothesis is that this framework enables the learning of more informative representations and makes it possible to learn representations iteratively, with each iteration capturing information complementary to the representations already found. In turn, this enables new types of human-in-the-loop data exploration and improves the overall quality of machine learning models and predictions. We tested the hypothesis through extensive empirical studies on many data science tasks, and the results support it. In this blog post, I introduce two of these studies.
Visual representation learning methods are common tools in exploratory data analysis. By inspecting low-dimensional representations of data visually, a data scientist gains knowledge and insights. However, current visual representation learning methods yield a single static representation, which is insufficient to capture the complete structure of the data. In addition, the most salient structure shown in that representation is often already known to the analyst. Complementary representations are therefore needed for efficient data exploration. To address this, we developed a representation learning method called conditional t-SNE (ct-SNE). It captures structure in the data that complements the prior knowledge an analyst has explicitly expressed, enabling new insights to be discovered. ct-SNE has been implemented in Facebook’s codebase. An open source version of ct-SNE has also been made available, and it has been adopted by other companies to improve their production processes.
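To give a feel for how prior knowledge can be factored out of a visual embedding, here is a minimal sketch in Python. It is a simplified illustration and not the ct-SNE implementation: the function name, the weights alpha and beta, and the way labels enter the similarities are all assumptions made for this example. The idea it shows is that pairs of points sharing a known label are given a higher baseline similarity, so the embedding no longer needs to place them close together to explain that they are related.

```python
import numpy as np

def conditional_similarities(Y, labels, alpha=0.9, beta=0.1):
    """Pairwise similarities of embedded points, reweighted by label agreement.

    Y      : (n, d) array with the low-dimensional coordinates.
    labels : (n,) array encoding the analyst's prior knowledge.
    alpha  : hypothetical weight for pairs with the same label.
    beta   : hypothetical weight for pairs with different labels.
    """
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)

    # Squared Euclidean distance between every pair of embedded points.
    diff = Y[:, None, :] - Y[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Heavy-tailed Student-t kernel, the same kernel shape t-SNE uses.
    kernel = 1.0 / (1.0 + sq_dist)

    # Same-label pairs are weighted by alpha, all other pairs by beta.
    same_label = labels[:, None] == labels[None, :]
    weighted = np.where(same_label, alpha, beta) * kernel

    # A point is not compared with itself; normalize to a distribution.
    np.fill_diagonal(weighted, 0.0)
    return weighted / weighted.sum()

# Three points; points 0 and 1 share a label, point 2 does not.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 0, 1])
Q = conditional_similarities(Y, labels)
```

In this toy example, points 1 and 2 are equally far from point 0, yet the pair (0, 1) receives more similarity mass because the two points share a label. In a full method, similarities of this kind would take the place of the usual low-dimensional similarities when the embedding is optimized, leaving distances in the plot free to express structure the labels do not explain.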
Network representation learning aims to capture the structure of a network in a vector-space representation. However, because networks have complex structural properties, some information cannot be fully captured by conventional Euclidean vector representations. To model network data better, we first encode specific complex structural information as prior knowledge and then model the remaining information in the network using Euclidean vectors. Combining the prior with the vector representation results in a better model of the data. We call this method Conditional Network Embedding (CNE). Through empirical validation, we demonstrated that CNE leads to superior performance in downstream tasks (e.g., link prediction, node classification) compared to state-of-the-art approaches in graph learning. Directly integrating CNE into existing machine learning pipelines based on network embeddings is expected to improve their performance. To date, CNE has sparked multiple lines of follow-up research, including explainable AI, active representation learning, debiased representation learning, and more.
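The combination of a prior with a vector representation can be sketched with Bayes' rule. The Python snippet below is a minimal illustration under stated assumptions, not the CNE implementation: it assumes that distances between embedded nodes follow a half-normal density with a small spread for linked pairs and a larger spread for unlinked pairs, and the function name and the parameters sigma_linked and sigma_unlinked are hypothetical. The prior link probability stands for whatever structural knowledge has been encoded, for example knowledge about node degrees.

```python
import numpy as np

def link_posterior(dist, prior, sigma_linked=1.0, sigma_unlinked=2.0):
    """Posterior probability of a link given embedding distance and a prior.

    dist           : distance between two nodes in the embedding.
    prior          : prior probability that the two nodes are linked.
    sigma_linked   : assumed spread of distances for linked pairs.
    sigma_unlinked : assumed spread of distances for unlinked pairs.
    """
    dist = np.asarray(dist, dtype=float)
    prior = np.asarray(prior, dtype=float)

    def half_normal(d, sigma):
        # Density of a half-normal distribution evaluated at d >= 0.
        return np.sqrt(2.0 / np.pi) / sigma * np.exp(-d ** 2 / (2.0 * sigma ** 2))

    likelihood_link = half_normal(dist, sigma_linked)
    likelihood_no_link = half_normal(dist, sigma_unlinked)

    # Bayes' rule: the prior and the embedding each contribute evidence.
    numerator = likelihood_link * prior
    return numerator / (numerator + likelihood_no_link * (1.0 - prior))

# Same prior, different distances: nearby nodes are more likely linked.
near = link_posterior(dist=0.0, prior=0.5)
far = link_posterior(dist=3.0, prior=0.5)

# Same distance, different priors: the prior shifts the prediction.
far_strong_prior = link_posterior(dist=3.0, prior=0.9)
```

Under these assumptions, two nodes that are far apart in the embedding can still receive a sizeable link probability when the prior says a link is likely. This is the sense in which the embedding only has to account for what the prior leaves unexplained.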
Next, I will continue to collaborate with my advisors Prof. Tijl De Bie and Prof. Jefrey Lijffijt in the Artificial Intelligence and Data Analytics (AIDA) group at Ghent University. My PhD research was funded by Prof. De Bie’s ERC Consolidator grant FORSIED. Current research funding in the group includes Prof. De Bie’s Odysseus Group I project (Exploring Data: Theoretical Foundations and Applications to Web, multimedia, and Omics Data), the Flemish Government’s "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme, and several FWO projects. We expect the results of my PhD research to have a wide impact on research and practice in artificial intelligence (AI) and data science (DS).