SIMBIG

Program

US Eastern time (New York = Peru time): UTC−05:00

Wednesday December 1st

Time
Author(s)
Presentation
9h00 - 9h20 Welcome to SIMBig 2021

Data mining and Applications

9h20 - 9h40 Juan Ignacio Porta, Martin Ariel Dominguez and Francisco Tamarit Automatic data imputation in time series processing using neural networks for industry and medical datasets
9h40 - 10h00 Carlos Gamboa-Venegas, Steffan Gómez-Campos and Esteban Meneses Calibration of traffic simulations using simulated annealing and GPS navigation records
10h00 - 10h45
Keynote Speaker: Andrei Broder

Title: The Web Advertising Ecosystem


Abstract:

The World Wide Web is arguably an engineering artifact and social environment that defines our era. A large part of it is made possible by money generated via advertising. The goal of this talk is to give an introduction to the web advertising ecosystem and illuminate the complex relations between consumers, publishers, and advertisers.

10h45 - 11h05 Adrian Ulloa, Soledad Espezua, Julio Villavicencio, Oscar Miranda and Edwin Villanueva Predicting daily trends in the Lima Stock Exchange General Index using economic indicators and financial news sentiments
11h05 - 11h25 Miguel Nunez-Del-Prado and Leibnitz Rojas-Bustamante Government Public Services Presence Index based on Open Data
11h25 - 11h45 Edwin Alvarez Mamani, José Luis Soncco Álvarez and Harley Vera Olivera Clustering Analysis for Traffic Jam Detection for Intelligent Transportation System
PAUSE

Machine Learning and Deep Learning

14h10 - 14h30 Eya Hammami and Rim Faiz A Study of Dynamic Convolutional Neural Network Technique for SCOTUS legal opinions data classification
14h30 - 15h15
Keynote Speaker: Jiawei Han

Title: From Unstructured Text Data to Structured Knowledge: A Data-Driven Approach


Abstract:

Real-world big data are largely dynamic, interconnected, and unstructured text. It is highly desirable to transform such massive unstructured data into structured knowledge. Many researchers rely on labor-intensive labeling and curation to extract knowledge from such data. Such approaches, however, are not scalable. We envision that massive text data itself may disclose a large body of hidden structures and knowledge. With pretrained language models and text embedding methods, transforming unstructured data into structured knowledge looks promising. In this talk, we introduce a set of methods developed recently in our group for such an exploration, including joint spherical text embedding, discriminative topic mining, taxonomy construction, text classification, and taxonomy-guided text analysis. We show that a data-driven approach could be promising for transforming massive text data into structured knowledge.

15h15 - 15h35 Alonso Puente and Marks Calderon Hydra: Funding state prediction for Kickstarter Technology projects using a Multimodal Deep Learning
15h35 - 15h55 Naomi Rohrbaugh and Edgar Ceh-Varela Composite recommendations with heterogeneous graphs
15h55 - 16h15 Gianfranco Campos, Alessandro Morales, Arturo Flores and Jorge Gelso Energy Efficiency Using IOTA Tangle for Greenhouse Agriculture
16h30 - 17h15
Keynote Speaker: Jian Pei

Title: Towards Trustworthy Data Science: Interpretability, Fairness and Marketplaces


Abstract:

We believe data science and AI will change the world. No matter how smart and powerful an AI model we can build, the ultimate testimony of the success of data science and AI is users’ trust. How can we build trustworthy data science? At the level of user-model interaction, how can we convince users that a data analytic result is trustworthy? At the level of group-wise collaboration for data science and AI, how can we ensure that the parties and their contributions are recognized fairly, and establish trust between the outcome (e.g., a model built) of the group collaboration and the external users? At the level of data science participant eco-systems, how can we effectively and efficiently connect many participants of various roles and facilitate the connection among supplies and demands of data and models?
In this talk, I will brainstorm possible directions for addressing the above questions in the context of an end-to-end data science pipeline. To strengthen trustworthy interactions between models and users, I will advocate exact and consistent interpretation of machine learning models. Our recent results show that exact and consistent interpretations are not just theoretically feasible, but also practical even for API-based AI services. To build trust in collaboration among multiple participants in a coalition, I will review some progress in ensuring fairness in federated learning, including fair assessment of contributions and fairness enforcement in collaboration outcomes. Last, to address the need for trustworthy data science eco-systems, I will review some of the latest efforts in building data and model marketplaces and preserving fairness and privacy. Through reflection I will discuss some challenges and opportunities in building trustworthy data science for possible future work.

Thursday December 2nd


Data-Driven Software Engineering

9h20 - 9h40 Airton Huaman, Marco Huancahuari and Lenis Wong Multi-phase model based on K-means and Ant Colony Optimization to solve the capacitated vehicle routing problem with time windows
9h40 - 10h00 Geraldine Puntillo, Alonso Salazar and Lenis Wong Enterprise architecture based on TOGAF for the adaptation of educational institutions to e-learning using the DLPCA methodology and Google Classroom
10h00 - 10h45
Keynote Speaker: Jean Vanderdonckt

Title: Dimension reduction by model-based approaches: application to gesture recognition


Abstract:

Machine learning algorithms used for 2D/3D gesture recognition typically require a large training set of templates having many dimensions, depending on the sensor used. Instead of applying classical methods for reducing the dimensionality of these templates, we propose relying on a model-based approach where the problem is first mathematically described and then submitted to machine learning algorithms.

10h45 - 11h05 Pablo Del Aguila, Dante Roque and Lenis Wong Mobile app quality model based on SQuaRE and AHP

Health, NLP, and Social Media

11h05 - 11h25 Tereza Yallico Arias and Junior Fabian Automatic detection of levels of violence against women with Natural Language Processing using Machine Learning and Deep Learning techniques
11h25 - 11h45 Taghreed Tarmom, Eric Atwell and Mohammad Alsalka Deep Learning vs Compression-Based vs Traditional Machine Learning Classifiers to Detect Hadith Authenticity
11h45 - 12h05 Randa Zarnoufi and Mounia Abik Classical Machine Learning vs Deep Learning for Detecting Cyber-Violence in Social Media
PAUSE
14h10 - 14h30 Nuhu Ibrahim and Riza Batista-Navarro Automatic Detection of Deaths from Social Networking Sites
14h30 - 15h15
Keynote Speaker: Marinka Zitnik

Title: Infusing Structure and Knowledge Into Biomedical AI


Abstract:

Grand challenges in biology and medicine often lack annotated examples and require generalization to entirely new scenarios not seen during training. However, standard supervised learning is severely limited in scenarios such as designing novel medicines, modeling emerging pathogens, and treating rare diseases. In this talk, I will present our efforts to overcome these obstacles by infusing structure and knowledge into learning algorithms. First, I will present general-purpose and scalable algorithms for few-shot learning on graphs. At the core is the notion of local subgraphs that transfer knowledge from one task to another, even when only a handful of labeled examples are available. This principle is theoretically justified, as we show that the evidence for predictions can be found in subgraphs surrounding the targets. I will conclude with applications in drug development and precision medicine where the algorithmic predictions were validated in human cells and led to the discovery of a new class of drugs.

15h15 - 15h35 Camila Mantilla-Saavedra and Juan Gutiérrez-Cárdenas Model comparison for the classification of comments containing suicidal traits from Reddit via NLP and Supervised Learning
15h35 - 15h55 Syed Mehtab Alam, Elena Arsevska, Mathieu Roche and Maguelonne Teisseire A data-driven score model to assess online news articles in Event-based surveillance system
15h55 - 16h15 Tomonari Masada AmLDA: A Non-VAE Neural Topic Model
16h15 - 17h00
Keynote Speaker: Francisco Pereira

Title: Revealing interpretable object representations from human behaviour


Abstract:

Objects can be characterized according to a vast number of possible criteria (e.g. animacy, shape, color, function), but some dimensions are more useful than others for making sense of the objects around us. In this talk, I will describe an ongoing effort by our collaborators to collect a behavioral dataset of millions of odd-one-out similarity judgements on thousands of objects, and a new approach to identify the "core dimensions" of object representations used in those judgements. Our approach models each object as a sparse, non-negative embedding, and judgements as a function of the similarity of those embeddings. The resulting model predicts subject behaviour on test data, as well as the fine-grained structure of object similarity. The dimensions of the embedding space are coherently interpretable by test subjects, and reflect degrees of taxonomic membership, functionality, and perceptual or structural attributes, among other characteristics. Further, naive subjects can accurately rate objects along these dimensions, without training. Collectively, these results demonstrate that human similarity judgements can be captured by a fairly low-dimensional, interpretable embedding that generalizes to external behaviour.

17h00 - 17h20 Asma Aldrees, Cherie Poland and Syeda Arzoo Irshad Auditing Algorithms: Determining Ethical Parameters of Algorithmic Decision-Making Systems in Healthcare
17h20 - 17h35 Moises Meza, Willian Araujo, and Jesus Alvarado Bibliometric analysis using Spark and HPC techniques to search of potential inhibitors targeting SARS-CoV-2 Main Protease

Friday December 3rd


Image Processing

9h05 - 9h20 Ibrahim Shehzad, Adeel Zafar, Zahir Shah, and Zilli Huma Breast Cancer CT-Scan Image Classification Using Transfer Learning
9h20 - 9h40 Alejandra Valeria Lucero Burbano, Sherald Damian Noboa Chavez and Manuel Eugenio Morocho Cayamcela Plant Disease Classification and Severity Estimation: A Comparative Study of Multitask Convolutional Neural Networks and First Order Optimizers
9h40 - 10h00 Diego Hernán Suntaxi Domínguez, Oscar Vicente Guarnizo Cabezas, Jonnathan Fabricio Crespo Yaguana, Samantha Carolina Quintanchala Sandoval, Israel Gustavo Pineda Arias and Manuel Eugenio Morocho Cayamcela Deep Learning and Computer Vision in Smart Agriculture: Datasets, Models, and Applications
10h00 - 10h20 Carla Rucoba, Efrain Ramos and Juan Gutierrez-Cardenas Crack detection in oil paintings using morphological filters and K-SVD algorithm
10h20 - 10h40 Filomen Incahuanaco Quispe, Edward Hinojosa Cardenas, Denis Pilares Figueroa and Cesar Beltrán Castañón CoffeeSE: Interpretable transfer learning method for estimating the severity of coffee rust
10h40 - 11h00 Joel Cabrera and Edwin Villanueva Investigating generative neural-network models for building pest insect detectors in sticky trap images for the Peruvian horticulture

Semantic and Machine Learning

11h00 - 11h45
Keynote Speaker: Vipin Kumar

Title: Big data in water: Opportunities and challenges for machine learning


Abstract:

Water resources worldwide are coming under stress due to increasing demand from a growing population, increasing pollution, and depleting or uncertain supplies due to a changing climate in which droughts and floods have both become more frequent. As domains associated with water continue to experience tremendous data growth from models, sensors, and satellites, there is an unprecedented opportunity for machine learning to help address urgent water challenges facing humanity. This talk will examine the role that big data and machine learning can play in advancing water science, challenges faced by traditional machine learning methods in addressing the domain of water, and some early successes.

11h45 - 12h00 David Gatta, Kilian Hinteregger and Anna Fensel Making Licensing of Content and Data Explicit with Semantics and Blockchain
12h00 - 12h45
Keynote Speaker: Natasha Noy

Title: Google Dataset Search: Building an open ecosystem for dataset discovery


Abstract:

There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others’ work, and providing data journalists easier access to information and its provenance. This talk will discuss Dataset Search by Google, which provides search capabilities over potentially all dataset repositories on the Web. We will talk about the open ecosystem for describing datasets that we hope to encourage.

12h45 - 13h05 Chetraj Pandey, Rafal Angryk and Berkay Aydin Deep Neural Networks based Solar Flare Prediction using Compressed Full-disk Line-of-sight Magnetograms
13h05 - 13h25 David Gatta, Kilian Hinteregger and Anna Fensel Prediction Of Soil Saturated Electrical Conductivity By Statistical Learning
13h25 - 13h35 Closing SIMBig 2021


Contacts

Juan Antonio Lossio-Ventura

Ph.D. in Computer Science

National Institutes of Health

Bethesda, USA

Hugo Alatrista-Salas

Ph.D. in Computer Science

Universidad de Ingeniería y Tecnología - UTEC

Lima, Peru