Teaching 2024-2025

The Big Data Master is a full-time master's programme delivered entirely online that starts in January and lasts one year. The teaching activity has two main phases: the first phase, which runs from November to the end of July, is dedicated to lectures and project activities; the second phase, which runs from August to December, is dedicated to a 475-hour internship hosted by the Partners.

The weekly teaching activity during the first phase is divided into lectures (Wednesday to Saturday) and optional lab hours, which allow students to try out in practice the methods shown in the lectures with the support of the tutors. The steering committee has decided that attendance of at least 70% of the lecture hours is required. The weekly teaching activity is described in the following images:

European Credit Transfer System (ECTS)

A European Credit Transfer System (ECTS) credit is the unit that measures the volume of learning work a student with adequate initial knowledge needs in order to acquire the knowledge and abilities required by a teaching activity. One credit is equivalent to 25 hours of work, including lectures, individual study and other types of activities (such as the internship). Each teaching activity has an associated number of credits, which are assigned after an exam.
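As a quick illustration of the credit arithmetic above, the following sketch converts credits into total hours of work, assuming the 25-hours-per-credit equivalence stated in the text. The function name `ects_to_hours` is hypothetical and used only for illustration.

```python
# Hours of student work per ECTS credit, as stated in the text above.
HOURS_PER_CREDIT = 25


def ects_to_hours(credits: int) -> int:
    """Return the total workload (lectures, study, other activities) in hours."""
    if credits < 0:
        raise ValueError("credits must be non-negative")
    return credits * HOURS_PER_CREDIT


# Example: a 5-credit module corresponds to 125 hours of overall work.
print(ects_to_hours(5))  # prints 125
```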

The teaching activity is supported by the Moodle platform, where students will find all the teaching material.

 


The teaching activity will start on November 21st, 2024.
The scheduled courses and their educational objectives are listed below.
 
Tutors:
Tools:
Credits: 5
Hours: 60
 
Description:

The module aims to align the students' competences in computer science and basic analytics, especially in databases and Python programming for data science. Starting from a theoretical introduction to the basics of programming and relational database modelling, the course focuses on practical lectures for learning to model and query databases and to solve problems by writing Python programs in both static and dynamic environments. This module is based on hands-on work.

 
Competences:

At the end of the course the student will be able to 

  • Model and query a database
  • Approach a problem with an adequate problem solving strategy
  • Implement simple Python programs
 
Notions:
  • Lesson 1
    • Introduction to Python: data types, variables
  • Lesson 2
    • Functions, function environments and variable scope, input/output, conditional statements, loops
  • Lesson 3
    • Strings, tuples, lists, dictionaries 
  • Lesson 4
    • File Access, debugging, environment management
  • Lesson 5
    • Object oriented programming 
  • Lesson 6
    • NumPy & Pandas
 
Techniques and Tools:
  • SQL 
  • Microsoft SQL Server
  • Python
  • Numpy
  • Pandas

 

Tutors:
Tools:
Credits: 3
Hours: 36
 
Description:

This module presents artificial intelligence techniques for building analytics on text and data from the Web. The course is organized around three main strands: i) text analytics, which covers text mining methods applied to texts and social media; ii) ranking, through "learning to rank" techniques that estimate the relevance of objects with respect to user requirements; iii) web mining, which exploits user usage data to improve the quality of services. Using the query logs of a real search engine as a case study, students will be guided in the development of a set of data analysis methodologies that aim to create the knowledge base necessary to build a recommender system.

Prerequisites: Machine Learning and Python

 
Competences:

Ability to correctly identify and implement text and web analytics. Ability to use state-of-the-art solutions for text classification, sentiment analysis, sentiment classification, ranking. Ability to use learning to rank techniques and Transformer networks for text (BERT). Ability to define a Web mining problem and design a solution.

 
Notions:
  • Text Analytics
    • Properties of the Language and its Representation
    • Analytics on Text: Tasks, Methods, Applications
    • Language Models: from pure statistical approaches to learned complex solutions
    • Sparse vs. Dense Representations with Neural Approaches
    • Sentiment Analysis & Classification
    • Sentiment Analysis in Python
  • Ranking 
    • Machine Learning for Ranking: from standard techniques to BERT
    • Applications of Neural Networks to Text Ranking: Haystack & HuggingFace
    • Ranking with BERT
  • Web Mining
    • Analytics on Web Usage Data: query log mining for recommendation
    • Methods for Query Suggestion
    • Query Suggestion in Python and ElasticSearch

 

 
Techniques and Tools:

Python libraries: 

  • NLTK
  • SpaCy
  • Scikit-learn
  • GenSim
  • VADER
  • Keras
  • Pytorch
  • Haystack
  • Huggingface Transformers
  • ElasticSearch
  • LightGBM
  • BASH
Tutors:
Tools:
Credits: 2
Hours: 24
 
Description:

The module introduces the ethical and legal notions of privacy, anonymity, transparency and discrimination, also in light of the General Data Protection Regulation (GDPR). It presents technologies for implementing the privacy-by-design principle, for auditing predictive models, and for protecting users' rights, with the goal of enabling Big Data analysis while guaranteeing personal data protection, transparency and non-discrimination.

 
Competences:

At the end of the course the student will be able to analyze the ethical issues in a knowledge discovery process, also with reference to the EU legal framework, and will be familiar with some of the available tools for assessing ethical issues.

 
Notions:
  • Lesson 1
    • Introduction to Big Data Ethics
    • The European legal framework 
  • Lesson 2
    • Privacy-by-Design in Big Data Analytics
    • Data Protection, Privacy and Privacy Models
  • Lesson 3
    • Privacy Risk Assessment & Prediction
    • Privacy-Protection Techniques
    • Privacy Assessment in Machine Learning
  • Lesson 4
    • Introduction to biases and understanding biases
    • Understanding, Testing, Discovering and Mitigating Discrimination
  • Lesson 5
    • Introduction to Explainable AI
    • Explanation Techniques
 
Techniques and Tools:

pandas
sklearn
numpy
seaborn
matplotlib
fairlearn
lime
dalex
shap
lore
scikit-mobility

Tutors:
Tools:
Credits: 2
Hours: 24
 
Description:

The module is organised in lectures on case studies and real applications showing the use of Big Data analytics and Social Mining. These lectures describe activities of the SoBigData.eu laboratory and of the companies and institutions which are partners of the Master.

 
Competences:
 
Notions:
 
Techniques and Tools:
Teachers: Fagni Tiziano
Tutors:
Tools:
Credits: 2
Hours: 24
 
Description:

The module presents the characteristics and peculiarities of "big data", highlighting through specific use cases the growing importance of being able to extract significant information and valuable insights from this enormous amount of heterogeneous data (for example data from sensors, purchase and consumption data, data from social media and social networks, open data, etc.). Participatory methods of data collection through crowdsourcing and crowdsensing systems are also discussed, with popular examples of how these concepts are applied. The practical part focuses instead on data ingestion, presenting data crawling and scraping methodologies with concrete examples on Social Media and the Web, as well as the use of pre-compiled, publicly available datasets.

Prerequisites: Python

 
Competences:
  • Theoretical knowledge:
    • Characterization of "big data" and of the knowledge that can potentially be gained from analysing it
    • Data characterization: open sources, closed sources, open data and linked open data. Data collection or development of specific services that exploit groups of users (crowdsensing, crowdsourcing).
    • HTML/CSS technologies underlying the functioning of the Web
    • REST architectures
    • Social media with focus on Twitter and Reddit: analysis of the main characteristics of social networks and high-level overview of the available APIs.
  • Practical knowledge:
    • Use of HTML tags and CSS selectors for creating web pages.
    • Website scraping with concrete examples using the Selenium and Beautiful Soup libraries
    • Social media crawling with concrete examples using the Reddit API through the PRAW library.
    • Parsing of data in CSV/JSON format
 
Notions:
  • Lesson 1
    • Introduction to big data and the various data sources that characterize them
    • Open data and linked open data, crowdsourcing and crowdsensing
    • Big data analytics: interesting use cases
  • Lesson 2
    • Social media crawling: REST architecture and OAuth authentication framework, Twitter and Reddit overview
    • Introduction to using the PRAW library for data access to Reddit + exercises with PRAW
  • Lesson 3
    • Exercises with PRAW
    • Introduction to HTML/CSS technologies
  • Lesson 4
    • HTML/CSS exercises
    • Introduction to Web scraping in Python: Selenium and Beautiful Soup
  • Lesson 5
    • Exercises on Selenium
  • Lesson 6
    • Exercises on BeautifulSoup
    • CSV/JSON data parsing
  • Exam
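The CSV/JSON parsing topic in Lesson 6 can be illustrated with Python's standard library alone. The sketch below is not taken from the course material; the sample records and the field names (`author`, `score`) are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample data; a real exercise would read files collected by a crawler.
CSV_TEXT = "author,score\nalice,10\nbob,7\n"
JSON_TEXT = '[{"author": "carol", "score": 3}]'


def parse_csv(text: str) -> list:
    """Parse CSV text into a list of dictionaries, converting score to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        row["score"] = int(row["score"])
        rows.append(row)
    return rows


def parse_json(text: str) -> list:
    """Parse a JSON array of objects into a list of dictionaries."""
    return json.loads(text)


# Merge both sources into one list of records and aggregate a field.
records = parse_csv(CSV_TEXT) + parse_json(JSON_TEXT)
total = sum(r["score"] for r in records)
print(total)  # prints 20
```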
 
Techniques and Tools:
  • Selenium
  • Beautiful Soup
  • PRAW
     
Tutors:
Tools:
Credits: 2
Hours: 24
 
Description:

The module presents the methodological aspects, technologies and systems for designing, populating and querying Data Warehouses for decision support. The emphasis is placed on the analysis of application problems using examples and case studies, with laboratory exercises.

Prerequisites: knowledge of basic SQL, Excel, Python programming.

 
Competences:

The student will acquire knowledge and skills on the main Business Intelligence technologies such as ETL (Extract, Transform and Load), Data Warehousing, Analytic SQL and OLAP (Online Analytical Processing). The course also touches on scalability issues and NoSQL architectures.

 
Notions:
  • Lesson 1: Introduction to Data Warehousing
    • OLAP vs. OLTP
    • Design phases of a DW
    • Data model: logical model
    • Case Studies
  • Lesson 2: Analytical SQL
    • ROLLUP and CUBE
    • OVER clause
    • Windowing
    • SQL Server tutorials
  • Lesson 3: Extract Transform and Load (ETL)
    • RDBMS access standard
    • ETL operations: control flow and data flow
    • The SSIS System: SQL Server Integration Services
    • Tutorials in SSIS
  • Lesson 4: Online Analytical Processing (OLAP)
    • The multidimensional model
    • The SSAS system: SQL Server Analysis Services
    • Reporting: Microsoft Power BI
    • SSAS/Power BI tutorials
  • Lesson 5: Scalability and API
    • Scalability of DW systems
    • NoSQL Data Model
    • NoSQL Big Data Platforms
    • Python API for SQL and NoSQL
 
Techniques and Tools:

pyodbc

Tutors:
Tools:
Credits: 4
Hours: 40
 
Description:

The formidable advances in computing power, data acquisition, data storage and connectivity have created unprecedented amounts of data. Data mining, i.e., the science of extracting knowledge from these masses of data, has therefore established itself as an interdisciplinary branch of computer science. Data mining techniques have been applied to many industrial, scientific, and social problems, and are expected to have an ever deeper impact on society. In addition, the wide availability of data has made it possible to build highly accurate predictive models through Machine Learning techniques. The course objective is to provide an introduction to the basic concepts of data mining and machine learning and to the process of extracting knowledge, with insights into analytical and predictive models and the most common algorithms.

 
Competences:

At the end of the course the student will be able to 

  • Design a KDD process
  • Apply the different data mining & machine learning techniques on the basis of the analytical question to be answered
  • Use data mining & machine learning tools and Python libraries
  • Simulate how the data mining & machine learning algorithms work
  • Select the best algorithm for the right problem setting
 
Notions:
  • Lesson 1
    • Introduction to Data Mining
    • Data Understanding
  • Lesson 2
    • Data Preparation & Feature Engineering
    • Data Similarity Measures
  • Lesson 3
    • Introduction to Clustering
    • Clustering Evaluation
    • K-Means
  • Lesson 4
    • Density-based Clustering: DBSCAN & OPTICS
    • Hierarchical Clustering: Max-Linkage & Min-Linkage
  • Lesson 5
    • Introduction to Machine Learning
    • The Classification Problem
    • Classification Evaluation Measures
  • Lesson 6
    • K Nearest Neighbor Classifier
  • Lesson 7
    • Decision Tree Classifier
  • Lesson 8
    • Support Vector Machines
  • Lesson 9
    • Random Forest Classifier
  • Lesson 10
    • Machine Learning Models for Regression
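Lesson 5 lists classification evaluation measures. A minimal sketch of the most common ones (accuracy, precision, recall, F1) for a binary problem follows; the function name `binary_metrics` and the toy labels are illustrative and not part of the course material.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Confusion-matrix counts with respect to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example with made-up labels: 2 true positives, 1 false positive,
# 1 false negative and 2 true negatives.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

In practice the sklearn library listed among the course tools provides equivalent functions; the hand-written version only makes the definitions explicit.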
 
Techniques and Tools:
  • numpy
  • matplotlib
  • pandas
  • scipy
  • sklearn
Tutors:
Tools:
Credits: 3
Hours: 30
 
Description:

The Data Visualization and Visual Analytics course provides a comprehensive introduction to producing effective and efficient visualizations and to storytelling through data visualization. During the course, the students will explore the basics of visual encoding, the mapping of data onto visual variables, and visual analytics techniques.

 
Competences:
  • How to encode data and models in an efficient and effective visualization, limiting the impact of cognitive biases.
  • How to design and encode a visual representation through modern data visualization libraries
 
Notions:
  • Introduction to Data Visualization, basic concepts of visual perception, Visual Variables
  • Use cases of good and bad practices of visualizations.
  • Introduction to the library Altair
  • Visual Variables and Scales in Altair
  • Visualization of Geographical Data with Folium
  • Color models and color scales
  • Data Publishing: Principles of Web Application Design, A sample of a web application layout
 
Techniques and Tools:
Tutors:
Tools:
Credits: 1
Hours: 12
 
Description:

Building on the innovation management literature, this course aims to provide a broad and up-to-date understanding of the multi-level key issues regarding firms’ data-driven innovation processes. More specifically, the course aims to present how big data can drive companies’ innovation processes. After a preliminary discussion of the key aspects that characterize companies’ innovation processes, emphasis will be placed on practical tools such as the Business Model Canvas. Then, the focus will shift to the new innovation opportunities made possible by recent advances in data collection and data processing techniques for big data. Finally, the key concepts and models of innovation will be re-interpreted by exploiting the potential of Big Data to open up new business opportunities. This course is based on several hands-on activities and will host a testimonial from a big data company. The main objectives of this course are:

  • To provide an overview of the main theoretical frameworks and analytical tools needed to disentangle the key managerial concerns behind innovation management and their overall impact on a firm’s organization and performance;
  • To equip participants with some practical tools that are very important for developing the business model of a company in the big data era.
 
Competences:
  • Demonstrate knowledge and understanding of the theoretical frameworks and practical tools for the study and analysis of the sources, types, patterns, and management of innovation; 
  • Analyze and critically discuss the main issues in innovation management in the light of real business case examples and testimonials; 
  • Discuss information, ideas, problems, and practical solutions in the field of data driven innovation management; 
     
 
Notions:
  • Lecture 1 
    • Overview of the course
    • Basic notions of innovation (definitions, measures, and sources of innovation)
    • Business model innovation: the importance of the Business Model Canvas
    • Hands-on activity: group work on the Business Model Canvas
  • Lecture 2 
    • The role of Big Data for innovation processes and products
    • Business model canvas in Big Data companies
    • Hands-on activity: group work on the Business Model Canvas in Big Data companies
  • Lecture 3
    • Hands-on activity: testimonial from a Big Data company
    • Wrap up questions
       
 
Techniques and Tools:
Tutors:
Tools:
Credits: 3
Hours: 36
 
Description:

The module presents the methodological aspects, technologies and systems for designing predictive systems of Artificial Intelligence through machine learning and deep neural networks. The emphasis is placed on the analysis of application problems using examples and case studies, with practical exercises.

Prerequisites: Python & Data Mining & Machine Learning

 
Competences:

The student will acquire knowledge and skills on the main technologies for machine learning through deep neural networks. The student will also be introduced to the application problems of Artificial Intelligence and gain the basic knowledge needed to apply these methodologies to new problems.

 
Notions:
  • Lecture 1: Fundamentals of Machine Learning for AI
    • Introduction to the course
    • Machine learning and AI
    • Machine learning paradigms
    • Model Selection
    • Hands-on Session for data processing with Numpy / Scikit-learn
  • Lecture 2: Fundamentals of Neural Networks
    • Biological and Artificial Neuron
    • Logistic regression as a neural network with 1 Neuron
    • Hands-on Session with Numpy
  • Lecture 3: Neural Network Training
    • Optimization algorithms
    • Stochastic-Gradient Descent (SGD) and Backpropagation
    • Tricks and Tips for NNs training
    • Hands-on Session with Keras (training) and Tensorboard
  • Lecture 4: Multi-layer Perceptron (4 hours)
    • From shallow networks to deep learning
    • MLP and Deep Feedforward Networks
    • Solving an image classification problem with Keras
  • Lecture 5: Convolutional Neural Networks (CNN) (4 Hours)
    • Visual processing with neural networks
    • Convolutions and CNN building blocks
    • Hands-on Session with Keras and CNNs
  • Lecture 6: Recurrent Neural Networks (RNN) (4 Hours)
    • Vanilla Recurrent Networks
    • Gated Recurrent Models (LSTM)
    • Hands-on Session with Keras for Sequence Classification
  • Lecture 7: Neural Networks Applications (4 hours)
    • Deep neural networks tools and libraries (Tensorflow and Pytorch)
    • NNs for Computer Vision Applications (Recognition, Detection and Segmentation)
    • NNs for Time-series (forecasting and classification)
    • Use of pre-trained models in Keras
  • Lecture 8: Advanced Deep Learning topics (4 Hours)
    • Autoencoders
    • Generative models
    • Continual Learning
    • Recent applications
  • Lecture 9: Application of AI-based Deep Learning Methods 
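Lecture 2 presents logistic regression as a neural network with one neuron. The sketch below illustrates that idea in plain Python: a weighted sum followed by a sigmoid, trained with stochastic gradient descent on the log-loss. The toy data (the logical OR function), the learning rate and the number of epochs are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Forward pass of a single neuron: sigmoid(w . x + b)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(samples, labels, lr=0.5, epochs=2000):
    """Fit one neuron with stochastic gradient descent on the log-loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            # For the log-loss, the gradient w.r.t. the pre-activation
            # is simply (prediction - target).
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Toy data: the logical OR function, which is linearly separable.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]
w, b = train(X, y)
predictions = [round(predict(w, b, x)) for x in X]
print(predictions)  # prints [0, 1, 1, 1]
```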
     
 
Techniques and Tools:
  • Tensorflow
  • Keras
  • Pytorch
  • Scikit-learn
  • Numpy
  • Pandas
  • Matplotlib
  • Seaborn
Teachers: Ferragina Paolo
Tutors:
Tools:
Credits: 3
Hours: 36
 
Description:

The course introduces the design, implementation and analysis of Information Retrieval systems that are efficient and effective in managing and searching information stored as collections of texts, possibly unstructured (e.g. the Web), and as labeled graphs (e.g. knowledge graphs). The theoretical lessons will describe the main components of a modern Information Retrieval system, more precisely of a search engine: crawler, text analyzer, storage and compressed index, query solver, text annotator (based on knowledge graphs and entity linkers), and rankers. The laboratory lessons will put into practice what has been learned "in theory" with the help of well-known software libraries: ElasticSearch (an open-source search engine), Neo4J (a graphDB), and TagMe and Swat (two entity annotators). The exam will consist of a written test, aimed at evaluating the knowledge acquired in both the theoretical and the laboratory lessons (weight 60%), and of a software project carried out jointly with various other courses (weight 40%), whose objective is to evaluate the technical skills in the use of the aforementioned libraries.

Prerequisites: Basic notions of algorithms, programming and use of Python programming environments.

 
Competences:
 
Notions:
  • Lesson 1
    • Introduction to IR and history of search engines
    • The structure of a search engine
    • Inverted lists and query resolution for AND and soft-AND, phrase, proximity, and zone
  • Lesson 2 
    • The Web graph, its structure (bow tie), its properties, its representation in memory and examples of graph traversal algorithms (BFS and DFS).
  • Lesson 3
    • The crawling module, the parsing module, keyword extraction (with PoS tag, Rake, and statistics).
  • Lesson 4 
    • Creation in Python of a text parser (tokenization, stopword removal, normalization and stemming), and of a word cloud.
  • Lesson 5 
    • The first generation of search engines (Altavista, Lycos,…)
    • The laws of Zipf, Heaps and Luhn
    • Textual ranking: Jaccard and TF-IDF
    • The vector space model and cosine similarity
    • Text spam.
  • Lesson 6
    • ElasticSearch.
  • Lesson 7
    • The second generation of search engines (Google et al.)
    • Ranking based on the Web graph, random walk and PageRank, Topic-based and Personalized PageRank.
    • Evaluation of a search engine: precision, recall and F1.
  • Lesson 8
    • Knowledge graph and latest generation search engines
    • Entity linkers and semantic text annotation TagMe
    • Entity linker applications: keyword extraction, representation and comparison of texts using labeled and weighted graphs, reasoning.
    • Use of TagMe library and Swat library.
  • Lesson 9
    • Definition, properties and functionality of GraphDBs using the Neo4J library.
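Lesson 1 covers inverted lists and AND query resolution. A minimal sketch of the idea: build a term-to-document-IDs index and resolve an AND query by intersecting the sorted posting lists. The toy documents and the whitespace tokenizer are simplifying assumptions; a real engine would use a proper text analyzer and compressed postings, and the function names here are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def and_query(index, terms):
    """Return the IDs of the documents that contain every query term."""
    # Start from the shortest list so intermediate results stay small.
    lists = sorted((index.get(t.lower(), []) for t in terms), key=len)
    if not lists:
        return []
    result = lists[0]
    for postings in lists[1:]:
        result = intersect(result, postings)
    return result


docs = ["big data search engine", "search engine ranking", "big data analytics"]
index = build_index(docs)
print(and_query(index, ["search", "engine"]))  # prints [0, 1]
print(and_query(index, ["big", "engine"]))     # prints [0]
```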
 
Techniques and Tools:
  • requests
  • elasticsearch
  • nltk
  • wordcloud
  • matplotlib
     
Teachers: Partners
Tutors:
Area: General
Tools:
Credits: 18
Hours: 475
 
Description:

The master requires an internship to be carried out at one of the partners (companies or institutions) or at the company where the student currently works, on the basis of a well-defined project and under the supervision of a team of tutors composed of instructors and company experts. The internship might require in-person work at the partners' offices or hybrid solutions with remote working.

 
Competences:

Thesis report of the activities done during the internship including the results achieved.

 
Notions:
 
Techniques and Tools:
Tutors:
Tools:
Credits: 4
Hours: 48
 
Description:

In this module, groups of students will be guided to design and develop an entire project in Big Data and AI, from data collection to the final delivery. In the project the students will apply methods, techniques and tools studied in the other modules. Unlike the other modules, this one spans several months, until the end of the lectures, when the results of the project will be presented in front of a committee.

 
Competences:

Putting together all the competencies learned during the Master.

 
Notions:
  • Big Project structure and activities presentation
  • Project assignments and definitions
  • Project Proposal
  • Data collection
  • Data analysis
  • Project Verification
  • Interviews
  • Workflows design and models training
  • Results evaluation
  • Results dissemination
  • Website
  • Project Evaluation
 
Techniques and Tools:
Tutors:
Tools:
Credits: 2
Hours: 24
 
Description:

Over the past decade, there has been a growing public fascination with the complex “connectedness” of modern society. This connectedness is found in many contexts: in the rapid growth of the Internet and the Web, in the ease with which global communication now takes place, and in the ability of news and information as well as epidemics and financial crises to spread around the world with surprising speed and intensity. These are phenomena that involve networks and the aggregate behavior of groups of people; they are based on the links that connect us and the ways in which each of our decisions can have subtle consequences for the outcomes of everyone else. This crash course is an introduction to the analysis of complex networks, made possible by the availability of big data, with a special focus on social networks and their structure and function. Drawing on ideas from computing and information science, complex systems, mathematical and statistical modeling, economics, and sociology, this course briefly outlines the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected.

Prerequisites: Python, Data Mining

 
Competences:

Complex networks modeling and analysis
 

 
Notions:
  • Lecture 1: Intro: Why should we care about Complex Networks? Networks & Graphs: Basic Measures 
  • Lecture 2: Random Networks, Small World property, Scale Free networks
  • Lecture 3: Measuring Node Centrality & Tie Strength
  • Lecture 4: Community Detection
  • Lecture 5: Resilience to attacks and failures
  • Lecture 6: Epidemics
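Lecture 3 deals with node centrality. As a small illustration, the sketch below computes degree centrality (the fraction of the other nodes that a node is connected to) on a toy undirected graph stored as an edge list. The networkx library listed among the course tools offers this measure; the plain-Python version here is only meant to show the definition, and the edge list is made up.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return degree / (n - 1) for every node of an undirected simple graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n <= 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Toy graph: "a" is connected to every other node, "b" and "c" also to each other.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
centrality = degree_centrality(edges)
print(centrality["a"])  # prints 1.0
```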
     
 
Techniques and Tools:

networkx

cdlib

ndlib

Teachers: Lillo Fabrizio
Tutors:
Tools: R
Credits: 2
Hours: 24
 
Description:

The course introduces the student to the main concepts of statistical analysis, the methods used and the software implementations to carry out a quantitative and rigorous study of a dataset. After introducing the basic tools of descriptive statistics, the course focuses on probabilistic statistics and its use for data modelling, estimation methods through an inferential approach and statistical hypothesis testing. The course also introduces the concepts of linear and logistic regression (also multivariate) and the computational bootstrap techniques for estimating parameters and confidence intervals.

Prerequisites: Python

 
Competences:
  • Know how to use, and understand, the main tools of descriptive and probabilistic statistics.
  • Know how to conduct a statistical analysis of a dataset
  • Build a probabilistic model, estimate the model parameters, assess its goodness of fit and use it for prediction.
 
Notions:
  • Lesson 1: Introduction to data and descriptive statistics
  • Lesson 2: Basic concepts of probability and use for data modeling
  • Lesson 3: Statistical inference. Statistical estimation methods. Correlation and dependence
  • Lesson 4: Simple and multiple regression, logistic regression and classification
  • Lesson 5: Confidence intervals. Bootstrap
  • Lesson 6: Hypothesis testing
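Lesson 5 mentions the bootstrap for confidence intervals. The sketch below shows the percentile bootstrap for the mean using only the Python standard library; the sample values, the number of resamples and the fixed seed are arbitrary assumptions chosen for reproducibility, and `bootstrap_ci` is a hypothetical name.

```python
import random
import statistics


def bootstrap_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of data."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


sample = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.0, 3.0, 2.7, 2.2]
low, high = bootstrap_ci(sample)
print(f"95% CI for the mean: [{low:.2f}, {high:.2f}]")
```

The percentile method is the simplest bootstrap interval; other variants exist and may be preferable for small or skewed samples.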
 
Techniques and Tools:
Tutors:
Tools:
Credits: 3
Hours: 30
 
Description:

The course will deal with time series and spatio-temporal data, in particular mobility. We will illustrate the fundamental characteristics of these two data classes as well as the most common pre-processing and analysis methods. Finally, each lesson will provide examples of use and exercises carried out in Python with the appropriate libraries.

Prerequisites: Data Mining & Machine Learning, Python

 
Competences:
  • Knowledge of the fundamental characteristics of various data sources for time series and mobility
  • Knowledge of basic analytical methods (predictive, clustering and pattern mining) for time series and mobility data
  • Ability to implement simple analytical processes in Python for time series and mobility data
 
Notions:
  • Lesson 1: Time Series: characteristics and similarity measures
  • Lesson 2: Time Series: patterns (Motifs, Discords, Sequential patterns)
  • Lesson 3: Time Series: forecasting
  • Lesson 4: Introduction to Geospatial Analytics and fundamental concepts
  • Lesson 5: Geospatial and Mobility data preprocessing and semantic enrichment
  • Lesson 6: Individual & Collective mobility laws and models
  • Lesson 7: Mobility Patterns and Location prediction
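Lesson 1 covers similarity measures for time series. One widely used measure is dynamic time warping (DTW); the plain-Python sketch below is only meant to show the underlying dynamic program, using the absolute difference as local cost. Library implementations may use a different local cost, and the two toy series are made up.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance with absolute difference as local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


# The second series is a time-stretched copy of the first, so the distance is zero.
x = [1, 2, 3, 2, 1]
y = [1, 1, 2, 3, 3, 2, 1]
print(dtw_distance(x, y))  # prints 0.0
```

Unlike a point-by-point Euclidean distance, DTW can compare series of different lengths, which is why the stretched copy above is at distance zero.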
 
Techniques and Tools:
  • scikit-mobility
  • pandas
  • tslearn
     
