Teaching 2024-2025
The Master Big Data is a full-time master's programme delivered entirely online; it lasts one year, starting in November. Teaching is organised in two main phases: the first, from November to the end of July, is devoted to lectures and project activities; the second, from August to December, is devoted to the 475-hour internship that students carry out with our partners.
During the first phase, the weekly schedule includes several hours of lectures, concentrated between Wednesday and Saturday, and optional laboratory hours in which students can practise the methods presented in class with the support of tutors. The Master's board has established that, to obtain the degree, students must attend at least 70% of the lecture hours. The weekly teaching schedule is described in detail in the following images:
University Educational Credits (CFU)
The University Educational Credit (Credito Formativo Universitario, CFU) is the unit that measures the amount of learning work required of a student with adequate initial preparation to acquire the knowledge and skills of a given educational activity. One CFU corresponds to 25 hours of overall work, which includes lectures, individual study and other activities (such as the internship). Each educational activity carries a certain number of credits, which are earned by passing an assessment; credits do not replace the grade.
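As a quick worked example of the 25-hours-per-credit rule, the sketch below converts between hours and CFU. The function names are illustrative, and applying the rule to the 475-hour internship is only an arithmetic illustration, not an official credit allocation.

```python
HOURS_PER_CFU = 25  # one CFU corresponds to 25 hours of overall work


def cfu_to_hours(cfu: float) -> float:
    """Total workload in hours for a given number of credits."""
    return cfu * HOURS_PER_CFU


def hours_to_cfu(hours: float) -> float:
    """Credits corresponding to a given workload in hours."""
    return hours / HOURS_PER_CFU


# Illustration only: 475 hours expressed in credits under the 25-hour rule.
print(hours_to_cfu(475))  # → 19.0
```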
Teaching is supported by the Moodle platform, where students will find all the course materials.
Courses 2024-2025
The courses offered, with their learning objectives, are listed below.
The module aims to align the students' competences in computer science and basic analytics, especially in databases and Python programming for data science. Starting from a theoretical introduction to the basics of programming and relational database modelling, the course focuses on practical lectures on modelling and querying databases and on solving problems by writing Python programs in both static and dynamic environments. This module is based on hands-on work.
At the end of the course the student will be able to:
- Model and query a database
- Approach a problem with an adequate problem solving strategy
- Implement simple Python programs
- Lesson 1
- Introduction to Python: data types, variables
- Lesson 2
- Functions, scope of function variables (environments), input/output, conditional statements, loops
- Lesson 3
- Strings, tuples, lists, dictionaries
- Lesson 4
- File Access, debugging, environment management
- Lesson 5
- Object oriented programming
- Lesson 6
- Numpy & Pandas
- SQL
- Microsoft SQL Server
- Python
- Numpy
- Pandas
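To give a flavour of the "model and query a database" objective, here is a minimal sketch that models two related tables and runs a join with aggregation. It uses Python's built-in sqlite3 module as a stand-in for Microsoft SQL Server, and the schema and data are invented for illustration.

```python
import sqlite3

# In-memory database; sqlite3 stands in for Microsoft SQL Server here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE exam (
        student_id INTEGER REFERENCES student(id),
        course TEXT NOT NULL,
        grade INTEGER NOT NULL
    );
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Alan');
    INSERT INTO exam VALUES (1, 'Python', 30), (1, 'SQL', 28), (2, 'Python', 27);
""")

# Average grade per student, best average first.
rows = conn.execute("""
    SELECT s.name, AVG(e.grade) AS avg_grade
    FROM student AS s
    JOIN exam AS e ON e.student_id = s.id
    GROUP BY s.name
    ORDER BY avg_grade DESC
""").fetchall()

for name, avg_grade in rows:
    print(name, avg_grade)
```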
This module presents artificial intelligence techniques for building analytics on text and data from the Web. The course is organized around three main strands: i) text analytics, which covers text mining methods applied to texts and social media; ii) ranking, through "learning to rank" techniques that estimate the relevance of objects with respect to user requirements; iii) web mining, which exploits user usage data to improve the quality of services. Using the query logs of a real search engine as a case study, students will be guided in developing a set of data analysis methodologies aimed at creating the knowledge base needed to build a recommender system.
Prerequisites: Machine Learning and Python
Ability to correctly identify and implement text and web analytics. Ability to use state-of-the-art solutions for text classification, sentiment analysis, sentiment classification, ranking. Ability to use learning to rank techniques and Transformer networks for text (BERT). Ability to define a Web mining problem and design a solution.
- Text Analytics
- Properties of the Language and its Representation
- Analytics on Text: Tasks, Methods, Applications
- Language Models: from pure statistical approaches to learned complex solutions
- Sparse vs. Dense Representations with Neural Approaches
- Sentiment Analysis & Classification
- Sentiment Analysis in Python
- Ranking
- Machine Learning for Ranking: from standard techniques to BERT
- Applications of Neural Networks to Text Ranking: Haystack & HuggingFace
- Ranking with BERT
- Web Mining
- Analytics on Web Usage Data: query log mining for recommendation
- Methods for Query Suggestion
- Query Suggestion in Python and ElasticSearch
Python libraries:
- NLTK
- SpaCy
- Scikit-learn
- GenSim
- VADER
- Keras
- Pytorch
- Haystack
- Huggingface Transformers
- ElasticSearch
- LightGBM
- BASH
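The idea of mining a query log for query suggestion can be sketched in a few lines. The example below is a deliberately simple baseline, not the method taught in the module: it counts which query users issued right after another one within the same session. The log and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical query log: each inner list is one user session, in order.
sessions = [
    ["python", "python pandas", "pandas merge"],
    ["python", "python pandas", "pandas groupby"],
    ["python", "python numpy"],
]


def build_suggestions(sessions):
    """Count, for each query, which queries users issued right after it."""
    followers = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            followers[current][following] += 1
    return followers


def suggest(followers, query, k=2):
    """Return up to k most frequent follow-up queries for `query`."""
    return [q for q, _ in followers.get(query, Counter()).most_common(k)]


followers = build_suggestions(sessions)
print(suggest(followers, "python"))  # → ['python pandas', 'python numpy']
```

A real system would also have to handle session segmentation, rare queries and ranking of candidates, which this sketch ignores.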
The module introduces the ethical and legal notions of privacy, anonymity, transparency and discrimination, also in light of the General Data Protection Regulation. It presents technologies for implementing the privacy-by-design principle, for auditing predictive models, and for protecting users' rights, with the goal of enabling Big Data analysis while guaranteeing personal data protection, transparency and non-discrimination.
At the end of the course the student will be able to analyze the ethical issues in a knowledge discovery process, also with reference to the EU legal framework, and will acquire knowledge of some of the available tools for assessing ethical issues.
- Lesson 1
- Introduction to Big Data Ethics
- The European legal framework
- Lesson 2
- Privacy-by-Design in Big Data Analytics
- Data Protection, Privacy and Privacy Models
- Lesson 3
- Privacy Risk Assessment & Prediction
- Privacy-Protection Techniques
- Privacy Assessment in Machine Learning
- Lesson 4
- Introduction to biases and understanding biases
- Understanding, Testing, Discovering and Mitigating Discrimination
- Lesson 5
- Introduction to Explainable AI
- Explanation Techniques
- pandas
- sklearn
- numpy
- seaborn
- matplotlib
- fairlearn
- lime
- dalex
- shap
- lore
- scikit-mobility
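As one concrete instance of a privacy model, the sketch below checks k-anonymity: a dataset is k-anonymous with respect to a set of quasi-identifiers if every combination of their values is shared by at least k records. The records and attribute names are invented, and this is only one of several privacy models the module may cover.

```python
from collections import Counter

# Hypothetical records; age_range and zip_code act as quasi-identifiers.
records = [
    {"age_range": "20-29", "zip_code": "561**", "diagnosis": "flu"},
    {"age_range": "20-29", "zip_code": "561**", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip_code": "561**", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_code": "561**", "diagnosis": "diabetes"},
    {"age_range": "30-39", "zip_code": "561**", "diagnosis": "flu"},
]


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


k = k_anonymity(records, ["age_range", "zip_code"])
print(k)  # → 2
```

The smaller the value of k, the easier it is to single out an individual from the quasi-identifiers alone.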
The module is organised in lectures on case studies and real applications showing the use of Big Data analytics and Social Mining. These lectures describe activities of the SoBigData.eu laboratory and of the companies and institutions that are partners of the Master.
The module presents the characteristics and peculiarities of "big data", highlighting through specific use cases the growing importance of the ability to extract significant information and valuable insights from this enormous amount of heterogeneous data (for example data from sensors, purchase data and consumption, data from social media and social networks, open data, etc.). The participatory methods of data collection through crowdsourcing and crowdsensing systems are also discussed, showing popular examples of application of these concepts. The practical part will instead focus on data ingestion by presenting data crawling and scraping methodologies with concrete examples on Social Media and the Web, as well as on the use of pre-compiled publicly available datasets.
Prerequisites: Python
- Theoretical knowledge:
- Characterization of "big data" and the potential obtainable in terms of knowledge resulting from their analysis
- Data characterization: open sources, closed sources, open data and linked open data. Data collection or development of specific services that exploit groups of users (crowdsensing, crowdsourcing).
- HTML/CSS technologies underlying the functioning of the Web
- REST architectures
- Social media with focus on Twitter and Reddit: analysis of the main characteristics of social networks and high-level overview of the available APIs.
- Practical knowledge:
- Use of HTML tags and CSS selectors for creating web pages.
- Website scraping with concrete examples using the Selenium and Beautiful Soup libraries
- Social media crawling with concrete examples using the Reddit API through the PRAW library.
- Parsing of data in CSV/JSON format
- Lesson 1
- Introduction to big data and the various data sources that characterize them
- Open data and linked open data, crowdsourcing and crowdsensing
- Big data analytics: interesting use cases
- Lesson 2
- Social media crawling: REST architecture and OAUTH authentication framework, Twitter and Reddit overview
- Introduction to using the PRAW library for data access to Reddit + exercises with PRAW
- Lesson 3
- Exercises with PRAW
- Introduction to HTML/CSS technologies
- Lesson 4
- HTML/CSS exercises
- Introduction to Web scraping in Python: Selenium and Beautiful Soup
- Lesson 5
- Exercises on Selenium
- Lesson 6
- Exercises on BeautifulSoup
- CSV/JSON data parsing
- Exam
- Selenium
- Beautiful Soup
- PRAW
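The core idea of scraping, walking the HTML tags of a page and pulling out the attributes of interest, can be illustrated without any third-party package. The sketch below uses the standard library's html.parser as a stand-in for Beautiful Soup; the HTML snippet and the LinkExtractor class are made up for the example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page content; a real scraper would download it first.
html = """
<html><body>
  <a href="https://example.com/a">First</a>
  <p>No link here</p>
  <a class="nav" href="/b">Second</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['https://example.com/a', '/b']
```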
The module presents the methodological aspects, technologies and systems for designing, populating and querying Data Warehouses for decision support. The emphasis is placed on the analysis of application problems using examples and case studies, with laboratory exercises.
Prerequisites: knowledge of basic SQL, Excel, Python programming.
The student will acquire knowledge and skills on the main Business Intelligence technologies such as ETL (Extract, Transform and Load), Data Warehousing, Analytic SQL and OLAP (Online Analytical Processing). The course also touches on scalability issues and NoSQL architectures.
- Lesson 1: Introduction to Data Warehousing
- OLAP vs. OLTP
- Design phases of a DW
- Data model: logic model
- Case Studies
- Lesson 2: Analytical SQL
- ROLLUP and CUBE
- OVER clause
- Windowing
- SQL Server tutorials
- Lesson 3: Extract Transform and Load (ETL)
- RDBMS access standard
- ETL operations: control flow and data flow
- The SSIS System: SQL Server Integration Services
- Tutorials in SSIS
- Lesson 4: Online Analytical Processing (OLAP)
- The multidimensional model
- The SSAS system: SQL Server Analysis Services
- Reporting: Microsoft Power BI
- SSAS/Power BI tutorials
- Lesson 5: Scalability and API
- Scalability of DW systems
- NoSQL Data Model
- NoSQL Big Data Platforms
- Python API for SQL and NoSQL
- pyodbc
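The OVER clause listed under Analytical SQL can be tried out with a small running-total query. This sketch assumes Python's built-in sqlite3 module (bundled SQLite 3.25 or later, which supports window functions) as a stand-in for SQL Server accessed through pyodbc; the sales table is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, month INTEGER, amount INTEGER);
    INSERT INTO sales VALUES
        ('North', 1, 100), ('North', 2, 150), ('North', 3, 50),
        ('South', 1, 80),  ('South', 2, 120);
""")

# Running total of sales per region, ordered by month within each region.
rows = conn.execute("""
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window function keeps one output row per input row and adds the aggregate alongside it.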
The formidable advances in computing power, data acquisition, data storage and connectivity have created unprecedented amounts of data. Data mining, i.e., the science of extracting knowledge from these masses of data, has therefore established itself as an interdisciplinary branch of computer science. Data mining techniques have been applied to many industrial, scientific, and social problems, and are expected to have an ever deeper impact on society. In addition, the wide availability of data has made it possible to build highly accurate predictive models through Machine Learning techniques. The course objective is to provide an introduction to the basic concepts of data mining and machine learning and to the process of extracting knowledge, with insights into analytical and predictive models and the most common algorithms.
At the end of the course the student will be able to:
- Design a KDD process
- Apply the different data mining & machine learning techniques on the basis of the analytical question to be answered
- Use data mining & machine learning tools and Python libraries
- Simulate how the data mining & machine learning algorithms work
- Select the best algorithm for the right problem setting
- Lesson 1
- Introduction to Data Mining
- Data Understanding
- Lesson 2
- Data Preparation & Features Engineering
- Data Similarity Measures
- Lesson 3
- Introduction to Clustering
- Clustering Evaluation
- K-Means
- Lesson 4
- Density-based Clustering: DBSCAN & OPTICS
- Hierarchical Clustering: Max-Linkage & Min-Linkage
- Lesson 5
- Introduction to Machine Learning
- The Classification Problem
- Classification Evaluation Measures
- Lesson 6
- K Nearest Neighbor Classifier
- Lesson 7
- Decision Tree Classifier
- Lesson 8
- Support Vector Machines
- Lesson 9
- Random Forest Classifier
- Lesson 10
- Machine Learning Models for Regression
- numpy
- matplotlib
- pandas
- scipy
- sklearn
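To "simulate how the algorithms work", it helps to write one by hand. The following is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python, assuming fixed initial centroids so that the run is reproducible; in practice one would use sklearn, and the data points here are invented.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and centroid update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The result depends on the initial centroids, which is why library implementations typically run several random initialisations.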
The Data Visualization and Visual Analytics course provides a comprehensive introduction to producing effective and efficient visualizations and to storytelling through data visualization. During the course, the students will explore the basics of visual encoding, the mapping of data onto visual variables, and visual analytics techniques.
- How to encode data and models in an efficient and effective visualization, limiting the impact of cognitive biases.
- How to design and encode a visual representation through modern data visualization libraries
- Introduction to Data Visualization, basic concepts of visual perception, Visual Variables
- Use cases of good and bad practices of visualizations.
- Introduction to the library Altair
- Visual Variables and Scales in Altair
- Visualization of Geographical Data with Folium
- Color models and color scales
- Data Publishing: Principles of Web Application Design, A sample of a web application layout
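A color scale is a visual encoding that maps a data value to a color. As a rough illustration, the sketch below builds a linear scale by interpolating between two RGB colors; the function name and its parameters are hypothetical, and libraries such as Altair provide their own scale machinery.

```python
def linear_color_scale(value, domain, start_rgb, end_rgb):
    """Map a value in `domain` to an RGB color by linear interpolation."""
    low, high = domain
    t = (value - low) / (high - low)
    t = max(0.0, min(1.0, t))  # clamp values that fall outside the domain
    return tuple(round(s + t * (e - s)) for s, e in zip(start_rgb, end_rgb))


white, blue = (255, 255, 255), (0, 0, 255)
print(linear_color_scale(0, (0, 100), white, blue))    # → (255, 255, 255)
print(linear_color_scale(100, (0, 100), white, blue))  # → (0, 0, 255)
```

Interpolating directly in RGB is simple but not perceptually uniform, which is one reason to study color models before choosing a scale.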
Building on the innovation management literature, this course aims to provide a broad and up-to-date understanding of the multi-level key issues regarding firms' data-driven innovation processes. More specifically, the course presents how big data can drive companies' innovation processes. After a preliminary discussion of the key aspects that characterize companies' innovation processes, emphasis will be placed on practical tools such as the business model canvas. Then, the focus will shift to the new opportunities for innovation made possible by recent advances in data collection and data processing techniques for big data. Finally, the key concepts and models of innovation will be re-interpreted by exploiting the potential of Big Data to open up new business opportunities. This course is based on several hands-on activities and will host a testimonial from a big data company. The main objectives of this course are:
- To provide an overview of the main theoretical frameworks and analytical tools needed to disentangle the key managerial concerns behind innovation management and their overall impact on a firm's organization and performance;
- To equip participants with practical tools that are important for developing the business model of a company in the big data era.
- Demonstrate knowledge and understanding of the theoretical frameworks and practical tools for the study and analysis of the sources, types, patterns, and management of innovation;
- Analyze and critically discuss the main issues in innovation management in the light of real business case examples and testimonials;
- Discuss information, ideas, problems, and practical solutions in the field of data driven innovation management;
- Lecture 1
- Overview of the course
- Basic notions of innovation (definitions, measures, and sources of innovation)
- Business Model Innovation: the importance of the Business Model Canvas
- Hands-on activity: group work on the Business Model Canvas
- Lecture 2
- The role of Big Data for innovation processes and products
- Business model canvas in Big Data companies
- Hands-on activity: group work on the business model canvas in Big Data companies
- Lecture 3
- Hands-on activity: Big Data company testimonial
- Wrap up questions
The module presents the methodological aspects, technologies and systems for designing predictive systems of Artificial Intelligence through machine learning and deep neural networks. The emphasis is placed on the analysis of application problems using examples and case studies, with practical exercises.
Prerequisites: Python & Data Mining & Machine Learning
The student will acquire knowledge and skills on the main technologies for machine learning through deep neural networks. The student will also be introduced to the application problems of Artificial Intelligence and gain the basic knowledge needed to apply these methodologies to new problems.
- Lecture 1: Fundamentals of Machine Learning for AI
- Introduction to the course
- Machine learning and AI
- Machine learning paradigms
- Model Selection
- Hands-on Session for data processing with Numpy / Scikit-learn
- Lecture 2: Fundamentals of Neural Networks
- Biological and Artificial Neuron
- Logistic regression as a neural network with 1 Neuron
- Hands-on Session with Numpy
- Lecture 3: Neural Network Training
- Optimization algorithms
- Stochastic-Gradient Descent (SGD) and Backpropagation
- Tricks and Tips for NNs training
- Hands-on Session with Keras (training) and Tensorboard
- Lecture 4: Multi-layer Perceptron (4 hours)
- From shallow networks to deep learning
- MLP and Deep Feedforward Networks
- Solving an image classification problem with Keras
- Lecture 5: Convolutional Neural Networks (CNN) (4 hours)
- Visual processing with neural networks
- Convolutions and CNN building blocks
- Hands-on Session with Keras and CNNs
- Lecture 6: Recurrent Neural Networks (RNN) (4 hours)
- Vanilla Recurrent Networks
- Gated Recurrent Models (LSTM)
- Hands-on Session with Keras for Sequence Classification
- Lecture 7: Neural Networks Applications (4 hours)
- Deep neural networks tools and libraries (Tensorflow and Pytorch)
- NNs for Computer Vision Applications (Recognition, Detection and Segmentation)
- NNs for Time-series (forecasting and classification)
- Use of pre-trained models in Keras
- Lecture 8: Advanced Deep Learning topics (4 hours)
- Autoencoders
- Generative models
- Continual Learning
- Recent applications
- Lecture 9: Application of AI-based Deep Learning Methods
- Tensorflow
- Keras
- Pytorch
- Scikit-learn
- Numpy
- Pandas
- Matplotlib
- Seaborn
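The "logistic regression as a neural network with 1 neuron" topic of Lecture 2 can be sketched in plain Python. The example below trains a single sigmoid neuron with full-batch gradient descent (not the stochastic variant) on a tiny invented dataset; the learning rate and number of epochs are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(xs, ys, lr=0.5, epochs=2000):
    """Fit weight w and bias b of a single sigmoid neuron by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [sigmoid(w * x + b) for x in xs]
        # Gradients of the mean cross-entropy loss with respect to w and b.
        grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(p - y for p, y in zip(preds, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy one-dimensional problem: small inputs are class 0, large inputs class 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 1]
w, b = train_neuron(xs, ys)
predictions = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
print(predictions)  # → [0, 0, 1, 1]
```

Frameworks such as Keras or Pytorch compute the same gradients automatically through backpropagation.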
The course introduces the design, implementation and analysis of Information Retrieval systems that are efficient and effective in managing and searching information stored as collections of texts, possibly unstructured (e.g. the Web), and as labeled graphs (e.g. knowledge graphs). The theoretical lessons describe the main components of a modern Information Retrieval system, more precisely of a search engine: crawler, text analyzer, storage and compressed index, query solver, text annotator (based on knowledge graphs and entity linkers), and rankers. The laboratory lessons put into practice what has been learned in theory with the help of well-known software tools: ElasticSearch (an open-source search engine), Neo4J (a graph database), and TagMe and Swat (two entity annotators). The exam consists of a written test, which evaluates the knowledge acquired in both the theoretical and the laboratory lessons (weight 60%), and of a software project shared with several other courses (weight 40%), whose objective is to evaluate the technical skills in using the aforementioned libraries.
Prerequisites: Basic notions of algorithms, programming and use of Python programming environments.
- Lesson 1
- Introduction to IR and history of search engines
- The structure of a search engine
- Inverted lists and query resolution for AND and soft-AND, phrase, proximity, and zone
- Lesson 2
- The Web graph, its structure (bow tie), its properties, its representation in memory and examples of graph traversal algorithms (BFS and DFS).
- Lesson 3
- The crawling module, the parsing module, keyword extraction (with PoS tag, Rake, and statistics).
- Lesson 4
- Building a text parser in Python (tokenization, stopword removal, normalization and stemming), and a word cloud.
- Lesson 5
- The first generation of search engines (Altavista, Lycos,…)
- The laws of Zipf, Heaps and Luhn
- Textual ranking: Jaccard and TF-IDF
- The vector space model and cosine similarity
- Text spam.
- Lesson 6
- ElasticSearch.
- Lesson 7
- The second generation of search engines (Google et al.)
- Ranking based on the Web graph, random walk and PageRank, Topic-based and Personalized PageRank.
- Evaluation of a search engine: precision, recall and F1.
- Lesson 8
- Knowledge graph and latest generation search engines
- Entity linkers and semantic text annotation TagMe
- Entity linker applications: keyword extraction, representation and comparison of texts using labeled and weighted graphs, reasoning.
- Use of TagMe library and Swat library.
- Lesson 9
- Definition, properties and functionality of GraphDBs using the Neo4J library.
- requests
- elasticsearch
- nltk
- wordcloud
- matplotlib
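Ranking based on the Web graph can be illustrated with a compact PageRank computed by power iteration. This is a textbook-style sketch, assuming a damping factor of 0.85 and a tiny hypothetical graph; it is not the implementation used by any real search engine.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration for PageRank; `graph` maps each node to the nodes it links to."""
    nodes = list(graph)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node, out_links in graph.items():
            if out_links:
                share = damping * ranks[node] / len(out_links)
                for target in out_links:
                    new_ranks[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_ranks[target] += damping * ranks[node] / n
        ranks = new_ranks
    return ranks


# Hypothetical three-page Web: A links to B and C, B links to C, C links back to A.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C ends up with the highest score because it collects links from both A and B.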
The master requires an internship to be carried out at one of the partners (companies or institutions) or at the company where the student is currently employed, on the basis of a well-defined project work and under the supervision of a team of tutors composed of instructors and company experts. The internship might require in-person work at the partners' offices or hybrid solutions with smart working.
Thesis report of the activities done during the internship including the results achieved.
In this module groups of students will be guided to design and develop an entire project in Big Data and AI: from data collection to the final delivery. The students will employ in the project methods, techniques and tools studied in the other modules. The duration of this module, differently from the others, will span across several months until the end of the lectures when the results of the project will be presented in front of a committee.
Putting together all the competencies learned during the Master.
- Big Project structure and activities presentation
- Project assignments and definitions
- Project Proposal
- Data collection
- Data analysis
- Project Verification
- Interviews
- Workflows design and models training
- Results evaluation
- Results dissemination
- Website
- Project Evaluation
Over the past decade, there has been a growing public fascination with the complex “connectedness” of modern society. This connectedness is found in many contexts: in the rapid growth of the Internet and the Web, in the ease with which global communication now takes place, and in the ability of news and information as well as epidemics and financial crises to spread around the world with surprising speed and intensity. These are phenomena that involve networks and the aggregate behavior of groups of people; they are based on the links that connect us and the ways in which each of our decisions can have subtle consequences for the outcomes of everyone else. This crash course is an introduction to the analysis of complex networks, made possible by the availability of big data, with a special focus on the social network and its structure and function. Drawing on ideas from computing and information science, complex systems, mathematical and statistical modeling, economics, and sociology, this course briefly outlines the emerging field of study that is growing at the interface of all these areas, addressing fundamental questions about how the social, economic, and technological worlds are connected.
Prerequisites: Python, Data Mining
Complex networks modeling and analysis
- Lecture 1: Intro: Why should we care about Complex Networks? Networks & Graphs: Basic Measures
- Lecture 2: Random Networks, Small World property, Scale Free networks
- Lecture 3: Measuring Node Centrality & Tie Strength
- Lecture 4: Community Detection
- Lecture 5: Resilience to attacks and failures
- Lecture 6: Epidemics
- networkx
- cdlib
- ndlib
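Two of the basic network measures, degree centrality and the local clustering coefficient, are easy to compute by hand on a small graph. The sketch below uses a plain adjacency dictionary instead of networkx; the friendship network is invented.

```python
from itertools import combinations

# Hypothetical undirected friendship network as an adjacency dictionary.
graph = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob"},
    "dan": {"ann"},
}


def degree_centrality(graph):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    return {node: len(neigh) / (n - 1) for node, neigh in graph.items()}


def clustering_coefficient(graph, node):
    """Fraction of pairs of neighbours of `node` that are themselves connected."""
    neigh = graph[node]
    if len(neigh) < 2:
        return 0.0
    pairs = list(combinations(sorted(neigh), 2))
    closed = sum(1 for u, v in pairs if v in graph[u])
    return closed / len(pairs)


print(degree_centrality(graph)["ann"])       # → 1.0
print(clustering_coefficient(graph, "bob"))  # → 1.0
```

Here "ann" is the most central node, while "bob" has a fully connected neighbourhood because its two neighbours know each other.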
The course introduces the student to the main concepts of statistical analysis, the methods used and the software implementations to carry out a quantitative and rigorous study of a dataset. After introducing the basic tools of descriptive statistics, the course focuses on probabilistic statistics and its use for data modelling, estimation methods through an inferential approach and statistical hypothesis testing. The course also introduces the concepts of linear and logistic regression (also multivariate) and the computational bootstrap techniques for estimating parameters and confidence intervals.
Prerequisites: Python
- Know how to use, and understand, the main tools of descriptive and probabilistic statistics.
- Know how to conduct a statistical analysis of a dataset
- Build a probabilistic model, estimate the model parameters, assess its goodness of fit and use it for prediction.
- Lesson 1: Introduction to data and descriptive statistics
- Lesson 2: Basic concepts of probability and use for data modeling
- Lesson 3: Statistical inference. Statistical estimation methods. Correlation and dependence
- Lesson 4: Simple and multiple regression, logistic regression and classification
- Lesson 5: Confidence intervals. Bootstrap
- Lesson 6: Hypothesis testing
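The bootstrap idea from Lesson 5 can be sketched with the standard library alone. The example below estimates a percentile confidence interval for the mean by resampling with replacement; the sample values, the number of resamples and the fixed seed are arbitrary choices for illustration.

```python
import random
import statistics


def bootstrap_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
sample = [4.1, 5.0, 5.2, 4.8, 5.5, 4.9, 5.1, 4.7, 5.3, 5.0]
low, high = bootstrap_ci(sample)
print(low, statistics.fmean(sample), high)
```

The same recipe works for other statistics (median, regression coefficients) by replacing the mean inside the resampling loop.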
The course will deal with time series and spatio-temporal data, in particular mobility. We will illustrate the fundamental characteristics of these two data classes as well as the most common pre-processing and analysis methods. Finally, each lesson will provide examples of use and exercises carried out in Python with the appropriate libraries.
Prerequisites: Data Mining & Machine Learning, Python
- Knowledge of the fundamental characteristics of various data sources for time series and mobility
- Knowledge of basic analytical methods (predictive, clustering and pattern) for time series and mobility data
- Ability to implement simple analytical processes in Python for time series and mobility data
- Lesson 1: Time Series: characteristics and similarity measures
- Lesson 2: Time Series: patterns (Motifs, Discords, Sequential patterns)
- Lesson 3: Time Series: forecasting
- Lesson 4: Introduction to Geospatial Analytics and fundamental concepts
- Lesson 5: Geospatial and Mobility data preprocessing and semantic enrichment
- Lesson 6: Individual & Collective mobility laws and models
- Lesson 7: Mobility Patterns and Location prediction
- scikit-mobility
- pandas
- tslearn
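Among time series similarity measures, dynamic time warping (DTW) is a common choice because it tolerates local stretching along the time axis. The following is a minimal quadratic-time sketch in plain Python, given as one example of such a measure; libraries such as tslearn offer optimised versions, and the series below are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cost of the best alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


series = [1, 2, 3, 2, 1]
stretched = [1, 1, 2, 3, 3, 2, 1]  # same shape, locally slowed down
shifted = [2, 3, 4, 3, 2]          # same shape, one unit higher

print(dtw_distance(series, stretched))  # → 0.0
print(dtw_distance(series, shifted))    # → 3.0
```

A point-by-point Euclidean comparison cannot even be applied to `series` and `stretched`, since they have different lengths, while DTW recognises them as the same shape.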