# Probability and Statistics Seminar

## Past

- 19/09/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Luísa Canto e Castro Loura,*Direção-Geral de Estatísticas da Educação e Ciência* -
### Education: a challenge to statistical modelling

In Education, the main random variable is, in general, the knowledge level of the student on a certain subject. This is generally considered to be the result of a process that involves one or more educational agents (e.g. teachers) and the reflection and training of the student. There are many covariates, including innate capacities, resilience, perseverance, family and health conditions, and quality of education.

Although this is a subject about which all of us have some a priori knowledge (essential in statistical modelling), it poses an initial difficulty that may lead to significant bias and wrong interpretations: the measurement of the main random variable. In educational psychometrics, the measurement tools are tests or examinations, and the main variable is a latent variable.

In this talk we shall review some recent advances in statistics applied to education, namely to the international evaluation of students and to the assessment of the impact of educational policies. We also review the main databases of DGEEC (Direção-Geral de Estatísticas da Educação e Ciência) and their use for research purposes.

- 23/05/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Pedro Macedo,*CIDMA and Department of Mathematics, University of Aveiro* -
### Some hot topics of maximum entropy research in economics and statistics

Maximum entropy is often used for solving ill-posed problems that occur in diverse areas of science (e.g., physics, informatics, biology, medicine, communication engineering, statistics and economics). The works of Kullback, Leibler, Lindley and Jaynes in the 1950s were fundamental in connecting the areas of maximum entropy and information theory with statistical inference. Jaynes states that the maximum entropy principle is a simple and straightforward idea. Indeed, it provides a simple tool to make the best prediction (i.e., the one that is the most strongly indicated) from the available information, and it can be seen as an extension of Bernoulli's principle of insufficient reason. The maximum entropy principle provides an unambiguous solution for ill-posed problems by choosing the probability distribution that maximizes the Shannon entropy measure. Some recent research in regularization (e.g., ridGME and MERGE estimators), variable selection (e.g., normalized entropy with information from the ridge trace), inhomogeneous large-scale data (e.g., normalized entropy as an alternative to maximin aggregation) and stochastic frontier analysis (e.g., generalized maximum entropy and generalized cross-entropy with data envelopment analysis as an alternative to maximum likelihood estimation) will be presented, along with several real-world applications in engineering, medicine and economics.
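As a small illustration of the principle described in this abstract (not taken from the talk), the sketch below finds the maximum-entropy distribution on the faces of a die when only the mean is known. It assumes the standard result that the solution has the exponential form $p_i \propto e^{\lambda x_i}$; the function name `maxent_distribution` and the bisection bounds are illustrative choices.

```python
import math


def maxent_distribution(values, target_mean, tol=1e-12):
    """Maximum-entropy probabilities on `values` subject to a fixed mean.

    Assumes the exponential-family form p_i proportional to exp(lam * x_i);
    lam is found by bisection, since the implied mean is increasing in lam.
    """
    def probabilities(lam):
        # Subtract the largest exponent for numerical stability.
        shift = max(lam * x for x in values)
        weights = [math.exp(lam * x - shift) for x in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean_for(lam):
        return sum(p * x for p, x in zip(probabilities(lam), values))

    lo, hi = -50.0, 50.0  # illustrative search bounds for the multiplier
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_for(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return probabilities((lo + hi) / 2)


faces = [1, 2, 3, 4, 5, 6]
probs = maxent_distribution(faces, target_mean=4.5)
```

With a target mean of 3.5 the constraint carries no extra information and the result is the uniform distribution, which is the sense in which the principle extends the principle of insufficient reason.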

- 16/05/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Aurelio Tobias,*Institute of Environmental Assessment and Water Research and CSIC, Barcelona* -
### Statistical models for the relationship between daily temperature and mortality

The association between daily ambient temperature and health outcomes has been frequently investigated based on a time series design. The temperature–mortality relationship is often found to be substantially nonlinear and to persist, but change shape, with increasing lag. Thus, the statistical framework has developed substantially in recent years. In this talk I describe the general features of time series regression, outlining the analysis process to model short-term fluctuations in the presence of seasonal and long-term patterns. I also offer an overview of the recently extended family of distributed lag non-linear models (DLNM), a modelling framework that can simultaneously represent non-linear exposure–response dependencies and delayed effects. To illustrate the methodology, I use an example to represent the relationship between temperature and mortality, using data from the MCC Collaborative Research Network, an international research program on the association between weather and health.

- 18/04/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Cláudia Nunes,*DM-IST; CEMAT* -
### On a class of optimal stopping problems with applications to real option theory

We consider an optimal stopping time problem related to many models found in real options problems. The main goal of this work is to bring to the field of real options different and more realistic pay-off functions, as well as negative interest rates. Thus, we present analytical solutions for a wide class of pay-off functions, under quite general assumptions on the model. We also provide an extensive and general sensitivity analysis of the solutions, and an economic example which highlights the mathematical difficulties in the standard approaches.

(joint work with Manuel Guerra and Carlos Oliveira)

- 28/03/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Susana Barbosa,*INESC TEC, Centre for Information Systems and Computer Graphics, Porto* -
### Applied environmental time series analysis

Since the very beginning, developments in data analysis and time series methods have been closely associated with phenomena in the natural environment, from the power spectra of Schuster motivated by earthquakes and sunspots, Walker's analysis of the Southern Oscillation in attempting to predict the Indian monsoon, Tukey's cross-spectrum for the analysis of seismic waves, Gumbel's extreme value analysis inspired by meteorological and hydrological phenomena, to Hurst's long memory concept from observations of the Nile's water levels. In the modern era of disposable, low-power devices and cheap storage, the natural environment is monitored at an incredible pace, yielding copious time series of high-resolution (sub-hourly) observations which are widely available. Despite the many computationally intensive approaches developed and currently available to handle streams of data, their success in producing new and environmentally relevant information is surprisingly low. An obvious challenge is the integration of problem-related knowledge and context in the data analysis process, and going from data summaries/visualisations/alarms to an exploratory analysis aiming to discover new and physically relevant information from the environmental data. This talk addresses the practical challenges and opportunities in the analysis of high-resolution environmental time series, as illustrated by time series of environmental radioactivity and by measurements from the ongoing gamma radiation monitoring campaign in the Azores.

- 21/03/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Alexandra Monteiro,*CESAM, Department of Environment and Planning, University of Aveiro* -
### Air quality science: putting statistics to work

Several statistical tools have been used to analyse air quality data for different purposes. This talk will highlight some of these examples and show how the different statistical tools can bring added value to this area of environmental science. First, changes in pollutant concentrations were examined and clustered by means of quantile regression, which allows trends to be analysed not only in the mean but across the whole data distribution. The clustering procedure indicated where the largest trends are found, in terms of space (location) and quantiles. Secondly, the resulting individual variance/covariance profiles of a set of hourly air quality time series are embedded in a wavelet decomposition-based clustering algorithm in order to identify groups of stations exhibiting similar profiles. The results clearly indicate a geographical pattern among different types of stations and made it possible to identify sites whose classification according to environment/influence type needs revision. Both exercises were particularly important for air quality management practices, in particular regarding the design of the national monitoring network.

- 14/03/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Sandra Dias,*CMAT - Pólo UTAD and CEMAT* -
### The max-semistable laws: characterization, estimation and testing

In this talk we present the class of max-semistable distribution functions that appear as the limit, in distribution, of the maximum, suitably centered and normalized, of $k_n$ independent and identically distributed random variables, where $k_n$ is an integer-valued geometric sequence with ratio $r$ (greater than or equal to $1$). This class of distributions includes all the max-stable distributions but also multimodal distributions and discrete distributions. We will characterize the max-semistable laws, discuss the estimation of the parameters and the fractal component, and propose a test that allows us to distinguish between max-stable and max-semistable laws.

Joint work with Luísa Canto e Castro and Maria da Graça Temido.

- 07/03/2017, 11:00 — 12:00 — Room P3.10, Mathematics Building

Manuel Cabral Morais,*DM-Instituto Superior Técnico; CEMAT* -
### Comparison of joint schemes for multivariate normal i.i.d. output

The performance of a product frequently relies on more than one quality characteristic. In such a setting, joint control schemes are used to determine whether or not we are in the presence of unfavorable disruptions in the location and spread of a vector of quality characteristics. A common joint scheme for multivariate output comprises two constituent control charts: one for the mean vector based on a weighted Mahalanobis distance between the vector of sample means and the target mean vector; another one for the covariance matrix depending on the ratio between the determinants of the sample covariance matrix and the target covariance matrix. Since we are well aware that there are plenty of quality control practitioners who are still reluctant to use sophisticated control statistics, this paper tackles Shewhart-type charts for the location and spread based on a few pairs of control statistics that depend on the nominal mean vector and covariance matrix. We recall or derive the joint probability density functions of these pairs of control statistics in order to investigate the impact on the ability of the associated joint schemes to detect shifts in the process mean vector or covariance matrix for various out-of-control scenarios.

Joint work with Wolfgang Schmid, Patrícia Ferreira Ramos, Taras Lazariv, António Pacheco.

- 13/12/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Alexandra Ramos,*Faculdade de Economia da Universidade do Porto* -
### Modelling extremal temporal dependence in stationary time series

Extreme value theory concerns the statistical study of the extremal properties of random processes. The most common problems treated by extreme value methods involve modeling the tail of an unknown distribution function from a set of observed data with the purpose of quantifying the frequency and severity of events more extreme than any that have been observed previously. A fundamental issue in applied multivariate extreme value (MEV) analysis is modelling dependence within joint tail regions. In this seminar we suggest modelling the joint tails of the distribution of consecutive pairs $(X_i, X_{i+1})$ of a first-order stationary Markov chain by a dependence model described in Ramos and Ledford (2009). Applications of this modelling approach to real data are then considered.

Ramos and Ledford (2009). A new class of models for bivariate joint tails. J. R. Statist. Soc., B. 71. p. 219-241.

- 22/11/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Sónia Gouveia,*Institute of Electronics and Informatics Engineering and Centre for R&D in Mathematics and Applications, University of Aveiro, Portugal* -
### Binary autoregressive geometric modelling in a DNA context

Symbolic sequences occur in many contexts and can be characterized e.g. by integer-valued intersymbol distances or binary-valued indicator sequences. The analysis of these numerical sequences often sheds light on the properties of the original symbolic sequences. This talk introduces new statistical tools to explore the autocorrelation structure in indicator sequences and to evaluate its impact on the probability distribution of intersymbol distances. The methods are illustrated with data extracted from mitochondrial DNA sequences.

This is joint work with Manuel Scotto (IST, Lisbon, Portugal), Christian Weiss (Helmut Schmidt University, Hamburg, Germany) and Paulo Ferreira (DETI, IEETA, Aveiro, Portugal).

- 08/11/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Laurens de Haan,*Erasmus University Rotterdam and CEAUL* -
### On the peaks-over-threshold method in extreme value theory

The origin, the development and the use of the peaks-over-threshold method (in particular in higher-dimensional spaces) will be discussed as well as some issues that need clarification.

- 25/10/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Wolfgang Schmid,*European University, Frankfurt (Oder), Germany* -
### Spatial and Spatio-Temporal Nonlinear Time Series

In this talk we present a new spatial model that incorporates heteroscedastic variance depending on neighboring locations. The proposed process is regarded as the spatial equivalent to the temporal autoregressive conditional heteroscedasticity (ARCH) model. We show additionally how the introduced spatial ARCH model can be used in spatiotemporal settings. In contrast to the temporal ARCH model, in which the distribution is known given the full information set of the prior periods, the distribution is not straightforward in the spatial and spatiotemporal setting. However, it is possible to estimate the parameters of the model using the maximum-likelihood approach. Via Monte Carlo simulations, we demonstrate the performance of the estimator for a specific spatial weighting matrix. Moreover, we combine the known spatial autoregressive model with the spatial ARCH model assuming heteroscedastic errors. Finally, the proposed autoregressive process is illustrated using an empirical example. Specifically, we model lung cancer mortality in 3108 U.S. counties and compare the introduced model with two benchmark approaches.

(joint work with Robert Garthoff and Philipp Otto)

- 11/10/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Cláudia Soares,*Institute for Systems and Robotics* -
### Distributed and robust network localization

Signal processing over networks has been a broad and hot topic in the last few years. In most applications, networks of agents typically rely on known node positions, even if the main goal of the network is not localization. Mobile agents also need localization for, e.g., motion planning or formation control, where GPS might not be an option. In addition, real-world conditions imply noisy environments, and real-time network operation calls for fast and reliable estimation of the agents’ locations. So, galvanized by these compelling applications, researchers have dedicated a great amount of work to locating the nodes in networks. With the growing size of networks of devices constrained in energy expenditure and computation power, the need for simple, fast, and distributed algorithms for network localization spurred this work. Here, we approach the problem starting from minimal data collection, aggregating only range measurements and a few landmark positions. We explore tailored solutions, resorting to optimization and probability tools that can leverage performance under noisy and unstructured environments. Thus, the main contributions are:

- Distributed localization algorithms characterized by their simplicity but also strong guarantees;
- Analyses of convergence, iteration complexity, and optimality bounds for the designed procedures;
- Novel majorization approaches which are tailored to the specific problem structure.

- 27/09/2016, 11:00 — 12:00 — Room P3.10, Mathematics Building

Manuel Cabral Morais,*DM-IST; CEMAT* -
### An ARL-unbiased $np$-chart

We usually assume that counts of nonconforming items have a binomial distribution with parameters $(n,p)$, where $n$ and $p$ represent the sample size and the fraction nonconforming, respectively.

The non-negative, discrete and usually skewed character and the target mean $(np_0)$ of this distribution may prevent the quality control engineer from dealing with a chart to monitor $p$ with: a pre-specified in-control average run length (ARL), say $1/\alpha$; a positive lower control limit; the ability to control not only increases but also decreases in $p$ in an expedient fashion. Furthermore, as far as we have investigated, the $np$- and $p$-charts proposed in the Statistical Process Control literature are ARL-biased, in the sense that they take longer, on average, to detect some shifts in the fraction nonconforming than to trigger a false alarm.

Having all this in mind, this paper explores the notions of uniformly most powerful unbiased tests with randomization probabilities to eliminate the bias of the ARL function of the $np$-chart and to bring its in-control ARL exactly to $1/\alpha$.
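The ARL bias mentioned in this abstract can be seen numerically with a minimal sketch. It assumes a textbook Shewhart $np$-chart with 3-sigma limits, which is an illustrative choice and not necessarily one of the charts studied in the talk; the randomized unbiased design proposed by the author is not implemented here. The function name `arl_np_chart` and the values $n = 100$, $p_0 = 0.05$ are hypothetical.

```python
from math import ceil, comb, floor, sqrt


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def arl_np_chart(n, p0, p):
    """ARL of a 3-sigma np-chart designed for p0 when the true fraction is p.

    Assumption: a signal is triggered when the count falls outside
    [n*p0 - 3*sigma, n*p0 + 3*sigma], with sigma computed under p0.
    """
    centre = n * p0
    sigma = sqrt(n * p0 * (1 - p0))
    k_min = max(0, ceil(centre - 3 * sigma))
    k_max = min(n, floor(centre + 3 * sigma))
    prob_no_signal = sum(binom_pmf(k, n, p) for k in range(k_min, k_max + 1))
    return 1.0 / (1.0 - prob_no_signal)


n, p0 = 100, 0.05
arl_in_control = arl_np_chart(n, p0, p0)
arl_after_decrease = arl_np_chart(n, p0, 0.04)
arl_after_increase = arl_np_chart(n, p0, 0.07)
```

For these values the lower limit is negative, so the chart cannot signal a decrease at all: the ARL after a downward shift to $p = 0.04$ exceeds the in-control ARL, which is exactly the bias described above.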

- 25/05/2016, 16:00 — 17:00 — Room P4.35, Mathematics Building

Ana Ferreira,*DM-IST; CEAUL & CEMAT* -
### The Block Maxima and POT methods, and an extension of POT to integrated stochastic processes

We shall review the classical maximum domain of attraction condition underlying BM and POT, two fundamental methods in Extreme Value Theory. A theoretical comparison between the methods will be presented.

Afterwards, the maximum domain of attraction condition in a spatial context will be discussed. Then a POT-type result for the integral of a stochastic process satisfying the maximum domain of attraction condition will be obtained.

- 11/05/2016, 16:00 — 17:00 — Room P3.10, Mathematics Building

Christian Weiss,*Dept. of Mathematics and Statistics, Helmut Schmidt Universität* -
### On Eigenvalues of the Transition Matrix of some Count Data Markov Chains

A stationary Markov chain is uniquely determined by its transition matrix, the eigenvalues of which play an important role in characterizing the stochastic properties of a Markov chain. Here, we consider the case where the monitored observations are counts, i.e., having values in either the full set of non-negative integers, or in a finite set of the form $\{0,\dots,n\}$ with a prespecified upper bound $n$. Examples of count data time series as well as a brief survey of some basic count data time series models are provided.

Then we analyze the eigenstructure of count data Markov chains. Our main focus is on so-called CLAR(1) models, which are characterized by having a linear conditional mean, and also on the case of a finite range, where the second largest eigenvalue determines the speed of convergence of the forecasting distributions. We derive a lower bound for the second largest eigenvalue, which often (but not always) even equals this eigenvalue. This becomes clear by deriving the complete set of eigenvalues for several specific cases of CLAR(1) models. Our method relies on the computation of appropriate conditional (factorial) moments.

- 04/05/2016, 16:00 — 17:00 — Room P3.10, Mathematics Building

André Martins,*Unbabel and Instituto de Telecomunicações* -
### From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification

The softmax transformation is a key component of several statistical learning models, encompassing multinomial logistic regression, action selection in reinforcement learning, and neural networks for multi-class classification. Recently, it has also been used to design attention mechanisms in neural networks, with important achievements in machine translation, image caption generation, speech recognition, and various tasks in natural language understanding and computational learning. In this talk, I will describe sparsemax, a new activation function similar to the traditional softmax, but able to output sparse probabilities. After deriving its properties, I will show how its Jacobian can be efficiently computed, enabling its use in a neural network trained with backpropagation. Then, I will propose a new smooth and convex loss function which is the sparsemax analogue of the logistic loss. An unexpected connection between this new loss and the Huber classification loss will be revealed. We obtained promising empirical results in multi-label classification problems and in attention-based neural networks for natural language inference. For the latter, we achieved performance similar to that of the traditional softmax, but with a selective, more compact, attention focus.
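To make the "sparse probabilities" in this abstract concrete, here is a minimal sketch that evaluates sparsemax as the Euclidean projection of a score vector onto the probability simplex, using the usual sort-and-threshold procedure. It is a plain-Python illustration written for this listing, not the speaker's implementation, and it covers only the forward evaluation (no Jacobian or loss).

```python
def sparsemax(z):
    """Euclidean projection of the score vector z onto the probability simplex.

    Sort the scores, find the largest k with 1 + k * z_(k) > sum of the k
    largest scores, and subtract the resulting threshold tau from every score.
    """
    z_sorted = sorted(z, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumulative += z_k
        if 1.0 + k * z_k > cumulative:
            # The condition holds on a prefix of the sorted scores,
            # so the last k that satisfies it gives the support size.
            support_size = k
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size
    return [max(z_i - tau, 0.0) for z_i in z]


probs = sparsemax([1.0, 0.5, -1.0])
```

For the scores `[1.0, 0.5, -1.0]` the threshold is 0.25, so the output is `[0.75, 0.25, 0.0]`: the lowest score receives exactly zero probability, whereas softmax would assign it a small positive mass.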

- 27/04/2016, 16:00 — 17:00 — Room P3.10, Mathematics Building

Maria João Quintão,*CERENA, Instituto Superior Técnico, Universidade de Lisboa* -
### Geostatistical History Matching with Ensemble Updating

In this work, a new history matching methodology is proposed, coupling within the same framework the advantages of using geostatistical sequential simulation and the principles of ensemble Kalman filters: history matching based on ensemble updating. The main idea of this procedure is to use simultaneously the relationship between the petrophysical properties of interest and the dynamical results to update the static properties at each iteration, and to define areas of influence for each well. This relation is established through the experimental non-stationary covariances, computed from the ensemble of realizations. A set of petrophysical properties of interest is generated through stochastic sequential simulation. For each simulated model, we obtain its dynamic responses at the well locations by running a fluid flow simulator over each single model. Considering the normalized absolute deviation between the dynamic responses and the real dynamic response in each well as state variables, we compute the correlation coefficients of the deviations with each grid cell through the ensemble of realizations. Areas of high correlation coefficients are those where the permeability is more likely to play a key role in the production of that given well. Using a local estimation of the response of the deviations, through a simple kriging process, we update the subsurface property of interest at a given location.

- 20/04/2016, 16:00 — 17:00 — Room P3.10, Mathematics Building

Manuel Scotto,*CEMAT and Instituto Superior Técnico, Universidade de Lisboa* -
### Statistical Modeling of Integer-valued Time Series: An Introduction

Modeling and predicting the temporal dependence and evolution of low integer-valued time series have attracted a lot of attention in recent years. This is partially due to the increasing availability of relevant high-quality data sets in various fields of application, ranging from finance and economy to medicine and ecology. It is important to stress, however, that there is no unifying approach applicable to modeling all integer-valued time series and, consequently, the analysis of such time series has to be restricted to special classes of integer-valued models. A useful division of these models can be made into observation-driven and parameter-driven models. A suitable class of observation-driven models is the one including models based on thinning operators. Models belonging to this class are obtained by replacing the multiplication in the conventional time series models by an appropriate thinning operator, along with considering a discrete distribution for the sequence of innovations in order to preserve the discreteness of the counts.

This talk aims at providing an overview of recent developments in thinning-based time series models paying particular attention to models obtained as discrete counterparts of conventional univariate and multivariate autoregressive moving average models, with either finite or infinite support. Finally, we also outline and discuss likely directions of future research.
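The thinning construction described in this abstract can be sketched as follows, assuming the simplest case: a first-order model $X_t = \alpha \circ X_{t-1} + \varepsilon_t$ with binomial thinning and Poisson innovations. The choice of model, the function names and the parameter values are illustrative assumptions, not taken from the talk.

```python
import math
import random


def binomial_thinning(alpha, x, rng):
    """alpha o x: number of the x units that survive, each with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Poisson draw by Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(alpha, lam, length, seed=0):
    """Simulate X_t = alpha o X_{t-1} + eps_t with eps_t ~ Poisson(lam)."""
    rng = random.Random(seed)
    # Start from the stationary marginal, which is Poisson(lam / (1 - alpha)).
    x = poisson(lam / (1 - alpha), rng)
    path = []
    for _ in range(length):
        # Thinning replaces the multiplication alpha * x of a Gaussian AR(1).
        x = binomial_thinning(alpha, x, rng) + poisson(lam, rng)
        path.append(x)
    return path


path = simulate_inar1(alpha=0.5, lam=1.0, length=20000)
```

Every simulated value is a non-negative integer, and the sample mean of a long path should be close to $\lambda / (1 - \alpha)$, here 2.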

- 13/04/2016, 16:00 — 17:00 — Room P3.10, Mathematics Building

Vanda M. Lourenço,*FCT, Universidade Nova de Lisboa; CEMAT-IST* -
### Robust heritability and predictive accuracy estimation in plant breeding

Genomic prediction is used in plant breeding to help find the best genotypes for selection. Here, the accurate estimation of predictive accuracy (PA) and heritability (H) is essential for genomic selection (GS). As in other applications, field data are analyzed via regression models, which are known to lead to biased estimation when the normality premise is violated, biases that may translate into inaccurate H and PA estimates and negatively impact GS. Therefore, a robust analogue of a method from the literature used for H and PA estimation is presented. Both techniques are then compared through simulation.

(Joint work with Hans-Peter Piepho & Joseph O. Ogutu, Bioinformatics Unit, Institute of Crop Science, University of Hohenheim, Stuttgart, Germany)