
282 NAW 5/19 nr. 4 december 2018 Mathematics for Big Data Alessandro Di Bucchianico et al.

Research

Mathematics for Big Data

This essay highlights several examples of the use of mathematics and statistics to analyse problems involving Big Data. More often than not, mathematics is essential for extracting usable information from the data. However, it usually remains hidden under the bonnet, and the general public seems to take it for granted. With this paper Alessandro Di Bucchianico, Laura Iapichino, Nelly Litvak, Frank van der Meulen and Ron Wehrens want to show some essential contributions of mathematics to Big Data using successful real-life examples.

Alessandro Di Bucchianico
Department of Mathematics and Computer Science
Eindhoven University of Technology
a.d.bucchianico@tue.nl

Laura Iapichino
Department of Mathematics and Computer Science
Eindhoven University of Technology
l.iapichino@tue.nl

Nelly Litvak
Department of Applied Mathematics
University of Twente, and
Department of Mathematics and Computer Science
Eindhoven University of Technology
n.litvak@utwente.nl

Frank van der Meulen
Department of Applied Mathematics
Delft University of Technology
f.h.vandermeulen@tudelft.nl

Ron Wehrens
Biometris
Wageningen University & Research
ron.wehrens@wur.nl

'Big Data' has become a buzzword in the last decade, both in science and among the general public. Scientists from all areas encounter this in the shift of content and methods in their research as well as in current scientific funding programmes. For example, Big Data is one of the selected routes in the Dutch National Scientific Agenda (NWA), and the large funding programme Commit2Data was launched within the Dutch Digital Delta in 2016.

As the Big Data Team of the 4TU Applied Mathematics Institute, we feel that mathematicians should actively engage in Big Data activities. It is the goal of this paper to show the importance of mathematics in Big Data.

The role of mathematics is easy to overlook and not fully recognized, because technological advances are much more visible than mathematical advances, even though the latter often have more impact. Here is a small illustration. It is common knowledge that the speed-up of computers due to technological advances follows Moore's law: a doubling of speed every eighteen months. However, it is much less well known that the speed-up due to advances in mathematical methods in scientific computing and optimization is at least of the same order of magnitude, and in some areas even much higher (see [6, 23]).

In this essay we present several explicit real-life examples of the mathematics behind Big Data, highlighting the role and importance of specific areas of mathematics in these contexts. We show a wide variety of examples: search engines, virtual prototyping in manufacturing, data assimilation, web data analytics, healthcare, recommendation systems, genomics and other omics sciences, and precision farming. In this way, we hope to stimulate mathematicians to work on topics related to Big Data, as well as to encourage industries and researchers in computer science and other fields to collaborate with mathematicians in this direction.

Similar and more detailed accounts have appeared elsewhere; see, e.g., [11, 19], the National Research Council report [13], and the London Workshop Report on Statistics and Science (http://bit.ly/londonreport).


the modeling simplifications (structural uncertainty) and the uncertainty in knowing model parameters (parameter uncertainty). On the other hand, given a complicated mathematical model, it is important to know how accurately numerical methods can approximate specified outputs from this model.

The term Uncertainty Quantification is often used as a general term for scientific research in this area. There exist several mathematical approaches to study this uncertainty. One approach is to apply statistical techniques related to experimental design for computer experiments, such as Latin hypercube sampling and response surface methods. Another approach is to cast the mathematical model as a stochastic partial differential equation and try to solve that. Recent high-level mathematics combining analysis and stochastics is used here, such as perturbation expansion methods for random fields, stochastic operator expansions and polynomial chaos (Wiener chaos).
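The stratified sampling idea behind Latin hypercube sampling can be sketched in a few lines. This is an illustrative sketch, not code from the paper: each dimension is cut into n equal strata, every stratum receives exactly one point, and the strata are matched randomly across dimensions.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, seed=None):
    """Draw a Latin hypercube sample on the unit cube [0, 1]^n_dims.

    Each dimension is split into n_samples equal strata; exactly one
    point is drawn uniformly inside each stratum, and the strata are
    then shuffled independently in every dimension.
    """
    rng = np.random.default_rng(seed)
    # One uniform draw inside each of the n_samples strata, per dimension.
    u = rng.random((n_samples, n_dims))
    points = (np.arange(n_samples)[:, None] + u) / n_samples
    # Randomly match strata across dimensions.
    for d in range(n_dims):
        points[:, d] = rng.permutation(points[:, d])
    return points

sample = latin_hypercube(10, 3, seed=42)
# Every dimension has exactly one point in each stratum of width 1/10.
counts = np.floor(sample * 10).astype(int)
```

Compared to plain Monte Carlo with the same budget, this guarantees that every input range is covered, which is why it is popular for computer experiments.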

Model order reduction (MOR) techniques (see, e.g., [23]) have recently been introduced and exploited to overcome the severe computational times required for solving mathematical models of real-life processes. Over the past four decades, reduced-order models have been developed that aim at replacing the original large-dimensional numerical problem (typically called the high-fidelity approximation) by a reduced problem of substantially smaller dimension. Depending on the context, there are different strategies to generate the reduced problem from the high-fidelity one, e.g., Krylov subspace based methods, moment matching techniques, proper orthogonal decomposition, balanced truncation and reduced basis methods. The very short CPU times and limited storage capacities demanded by MOR methods make it possible to tackle a wide range of problems arising in engineering, computational science, and the physical and biological sciences.
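One of the strategies listed above, proper orthogonal decomposition, can be sketched compactly: collect "snapshot" solutions of a parametrized problem and compress them with a truncated SVD into a small reduced basis. The snapshot field below is a made-up toy example; in real applications the snapshots come from expensive high-fidelity solves.

```python
import numpy as np

# Toy parametrized field (an assumption for illustration):
# u(x; mu) = sin(pi*x) + mu * sin(2*pi*x) on a grid of 200 points.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
snapshots = np.column_stack(
    [np.sin(np.pi * x) + mu * np.sin(2 * np.pi * x) for mu in rng.random(50)]
)  # 200 x 50 matrix: one high-fidelity solution per parameter value

# POD: the leading left singular vectors form the reduced basis.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
basis = U[:, :2]  # two modes suffice: the toy field is exactly rank 2

# Project a new, unseen solution onto the reduced basis.
u_new = np.sin(np.pi * x) + 0.3 * np.sin(2 * np.pi * x)
u_reduced = basis @ (basis.T @ u_new)
error = np.linalg.norm(u_new - u_reduced) / np.linalg.norm(u_new)
```

In practice the reduced dimension is chosen from the decay of the singular values s, which quantifies how much of the snapshot energy each extra mode captures.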

Data assimilation

Weather forecasting, for some people the main reason to watch the news, is a data-intensive computational problem with many economic implications (agriculture, the hospitality business, airlines, healthcare, large public events). The change over time of measurable atmospheric quantities can be

huge: at the moment it would have hundreds of billions of rows and columns. In the beginning of this century, major speed gains were achieved due to sophisticated new methods from, mainly, linear algebra [5]. Another interesting mathematical and practical problem is the vulnerability of PageRank to deliberate manipulations, such as link farms created intentionally to boost the PageRank.

If we want to predict the effectiveness of ranking, it is also important to understand its relation to the network structure. Can we predict the largest PageRank, investigate its stability, pick up a signal from hidden communities? Can we use ranking to detect important changes in the network structure? A lot of empirical results are available, but they do not answer these questions in sufficient generality. To solve these and other problems we need to develop new approaches in probability theory and the theory of random graphs (see, e.g., [9]).

Virtual prototyping in manufacturing

High development costs in industry have led many manufacturers to replace building and testing physical prototypes by virtual prototyping, i.e., testing using large-scale simulations of extensive mathematical models based on physical principles. Specific examples are the automotive and aircraft industries (see, e.g., the Virtual Hybrid Testing Framework of Airbus). Such simulations should be handled with care, since there is uncertainty in the outcomes due to both model limitations and the numerical accuracy of the simulations, which often require solving large systems of differential equations. On the one hand, there is uncertainty due to replacing physical reality by a mathematical model. This involves both the uncertainty caused by

Search engines

The quality of a search engine depends greatly on the ranking algorithms that define in which order web pages appear for the user. This is crucial because most of us do not go beyond the first page of search results. Google's PageRank, at the very heart of the success of Google, was the first and most famous ranking algorithm.

The revolutionary idea of Google was that the importance of a web page depends not only on the quantity, but also on the quality of the links that point to this page. This can be seen in a small example from Wikipedia in Figure 1. The size of the nodes represents their PageRank score. Node B has a large PageRank because it has many incoming links. The PageRank of node C is high because it receives the only outgoing link from the important node B. Mathematically, the World Wide Web is modelled as a graph with pages as nodes and hyperlinks as directed edges, and then a large set of equations is solved to find the PageRank value of each node in the graph.

Right after PageRank was introduced, its fast computation became a problem of great interest, because the Google matrix is

Figure 1 PageRank, example from Wikipedia.

PageRank

'Easily bored' surfer. Consider a simple model of a surfer browsing web pages. With probability a, the surfer follows a randomly chosen outgoing link of a page, and with probability 1 - a the surfer is bored and jumps to a random page. Initially, Google used a = 0.85. The PageRank of a page is the stationary (long-run) probability that the surfer visits this page.

Eigenvector. Equivalently, PageRank is the so-called dominant left eigenvector of the transition matrix of the above process: the entry (i, j) of this matrix is the probability that the surfer on page i will proceed to page j. Such an eigenvector is unique. The PageRank of a web page is the corresponding component of this unique dominant left eigenvector.
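The 'easily bored' surfer model can be turned into a short power-iteration routine. This is an illustrative sketch, not Google's implementation, and the 4-page web below is hypothetical.

```python
import numpy as np

def pagerank(adj, a=0.85, tol=1e-12):
    """PageRank of the 'easily bored' surfer by power iteration.

    adj[i] lists the pages that page i links to.  With probability a the
    surfer follows a random outgoing link; with probability 1 - a (or
    from a page with no links at all) it jumps to a uniformly random page.
    """
    n = len(adj)
    pi = np.full(n, 1.0 / n)
    while True:
        new = np.full(n, (1.0 - a) / n)          # random-jump mass
        for i, links in enumerate(adj):
            if links:
                new[np.array(links)] += a * pi[i] / len(links)
            else:                                # dangling page: jump anywhere
                new += a * pi[i] / n
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new

# Hypothetical 4-page web: pages 1, 2 and 3 all link to page 0,
# and page 0 links only to page 3.
ranks = pagerank([[3], [0], [0], [0]])
# Page 0 collects many links; page 3 inherits rank from the important page 0.
```

Note how page 3, with a single incoming link, still outranks pages 1 and 2: its one link comes from the important page 0, exactly the quantity-versus-quality effect described above.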


turned out that the distance (the number of hops along the edges of the Facebook graph) between two Facebook users is on average less than 4!

Healthcare

Medical devices like MRI scanners obtain large image data at relatively low velocity. Efforts are undertaken to reduce the time it takes to make scans (typically thirty minutes), since hospitals could then use the expensive MRI equipment more efficiently and patients would suffer less from the unpleasantly high noise levels. Making scans at a lower resolution is not an option for medical reasons. An MRI scan uses magnetic fields to order the spins of hydrogen atoms, and radio waves to disturb these spins. When the spins return to their original position, energy is emitted. This energy is measured, which gives an indication of the amount of tissue. Using magnetic gradients it is possible to localize these measurements.

The mathematical bottom line of this procedure is that MRI scans produce Fourier coefficients one by one. Traditional approaches to reconstruction algorithms cannot yield the desired reduction of scanning time because of the so-called Nyquist-Shannon criterion. Again, advanced mathematical techniques have provided the breakthrough. The basic idea is to project

Probability theory has been essential in developing algorithms such as Count-Min Sketch, MinHash and HyperLogLog, which use random hash functions to store answers. Such algorithms may be accurate within 2% while using only memory in the order of the (iterated) logarithm of the original sample size. An important issue in developing these algorithms is to control the variance of the estimators, in order to get consistently accurate estimates.
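As an illustration of such hashing-based estimators, here is a minimal Count-Min sketch: d rows of w counters, one hashed counter per row for every item, and the minimum over rows as the estimate, which can over-count but never under-count. This is a teaching sketch with made-up item names, not a production implementation.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts in memory sublinear in the data."""

    def __init__(self, width=1000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One deterministic hash per row, reduced to a counter index.
        h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(h, 16) % self.width

    def add(self, item):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def count(self, item):
        # Minimum over rows: collisions only inflate counters, so the
        # estimate is always at least the true count.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
for _ in range(500):
    sketch.add("popular-item")
sketch.add("rare-item")
```

The one-sided error is the variance-control idea mentioned above in its simplest form: taking the minimum over several independent rows makes a large overestimate exponentially unlikely.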

HyperLogLog is one of the most elegant mathematical solutions for counting distinct objects in Big Data applications, and it is widely used in practice. Researchers at Google [15] state that Google's data analysis system PowerDrill routinely performs about five million 'count distinct objects' computations per day. In about one hundred cases, the resulting number is greater than one billion. In 2014 HyperLogLog was implemented in the data structure store Redis as well. An interesting human-interest note: the commands of HyperLogLog begin with PF, the initials of the French mathematician Philippe Flajolet, who developed this algorithm (see, e.g., [12]). Maybe even more exciting from a scientific point of view was the result in [3], where HyperLogLog was used to accomplish the incredible task of computing average distances in the complete Facebook graph of more than 700 million nodes. It

described in terms of dynamical systems, transferring information in time-ordered observed data to a physical model of the system. This process is often referred to as data assimilation. Its development has been highly influenced by professionals working in the atmospheric and oceanographic sciences. When discretized in space, a typical model for numerical weather prediction is a differential equation system of dimension of order 10⁹ [18]. The state variable of the dynamical system may represent unknown quantities such as, for example, velocity, temperature and pressure at a grid of locations.

The application of mathematical models to large dynamic data sets has naturally popped up in many other communities as well. Within signal processing, recovering the unknown state of the dynamical system is known as filtering or smoothing, where the first term refers to online recovery (as opposed to static recovery). Probabilists and statisticians usually speak of state and parameter estimation. Over the past thirty years there has been tremendous progress on this kind of problem. Under specific assumptions on the dynamical system, computationally efficient methods such as the (ensemble) Kalman filter can be used. In more general settings, a Bayesian formulation of the problem and the application of Markov chain Monte Carlo methods and sequential Monte Carlo methods can be exploited (see, e.g., [21, 22]). Whereas these methods are presently not yet applicable to weather forecasting, they have proved to be powerful in simplified problems of less demanding dimensions and constitute a very active area of research [8, 17].
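Under linear-Gaussian assumptions, the Kalman filter mentioned above reduces to a short predict/update recursion. The scalar model below is a toy sketch, nothing like a 10⁹-dimensional weather model: a hidden AR(1) state observed with noise.

```python
import numpy as np

rng = np.random.default_rng(1)
phi, q, r, T = 0.95, 0.1, 0.5, 200   # dynamics, process var., obs. var., steps

# Simulate a hidden state trajectory and the noisy observations of it.
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + rng.normal(0.0, np.sqrt(q))
y = x + rng.normal(0.0, np.sqrt(r), size=T)   # what we actually observe

# Kalman filter: alternate a model-based predict step with a data update.
m, p = 0.0, 1.0                      # prior mean and variance of the state
means = np.zeros(T)
for t in range(T):
    m_pred, p_pred = phi * m, phi**2 * p + q        # predict
    gain = p_pred / (p_pred + r)                    # Kalman gain
    m = m_pred + gain * (y[t] - m_pred)             # update with datum y_t
    p = (1.0 - gain) * p_pred
    means[t] = m

filter_mse = np.mean((means - x) ** 2)   # filtered estimate vs. hidden truth
obs_mse = np.mean((y - x) ** 2)          # raw observations vs. hidden truth
```

The filtered means combine the model and the data, and track the hidden state substantially better than the raw observations alone; this online blending of model and measurements is the essence of data assimilation.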

Web data analytics

Many companies collect large amounts of customer data through their web services. However, having these data does not mean that we already know everything. Even simple tasks like counting the number of distinct records in a large customer database (e.g., the number of distinct customers that use a certain service) require advanced mathematics. Exact counting is computationally prohibitive, mainly because we cannot keep all objects in the restricted working memory of a computer. However, we might not need that level of accuracy; in such cases it is often sufficient to work with approximate estimates.

HyperLogLog

Hash functions. Each digital object is converted to a sequence of zeros and ones using hash functions. On a set of different objects a good hash function appears as if randomly generated: zeros and ones each have probability 1/2, independently of each other.

Count zeros. The idea of LogLog-type algorithms is to sweep through the objects, keeping in memory only the largest number of zeros seen at the beginning of the hash values. For example, if we observed

00101,  10011,  01010,

we will remember 2, the largest number of leading zeros. Roughly, the probability to see 2 zeros followed by a one at the beginning of a hash value is (1/2)³ = 1/8, so we conclude that we saw approximately 8 objects!

HyperLogLog. In this form, the estimate is obviously too rough, so it cannot be directly used in practice. A lot of mathematics went into making the result more precise. This includes dividing the hash values into registers, using different corrections for small and large samples, and harmonic averages. All these ideas are included in HyperLogLog, ensuring its applicability. Further improvements are possible; this, e.g., was the goal of the paper [15].

Why LogLog? Assume we have N objects. Then the hash values have length log₂ N. Hence, the number of leading zeros is a number between 0 and log₂ N, so we need only log₂ log₂ N bits of memory to remember this number.
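The 'count zeros' idea from the box can be sketched directly. This is a deliberately rough LogLog-style toy, without the registers, harmonic averaging and bias corrections that make real HyperLogLog accurate.

```python
import hashlib

def leading_zeros(bits):
    """Number of zeros before the first one in a bit string."""
    return len(bits) - len(bits.lstrip("0"))

def rough_distinct_count(items, nbits=32):
    """Toy LogLog-style estimate of the number of distinct items.

    Hash every item, remember only the maximum number of leading zeros
    R over all hash values, and estimate 2^(R+1) distinct items, as in
    the '2 zeros followed by a one has probability 1/8' argument above.
    """
    max_zeros = 0
    for item in items:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16)
        bits = format(h & (2**nbits - 1), f"0{nbits}b")
        max_zeros = max(max_zeros, leading_zeros(bits))
    return 2 ** (max_zeros + 1)

# Duplicates cannot change the estimate: the maximum stays the same,
# which is exactly why this kind of sketch counts *distinct* objects.
est_a = rough_distinct_count(range(10000))
est_b = rough_distinct_count(list(range(10000)) * 3)
```

Note how little state survives the sweep: a single small integer, which is where the log log N memory bound in the box comes from.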


to certain traits or treatment effects. Network analysis is getting more and more attention (see, e.g., [20]) as a means to bring experimental results into the realm of the things we already know about the biology of the system; one of the main challenges is to combine the different omics data layers into coherent models that explain the behaviour of the system under study [14].

Precision farming

Agriculture is rapidly becoming a data-rich environment, with tractors currently connected to the Internet 24/7 and resembling computers on (large) wheels rather than the dusty and primitive muscle-machines they were in the twentieth century. As a result, new questions can be addressed that were unthinkable only ten or twenty years ago: by combining several different information sources (satellite images, plant growth models, management data at plot level), the farmer can, e.g., try to devise optimal strategies to deliver the right amount of water and nutrients to his land and in this way obtain the highest possible yield (see, e.g., [4, 10] and many others).

Here, the problems are the typical big-data problems: even assuming one has access to all databases and knows how to read and use the data, it is not a trivial question how to combine data with very different characteristics, found in different locations and measured for different purposes. One thing is certain: mathematics and statistics play a pivotal role.

Genomics and other omics sciences

Now that technology has become available (and affordable!) to rapidly obtain information about the genetic composition of biological samples, huge quantities of data are generated routinely. This is not only true when looking at genetic information (hence the term genomics), but also when looking at proteins (proteomics) and metabolites (metabolomics), to name just two other members of the 'omics' family. The Big Data aspect here refers to the huge amount of information that we have on a relatively small number of subjects. A typical example is genetic information on humans, animals or plants that consists of millions of measurements (data points) for each subject. The resulting 'high-dimensional' data require the development of new statistical techniques to draw correct conclusions, because traditional statistical methods for such data lead to an unacceptably high number of false positives (see, e.g., [7]).
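A quick simulation illustrates why naive testing breaks down at this scale: under the null hypothesis p-values are uniform, so a 0.05 threshold flags about 5% of all features even when nothing is going on. This is an illustration only, using the simple Bonferroni correction; the methods developed for high-dimensional data [7] go far beyond it.

```python
import random

random.seed(3)
n_features = 100_000                 # e.g. genetic markers, all truly null
p_values = [random.random() for _ in range(n_features)]

# Naive per-feature testing at level 0.05: thousands of false positives.
naive_hits = sum(p < 0.05 for p in p_values)

# Bonferroni: test each feature at level 0.05 / n_features instead.
bonferroni_hits = sum(p < 0.05 / n_features for p in p_values)
```

With 100,000 null features the naive rule reports around 5,000 'discoveries' that are pure noise, while the corrected rule reports essentially none; controlling this trade-off without destroying power is what the modern high-dimensional methods are about.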

Furthermore, advanced data processing methods are needed to convert the measured data into information; one example is the BLAST algorithm [2] (incidentally also the most highly cited paper of the nineties), used to align sequences of nucleotides or amino acids with database entries. In each case we are confronted with the issue mentioned before: we know an awful lot about very few samples, which makes statistical analysis extremely hard. Typical questions are finding genes, proteins or metabolites related

the observed data onto a smaller subspace using sparsity in the data. Remarkably, random projections yield sampling strategies and reconstruction algorithms that outperform traditional signal processing techniques. These methods are known under the name compressed sensing. For other applications of compressed sensing in healthcare, we refer to https://www.healthcare.siemens.nl/magnetic-resonance-imaging/clinical-specialities/compressed-sensing.

Compressed sensing has been applied successfully in a wide range of other tasks as well, including network tomography, electron microscopy and facial recognition.
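The sparse-recovery idea can be shown in miniature: a signal with only a few nonzeros is reconstructed from far fewer random measurements than its length. The sketch below uses orthogonal matching pursuit, one standard reconstruction algorithm; the dimensions and the signal are made up for illustration, and this is nothing like a clinical MRI pipeline.

```python
import numpy as np

rng = np.random.default_rng(7)
n, m, k = 200, 100, 3                 # signal length, measurements, sparsity

# A k-sparse signal and a random Gaussian measurement matrix.
signal = np.zeros(n)
idx = rng.choice(n, size=k, replace=False)
signal[idx] = [20.0, 25.0, 30.0]
A = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))
y = A @ signal                        # only m = 100 numbers are observed

# Orthogonal matching pursuit: greedily pick the column most correlated
# with the residual, then re-fit on the selected support.
support = []
residual = y.copy()
for _ in range(k):
    support.append(int(np.argmax(np.abs(A.T @ residual))))
    coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
    residual = y - A[:, support] @ coef

recovered = np.zeros(n)
recovered[support] = coef
error = np.linalg.norm(recovered - signal) / np.linalg.norm(signal)
```

Here 100 random projections of a length-200 signal suffice because the signal has only 3 active components, which is the same sparsity principle that lets an MRI scan get by with fewer Fourier coefficients.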

Recommender systems

Webshops like Amazon analyze the buying behaviour of their customers and present visitors of the Amazon website with recommendations of books and other items that may be of interest. In a similar way Netflix gives suggestions for movies to its customers. A way to provide such recommendations is to set up a matrix of user ratings of movies (rows are users, columns are movies, entries are ratings). Of course, such a matrix has many empty entries, since there are many more movies (Netflix has around 20,000) than people can see and rate.

The idea behind recommender systems is that there are relatively few 'latent' features that drive our preferences (a sparsity principle). That is, there are a few typical items (books or movies) and a few typical users. Translated into matrices, this means looking for a nonnegative matrix factorization of the preference matrix: a very large and sparse preference matrix is represented as a product of two matrices of much lower dimensions. Although computers keep becoming faster, this is mainly an increase in CPU speed and much less an increase in memory speed. Factorizations of large matrices, however, require a huge amount of communication between working memory and storage memory. There is thus a need for memory-efficient factorization algorithms that go far beyond traditional factorization algorithms for singular value decompositions (see, e.g., [16] for a technical account by the team that won the one-million-dollar Netflix competition). An exciting new approach in this field is the use of randomized methods like stochastic gradient algorithms (see [1]).

Figure 2 Smart agriculture. Photo: Shutterstock, MONOPOLY919
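The stochastic gradient approach to matrix factorization can be sketched on a toy ratings matrix: only the observed entries enter the loss, and the two low-rank factors are updated one rating at a time. This is an illustration in the spirit of [1, 16], not the Netflix-winning system; all dimensions, rates and data below are made up.

```python
import random
import numpy as np

rng = np.random.default_rng(5)
random.seed(5)
n_users, n_items, rank = 30, 20, 3

# Ground-truth low-rank preferences; keep ~30% of entries as "ratings".
true = rng.random((n_users, rank)) @ rng.random((rank, n_items))
mask = rng.random((n_users, n_items)) < 0.3
observed = [(u, i, true[u, i]) for u in range(n_users)
            for i in range(n_items) if mask[u, i]]

# Factor matrices U (users x rank) and V (items x rank), fit by SGD.
U = 0.1 * rng.random((n_users, rank))
V = 0.1 * rng.random((n_items, rank))
lr, reg = 0.05, 1e-4
for _ in range(200):
    random.shuffle(observed)
    for u, i, r in observed:
        err = r - U[u] @ V[i]             # error on one observed rating
        u_row = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])
        V[i] += lr * (err * u_row - reg * V[i])

rmse = np.sqrt(np.mean([(r - U[u] @ V[i]) ** 2 for u, i, r in observed]))
# U @ V.T now also predicts the ~70% of entries that were never observed.
```

The memory pattern is the point: each update touches one row of U and one row of V, so the algorithm streams through the ratings instead of materializing the full matrix, which is what makes the approach attractive at Netflix scale.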


Conclusion

Mathematics and statistics, being extremely generic tools, have played an important part in technological and scientific developments over the last centuries, and will continue to do so in this Big Data era. Not only will they contribute to solving problems faster and more efficiently, they will expand our horizon, exposing questions that we never thought about and maybe did not even expect to be solvable. It is important to realize that advances in this area have both a push and a pull component: without being confronted with real-life problems we might lack the incentive or the direction to pursue promising avenues, but without fundamental knowledge we simply lack the tools to tackle the problems successfully. This was expressed in a concise way by Bin Yu in her 2014 Institute of Mathematical Statistics presidential address:

"Work on real problems, relevant theory will follow."

(see http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science). Hence the stress on the applications in this paper: mathematics needs them, just like the applications need mathematics.

References

1 C. C. Aggarwal, Recommender Systems, Springer, 2016.
2 S. Altschul, W. Gish, W. Miller, E. Myers and D. Lipman, Basic local alignment search tool, Journal of Molecular Biology 215(3) (1990), 403–410.
3 L. Backstrom, P. Boldi, M. Rosa, J. Ugander and S. Vigna, Four degrees of separation, Proceedings of the 4th Annual ACM Web Science Conference, 2012, pp. 33–42.
4 J. Behmann, A. Mahlein, T. Rumpf, C. Römer and L. Plümer, A review of advanced machine learning methods for the detection of biotic stress in precision crop protection, Precision Agriculture 16 (2015), 239–260.
5 P. Berkhin, A survey on PageRank computing, Internet Mathematics 2(1) (2005), 73–120.
6 R. E. Bixby, A brief history of linear and mixed-integer programming computation, Documenta Mathematica (2012), 107–121.
7 P. Bühlmann and S. A. van de Geer, Statistics for High-dimensional Data: Methods, Theory and Applications, Springer, 2013.
8 A. Cuzol and É. Mémin, A stochastic filtering technique for fluid flow velocity fields tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(7) (2009), 1278–1293.
9 N. Chen, N. Litvak and M. Olvera-Cravioto, Generalized PageRank on directed configuration networks, Random Structures & Algorithms 51(2) (2017), 237–274.
10 D. E. Clay, S. A. Clay and S. A. Bruggeman, eds., Practical Mathematics for Precision Farming, ASA, CSSA and SSSA, 2017.
11 J. Fan, F. Han and H. Liu, Challenges of Big Data analysis, National Science Review 1(2) (2014), 293–314.
12 P. Flajolet, É. Fusy, O. Gandouet and F. Meunier, HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm, AofA '07: Proceedings of the 2007 International Conference on Analysis of Algorithms, 2007.
13 Frontiers in Massive Data Analysis, National Academies Press, 2013.
14 R. D. Hawkins, G. C. Hon and B. Ren, Next-generation genomics: an integrative approach, Nature Reviews Genetics 11 (2010), 476–486.
15 S. Heule, M. Nunkesser and A. Hall, HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm, Proceedings of the 16th International Conference on Extending Database Technology, 2013, pp. 683–692.
16 Y. Koren, R. Bell and C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42(8) (2009), 30–37.
17 K. J. H. Law and A. M. Stuart, Evaluating data assimilation algorithms, Monthly Weather Review 140 (2012), 3757–3782.
18 K. J. H. Law, A. M. Stuart and K. C. Zygalakis, Data Assimilation: A Mathematical Introduction, Springer Texts in Applied Mathematics, Vol. 62, Springer, 2015.
19 B. G. Lindsay, J. Kettenring and D. O. Siegmund, A report on the future of statistics, Statistical Science 19(3) (2004), 387–413.
20 K. Mitra, A. R. Carvunis, S. K. Ramesh and T. Ideker, Integrative approaches for finding modular structure in biological networks, Nature Reviews Genetics 14 (2013), 719–732.
21 C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd edition, Springer Texts in Statistics, Springer, 2004.
22 S. Särkkä, Bayesian Filtering and Smoothing, Cambridge University Press, 2013.
23 W. Schilders, Introduction to model order reduction, in Model Order Reduction: Theory, Research Aspects and Applications, Mathematics in Industry, Springer, 2008, pp. 3–32.

ResearchGate has not been able to resolve any citations for this publication.

  • Wil Schilders Wil Schilders

In this first section we present a high level discussion on computational science, and the need for compact models of phenomena observed in nature and industry. We argue that much more complex problems can be addressed by making use of current computing technology and advanced algorithms, but that there is a need for model order reduction in order to cope with even more complex problems. We also go into somewhat more detail about the question as to what model order reduction is.

  • Kody Law Kody Law
  • Andrew M. Stuart

Data assimilation leads naturally to a Bayesian formulation in which the posterior probability distribution of the system state, given the observations, plays a central conceptual role. The aim of this paper is to use this Bayesian posterior probability distribution as a gold standard against which to evaluate various commonly used data assimilation algorithms. A key aspect of geophysical data assimilation is the high dimensionality and low predictability of the computational model. With this in mind, yet with the goal of allowing an explicit and accurate computation of the posterior distribution, we study the 2D Navier-Stokes equations in a periodic geometry. We compute the posterior probability distribution by state-of-the-art statistical sampling techniques. The commonly used algorithms that we evaluate against this accurate gold standard, as quantified by comparing the relative error in reproducing its moments, are 4DVAR and a variety of sequential filtering approximations based on 3DVAR and on extended and ensemble Kalman filters. The primary conclusions are that: (i) with appropriate parameter choices, approximate filters can perform well in reproducing the mean of the desired probability distribution; (ii) however they typically perform poorly when attempting to reproduce the covariance; (iii) this poor performance is compounded by the need to modify the covariance, in order to induce stability. Thus, whilst filters can be a useful tool in predicting mean behavior, they should be viewed with caution as predictors of uncertainty. These conclusions are intrinsic to the algorithms and will not change if the model complexity is increased, for example by employing a smaller viscosity, or by using a detailed NWP model.

  • Christian P. Robert
  • George Casella

La simulation est devenue dans la dernière décennie un outil essentiel du traitement statistique de modèles complexes et de la mise en oeuvre de techniques statistiques avancées, comme le bootstrap ou les méthodes d'inférence simulée. Ce livre présente les éléments de base de la simulation de lois de probabilité (génération de variables uniformes et de lois usuelles) et de leur utilisation en Statistique (intégration de Monte Carlo, optimisation stochastique). Après un bref rappel sur les chaînes de Markov, les techniques plus spécifiques de Monte Carlo par chaînes de Markov (MCMC) sont présentées en détail, à la fois du point de vue théorique (validité et convergence) et du point de vue de leur implémentation (accélération, choix de paramètres, limitations). Les algorithmes d'échantillonnage de Gibbs sont ainsi distingués des méthodes générales de Hastings-Metropolis par leur plus grande richesse théorique. Les derniers chapitres contiennent un exposé critique sur l'état de l'art en contrôle de convergence de ces algorithmes et une présentation unifiée des diverses applications des méthodes MCMC aux modèles à données manquantes. De nombreux exemples statistiques illustrent les méthodes présentées dans cet ouvrage destiné aux étudiants de deuxième et troisième cycles universitaires en Mathématiques Appliquées ainsi qu'aux chercheurs et praticiens désirant utiliser les méthodes MCMC. Monte Carlo statistical methods, particularly those based on Markov chains, are now an essential component of the standard set of techniques used by statisticians. This new edition has been revised towards a coherent and flowing coverage of these simulation techniques, with incorporation of the most recent developments in the field. 
In particular, the introductory coverage of random variable generation has been totally revised, with many concepts being unified through a fundamental theorem of simulation There are five completely new chapters that cover Monte Carlo control, reversible jump, slice sampling, sequential Monte Carlo, and perfect sampling. There is a more in-depth coverage of Gibbs sampling, which is now contained in three consecutive chapters. The development of Gibbs sampling starts with slice sampling and its connection with the fundamental theorem of simulation, and builds up to two-stage Gibbs sampling and its theoretical properties. A third chapter covers the multi-stage Gibbs sampler and its variety of applications. Lastly, chapters from the previous edition have been revised towards easier access, with the examples getting more detailed coverage. This textbook is intended for a second year graduate course, but will also be useful to someone who either wants to apply simulation techniques for the resolution of practical problems or wishes to grasp the fundamental principles behind those methods. The authors do not assume familiarity with Monte Carlo techniques (such as random variable generation), with computer programming, or with any Markov chain theory (the necessary concepts are developed in Chapter 6). A solutions manual, which covers approximately 40% of the problems, is available for instructors who require the book for a course. oui

  • Anne Cuzol
  • Étienne Mémin

In this paper, we present a method for the temporal tracking of fluid flow velocity fields. The technique we propose is formalized within a sequential Bayesian filtering framework. The filtering model combines an Itô diffusion process coming from a stochastic formulation of the vorticity-velocity form of the Navier-Stokes equation and discrete measurements extracted from the image sequence. In order to handle a state space of reasonable dimension, the motion field is represented as a combination of adapted basis functions, derived from a discretization of the vorticity map of the fluid flow velocity field. The resulting nonlinear filtering problem is solved with the particle filter algorithm in continuous time. An adaptive dimensional reduction method is applied to the filtering technique, relying on dynamical systems theory. The efficiency of the tracking method is demonstrated on synthetic and real-world sequences.
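The sequential Bayesian filtering framework used here can be sketched with a toy bootstrap particle filter on a one-dimensional linear-Gaussian model; this is a drastic simplification of the paper's vorticity-based Navier-Stokes model, and all parameters below are made up for illustration:

```python
import math
import random

def bootstrap_particle_filter(observations, n_particles=500, seed=1):
    """Bootstrap particle filter for a hypothetical 1-D model:
        state:   x_t = 0.9 * x_{t-1} + N(0, 0.5^2)
        observe: y_t = x_t + N(0, 0.5^2)
    """
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # 1. Propagate each particle through the state dynamics
        particles = [0.9 * x + rng.gauss(0.0, 0.5) for x in particles]
        # 2. Weight particles by the Gaussian observation likelihood
        weights = [math.exp(-0.5 * ((y - x) / 0.5) ** 2) for x in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Posterior mean estimate at this time step
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # 3. Multinomial resampling to avoid weight degeneracy
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

estimates = bootstrap_particle_filter([2.0] * 30)
```

The propagate-weight-resample loop is the generic scheme; the paper's contribution lies in the choice of dynamics (an Itô diffusion from the vorticity-velocity formulation) and the dimension reduction of the state space.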

A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straightforward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.
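The seed-and-extend idea behind BLAST can be sketched in a few lines. This toy version is ungapped, extends seeds only to the right, and uses hypothetical scoring parameters; it is not the actual BLAST algorithm, which also extends leftward and uses statistically calibrated scores:

```python
def seed_and_extend(query, subject, k=4, match=1, mismatch=-1):
    """Toy seed-and-extend: index k-mers of the subject, then extend
    each shared seed without gaps, keeping the best-scoring segment."""
    # Index every k-mer of the subject sequence
    index = {}
    for i in range(len(subject) - k + 1):
        index.setdefault(subject[i:i + k], []).append(i)
    best = (0, "")  # (score, aligned query segment)
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            # Extend to the right from the exact-match seed
            score = best_score = k * match
            qi, si, end = q + k, s + k, q + k
            while qi < len(query) and si < len(subject):
                score += match if query[qi] == subject[si] else mismatch
                qi += 1
                si += 1
                if score > best_score:
                    best_score, end = score, qi
            if best_score > best[0]:
                best = (best_score, query[q:end])
    return best

score, segment = seed_and_extend("GGACGTACGTTT", "CCACGTACGTAA")
```

Restricting attention to segments anchored on exact k-mer matches is what makes the approach an order of magnitude faster than exhaustive local alignment, at a small cost in sensitivity.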

This paper studies the distribution of a family of rankings, which includes Google's PageRank, on a directed configuration model. In particular, it is shown that the distribution of the rank of a randomly chosen node in the graph converges in distribution to a finite random variable $\mathcal{R}^*$ that can be written as a linear combination of i.i.d. copies of the attracting endogenous solution to a stochastic fixed-point equation of the form $\mathcal{R} \stackrel{\mathcal{D}}{=} \sum_{i=1}^{\mathcal{N}} \mathcal{C}_i \mathcal{R}_i + \mathcal{Q}$, where $(\mathcal{Q}, \mathcal{N}, \{\mathcal{C}_i\})$ is a real-valued vector with $\mathcal{N} \in \mathbb{N}$, and the $\mathcal{R}_i$ are i.i.d. copies of $\mathcal{R}$, independent of $(\mathcal{Q}, \mathcal{N}, \{\mathcal{C}_i\})$. Moreover, we provide precise asymptotics for the limit $\mathcal{R}^*$, which, when the in-degree distribution in the directed configuration model has a power law, imply a power law distribution for $\mathcal{R}^*$ with the same exponent. © 2016 Wiley Periodicals, Inc. Random Struct. Alg., 51, 237–274, 2017
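For reference, the finite-graph PageRank whose large-graph limit this paper analyzes can be computed by simple power iteration; below is a minimal sketch on a hypothetical four-node graph (damping factor 0.85 is the conventional choice):

```python
def pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank on a small directed graph, given as a
    dict mapping each node to its list of out-neighbours."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Teleportation term, shared uniformly by all nodes
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if outs:
                # Each node passes its damped rank equally to its out-links
                share = damping * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes
                for w in nodes:
                    new[w] += damping * rank[v] / n
        rank = new
    return rank

# Hypothetical graph: three pages link into "c", which links back to "a"
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

The iteration conserves total rank mass, so the values form a probability distribution over nodes; the paper's result describes how this distribution behaves for a randomly chosen node as the graph size grows.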

Effective crop protection requires early and accurate detection of biotic stress. In recent years, remarkable results have been achieved in the early detection of weeds, plant diseases and insect pests in crops. These achievements are related both to the development of non-invasive, high resolution optical sensors and data analysis methods that are able to cope with the resolution, size and complexity of the signals from these sensors. Several methods of machine learning have been utilized for precision agriculture such as support vector machines and neural networks for classification (supervised learning); k-means and self-organizing maps for clustering (unsupervised learning). These methods are able to calculate both linear and non-linear models, require few statistical assumptions and adapt flexibly to a wide range of data characteristics. Successful applications include the early detection of plant diseases based on spectral features and weed detection based on shape descriptors with supervised or unsupervised learning methods. This review gives a short introduction into machine learning, analyses its potential for precision crop protection and provides an overview of instructive examples from different fields of precision agriculture.
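One of the unsupervised methods mentioned, k-means clustering, can be sketched with Lloyd's algorithm on toy two-dimensional feature vectors; the data and the "healthy"/"stressed" interpretation below are hypothetical, and the initialization is deterministic for reproducibility:

```python
def kmeans(points, k, iters=10):
    """Minimal Lloyd's algorithm for k-means on 2-D points."""
    # Deterministic initialization: the first k points serve as centers
    centers = points[:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean)
            j = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[j].append(p)
        # Move each center to the mean of its assigned points
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    return centers

# Two well-separated hypothetical clusters of spectral features,
# e.g. "healthy" near (0, 0) and "stressed" near (8.5, 8.5)
points = [(0, 0), (8, 8), (1, 0), (9, 8), (0, 1), (8, 9), (1, 1), (9, 9)]
centers = sorted(kmeans(points, k=2))
```

In a real precision-agriculture pipeline the features would be high-dimensional spectral or shape descriptors rather than two coordinates, but the assignment/update loop is the same.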

A central goal of systems biology is to elucidate the structural and functional architecture of the cell. To this end, large and complex networks of molecular interactions are being rapidly generated for humans and model organisms. A recent focus of bioinformatics research has been to integrate these networks with each other and with diverse molecular profiles to identify sets of molecules and interactions that participate in a common biological function - that is, 'modules'. Here, we classify such integrative approaches into four broad categories, describe their bioinformatic principles and review their applications.
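A deliberately simple proxy for such network integration: intersect two hypothetical interaction networks and report the connected components of the shared edges as candidate "modules". Real integrative module-detection methods are far more sophisticated, but this conveys the basic idea of combining networks to find jointly supported groups of molecules:

```python
def shared_modules(net_a, net_b):
    """Keep only interactions present in both (undirected) networks,
    then return connected components of the shared network."""
    norm = lambda e: tuple(sorted(e))
    shared = {norm(e) for e in net_a} & {norm(e) for e in net_b}
    # Union-find over the nodes touched by shared interactions
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in shared:
        parent[find(u)] = find(v)
    components = {}
    for node in list(parent):
        components.setdefault(find(node), set()).add(node)
    return sorted(sorted(c) for c in components.values())

# Hypothetical protein-interaction edge lists from two experiments
net_a = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
net_b = [("B", "A"), ("B", "C"), ("X", "Y"), ("D", "E")]
modules = shared_modules(net_a, net_b)
```

Edge intersection is the crudest possible integration rule; the review's four categories of methods replace it with statistical models that weigh evidence across heterogeneous data types.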