Research

Mathematics for Big Data

This essay highlights several examples of using mathematics and statistics to analyse problems involving Big Data. More often than not, mathematics is essential for extracting usable information from the data. However, it usually remains hidden under the bonnet, and the general public seems to take it for granted. With this paper Alessandro Di Bucchianico, Laura Iapichino, Nelly Litvak, Frank van der Meulen and Ron Wehrens want to show some essential contributions of the fields of mathematics to Big Data using successful real-life examples.

Alessandro Di Bucchianico
Department of Mathematics and Computer Science
Eindhoven University of Technology
a.d.bucchianico@tue.nl

Laura Iapichino
Department of Mathematics and Computer Science
Eindhoven University of Technology
l.iapichino@tue.nl

Nelly Litvak
Department of Applied Mathematics
University of Twente, and
Department of Mathematics and Computer Science
Eindhoven University of Technology
n.litvak@utwente.nl

Frank van der Meulen
Department of Applied Mathematics
Delft University of Technology
f.h.vandermeulen@tudelft.nl

Ron Wehrens
Biometris
Wageningen University & Research
ron.wehrens@wur.nl

'Big Data' has become a buzzword in the last decade, both in science and among the general public. Scientists from all areas encounter it in the shift of content and methods in their research as well as in current scientific funding programmes. For example, Big Data is one of the selected routes in the Dutch National Scientific Agenda (NWA), and the large funding programme Commit2Data was launched within the Dutch Digital Delta in 2016.

As the Big Data Team of the 4TU Applied Mathematics Institute, we feel that mathematicians should actively engage in Big Data activities. It is the goal of this paper to show the importance of mathematics in Big Data.

The role of mathematics is easily overlooked and not fully recognized, because technological advances are much more visible than mathematical advances, even though the latter often have more impact. Here is a small illustration. It is common knowledge that the speed-up of computers due to technological advances follows Moore's Law: a doubling of speed every eighteen months. However, it is much less well known that the speed-up due to advances in mathematical methods in scientific computing and optimization is at least of the same order of magnitude, and in some areas even much higher (see [6, 23]).

In this essay we present several explicit real-life examples of the mathematics behind Big Data, highlighting the role and importance of specific areas of mathematics in these contexts. We show a wide variety of examples: search engines, virtual prototyping in manufacturing, data assimilation, web data analytics, healthcare, recommendation systems, genomics and other omics sciences, and precision farming. In this way, we hope to stimulate mathematicians to work on topics related to Big Data, as well as to encourage industries and researchers in computer science and other fields to collaborate with mathematicians in this direction.

Similar and more detailed accounts have appeared elsewhere, see, e.g., [11, 19], National Research Council (2013) and the London Workshop Report on Statistics and Science (http://bit.ly/londonreport).
the modeling simplifications (structural un-
certainty) and the uncertainty in knowing
model parameters (parameter uncertainty).
On the other hand, given a complicated
mathematical model, it is important to
know how accurately numerical methods
can approximate specified outputs from
this model.
The term Uncertainty Quantification is often used as a general term for scientific research in this area. There exist several
mathematical approaches to study this un-
certainty. One such approach is applying
statistical techniques related to experimen-
tal design for computer experiments like
Latin hypercube sampling and response
surface methods. Another approach is to
cast the mathematical model as a stochas-
tic partial differential equation and try to
solve that. Here, recent high-level mathematics combining analysis and stochastics is used, such as perturbation expansion methods for random fields, stochastic operator expansions and polynomial chaos (Wiener chaos) expansions.
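To make the experimental-design idea concrete, here is a minimal sketch of Latin hypercube sampling on the unit cube, written in Python with NumPy; the function name and the toy dimensions are illustrative choices and do not come from any of the cited works. Each parameter range is split into as many equal bins as there are design points, and every bin is hit exactly once per parameter.

import numpy as np

def latin_hypercube(n_samples, n_dims, rng=None):
    # Draw a Latin hypercube sample on the unit cube [0, 1]^n_dims: every
    # dimension is stratified into n_samples equal bins and each bin is
    # hit exactly once.
    rng = np.random.default_rng(rng)
    u = rng.random((n_samples, n_dims))                  # one point inside each bin
    points = (np.arange(n_samples)[:, None] + u) / n_samples
    for d in range(n_dims):                              # shuffle bins per dimension
        points[:, d] = rng.permutation(points[:, d])
    return points

# Example: a 10-point design for a simulator with 3 uncertain parameters.
print(latin_hypercube(10, 3, rng=42))

The design points can then be mapped to the actual parameter ranges of the simulator and used to fit, for instance, a response surface.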
Model order reduction (MOR) tech-
niques (see, e.g., [23] ) have been recent-
ly introduced and exploited to overcome
the issue of severe computational times
required for solving mathematical models
of real-life processes. Over the past four
decades, reduced-order models have been
developed aimed at replacing the origi-
nal large-dimension numerical problem
(typically called high-fidelity approxima-
tion) by a reduced problem of substan-
tially smaller dimension. Depending on
the context, there are different strategies
to generate the reduced problem from the
high-fidelity one, e.g., Krylov subspace
based methods, moment matching tech-
niques, proper orthogonal decomposi-
tion, balanced truncation, reduced basis
methods. The very short CPU times and limited storage capacities demanded by today's MOR methods make it possible to tackle a wide range of problems arising in engineering, computational science, and the physical and biological sciences.
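As an illustration of one such strategy, the following sketch (Python/NumPy; the snapshot data are a synthetic stand-in, not output of a real simulation) builds a reduced basis by proper orthogonal decomposition: take the singular value decomposition of a matrix of high-fidelity snapshots and keep only the few leading modes.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical snapshot matrix: each column stands in for one expensive
# high-fidelity solution (e.g. a discretized PDE solved for one parameter
# value); here it is synthetic low-rank data plus a little noise.
n_dof, n_snapshots = 5000, 40
snapshots = rng.standard_normal((n_dof, 5)) @ rng.standard_normal((5, n_snapshots))
snapshots += 1e-3 * rng.standard_normal((n_dof, n_snapshots))

# Proper orthogonal decomposition: SVD of the snapshot matrix.
U, s, _ = np.linalg.svd(snapshots, full_matrices=False)

# Keep just enough modes to capture 99.9% of the snapshot 'energy'.
energy = np.cumsum(s**2) / np.sum(s**2)
r = int(np.searchsorted(energy, 0.999)) + 1
basis = U[:, :r]                                   # n_dof x r, with r << n_dof
print("reduced dimension:", r, "instead of", n_dof)

# Any high-fidelity state can now be compressed to r coefficients.
x = snapshots[:, 0]
x_hat = basis @ (basis.T @ x)
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))

In a full reduced basis or POD method, the governing equations are then projected onto this low-dimensional basis, so that new parameter values can be simulated at a fraction of the original cost.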
Data assimilation
Weather forecasting, for some people the
main reason to watch the news, is a data-
intensive computational problem with many
economic implications (agriculture, hospi-
tality business, airlines, healthcare, large
public events). The change over time of
measurable atmospheric quantities can be
huge, at the moment it would have hun-
dreds of billions of rows and columns. In
the beginning of this century, major speed
gains were achieved due to sophisticated
new methods from, mainly, linear alge-
bra [5]. Another interesting mathematical
and practical problem is the vulnerability
of PageRank to deliberate manipulations,
such as link farms created intentionally to
boost the PageRank.
If we want to predict the effectiveness of ranking, it is also important to understand
its relation to the network structure. Can
we predict the largest PageRank, inves-
tigate its stability, pick up a signal from
hidden communities? Can we use ranking
to detect important changes in the net-
work structure? A lot of empirical results
are available but they do not answer these
questions in sufficient generality. To solve
these and other problems we need to de-
velop new approaches in probability theo-
ry and the theory of random graphs (see
e.g. [9] ).
Virtual prototyping in manufacturing
High development costs in industry have
led many manufacturers to replace build-
ing and testing physical prototypes by
virtual prototyping, i.e., testing using
large-scale simulations of extensive math-
ematical models based on physical princi-
ples. Specific examples are the automotive
industry and the aircraft industry (see, e.g.,
the Virtual Hybrid Testing Framework of Air-
bus). Such simulations should be handled
with care since there is uncertainty in the
outcomes due to both model limitations
and the numerical accuracy of the simula-
tions, often requiring solving large systems
of differential equations. On the one hand
there is uncertainty due to replacing phys-
ical reality by a mathematical model. This
involves both the uncertainty caused by
Search engines
The quality of a search engine depends
greatly on ranking algorithms that define
in which order web pages will appear for
the user. This is indeed crucial because
most of us do not go beyond the first
page of search results. Google's PageRank,
at the very heart of the success of Google,
was the first and most famous ranking al-
gorithm.
The revolutionary idea of Google was that the importance of a web page depends not only on the quantity, but also on the quality, of the links that point to it. This can be seen in the small example from Wikipedia in Figure 1.
The size of the nodes represents their
PageRank score. Node B has a large Page-
Rank because it has many incoming links.
The PageRank of node C is high because
it received the only outgoing link from
the important node B. Mathematically, the
World Wide Web is modelled as a graph
with pages as nodes and hyperlinks as di-
rected edges, and then a large set of equa-
tions is solved to find the PageRank values
for each node in the graph.
Right after PageRank was introduced,
its fast computation became a problem of
great interest because the Google matrix is
Figure 1 PageRank, example from Wikipedia.
PageRank
'Easily bored' surfer. Consider a simple model of a surfer browsing web pages. With probability a, the surfer follows a randomly chosen outgoing link of a page, and with probability 1 - a the surfer is bored and jumps to a random page. Initially, Google used a = 0.85. The PageRank of a page is the stationary (long-run) probability that the surfer visits this page.

Eigenvector. Equivalently, PageRank is the so-called dominant left eigenvector of the transition matrix of the above process: the entry (i, j) of this matrix is the probability that the surfer on page i will proceed to page j. Such an eigenvector is unique. The PageRank of a web page is the corresponding component of this unique dominant left eigenvector.
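The box above translates directly into a few lines of code. The following sketch (Python/NumPy) computes PageRank by power iteration on the 'easily bored surfer' chain with a = 0.85; the tiny example graph is made up for illustration and is not the Wikipedia graph of Figure 1.

import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10):
    # Power iteration for the 'easily bored surfer' chain: adj[i][j] = 1 if
    # page i links to page j. With probability alpha the surfer follows a
    # random outgoing link; otherwise (or from a page without outgoing
    # links) it jumps to a uniformly random page.
    a = np.asarray(adj, dtype=float)
    n = a.shape[0]
    out_deg = a.sum(axis=1)
    pi = np.full(n, 1.0 / n)                  # start from the uniform distribution
    while True:
        follow = np.zeros(n)
        for i in range(n):
            if out_deg[i] > 0:
                follow += alpha * pi[i] * a[i] / out_deg[i]
        teleport = 1.0 - follow.sum()         # mass that jumps to a random page
        new_pi = follow + teleport / n
        if np.abs(new_pi - pi).sum() < tol:
            return new_pi
        pi = new_pi

# Tiny made-up graph: pages 0 and 1 both link to page 2; page 2's only link
# points to page 3; page 3 links back to page 0.
links = [[0, 0, 1, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0]]
print(pagerank(links))                        # the stationary distribution

The returned vector is the stationary distribution of the surfer, i.e. the dominant left eigenvector described in the box, normalized to sum to one.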
284 NAW 5/19 nr. 4 december 2018 Mathematics for Big Data Alessandro Di Bucchianico et al.
turned out that the distance (the number
of hops along the edges of the Facebook
graph) between two Facebook users is on
average less than 4!
Healthcare
Medical devices like MRI scanners obtain
large image data at relatively low velocity.
Efforts are undertaken to reduce the time it
takes to make scans (typically thirty min-
utes) since hospitals could obtain higher
efficiency of the expensive MRI equipment
and patients would suffer less from the un-
pleasant high noise levels. Making scans at
a lower resolution is not an option for medical reasons. An MRI scan uses mag-
netic fields to order the spins of hydrogen
atoms and radio waves to disturb these
spins. When the spins return to their origi-
nal position, energy is emitted. This energy
is measured so that one gets an indication
of the amount of tissue. Using magnetic
gradients it is possible to localize these
measurements.
The mathematical bottom line of this
procedure is that MRI scans produce Fouri-
er coefficients one by one. Traditional ap-
proaches to reconstruction algorithms can-
not yield the desired reduction of scanning
time because of the so-called Nyquist–
Shannon criterion. Again, advanced math-
ematical techniques have provided the
breakthrough. The basic idea is to project
Probability theory has been essential in
developing algorithms such as Count-Min
Sketch, MinHash and HyperLogLog that use
random hash functions to store answers.
Such algorithms may be accurate within
2% while using only memory in the order
of the (iterated) logarithm of the original
sample size. An important issue in developing these algorithms is to control the
variance of the estimators, in order to get
consistently accurate estimates.
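For a flavour of how such hash-based summaries work, here is a minimal sketch of a Count-Min Sketch in Python; the class, its width and depth parameters, and the use of SHA-256 as a stand-in for independent hash functions are illustrative choices, not a reference implementation and not code from the article.

import hashlib

class CountMinSketch:
    # Approximate counts of items in a stream using 'depth' rows of 'width'
    # counters, each row indexed by its own hash function. Estimates can
    # only overcount (because of collisions); taking the minimum over the
    # rows keeps that overcount small with high probability.
    def __init__(self, width=2048, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Count page views in a stream without keeping one exact counter per page.
cms = CountMinSketch()
for page in ["a"] * 1000 + ["b"] * 10 + ["c"] * 3:
    cms.add(page)
print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))

The memory footprint is fixed by the table size, independently of how many distinct pages occur in the stream.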
HyperLogLog is one of the most ele-
gant mathematical solutions for counting
distinct objects in Big Data applications,
widely used in practice. Researchers at
Google [15] state that Google's data anal-
ysis system PowerDrill routinely performs
about five million 'count distinct objects'
computations per day. In about one hun-
dred cases, the resulting number is greater
than one billion. In 2014 HyperLogLog was also implemented in the data structure store Redis. A nice human-interest note: the HyperLogLog commands in Redis begin with PF, the initials of the French mathematician Philippe Flajolet, who developed this algorithm (see, e.g., [12]).
Maybe even more exciting from a scien-
tific point of view was the result in [3]
where HyperLogLog was used to accom-
plish an incredible task of computing av-
erage distances in the complete Facebook
graph of more than 700 million nodes. It
described in terms of dynamical systems,
transferring information in time-ordered
observed data to a physical model of the
system. This process is often referred to
as data assimilation. Its development has
been highly influenced by professionals
working in the atmospheric and oceano-
graphic sciences. When discretized in
space, a typical model for numerical weath-
er prediction is a differential equation sys-
tem with dimension of order 10^9 [18]. The
state variable of the dynamical system may
represent unknown quantities such as for
example velocity, temperature and pres-
sure at a grid of locations.
The application of mathematical models
to large dynamic data sets has naturally
popped up in many other communities as
well. Within signal processing recovering
the unknown state of the dynamical system
is known as filtering or smoothing, where
the first term refers to online recovery (as
opposed to static recovery). Probabilists
and statisticians usually speak of state and
parameter estimation. Over the past thirty
years there has been tremendous progress
on this kind of problem. Under specific
assumptions on the dynamical system
computationally efficient methods such as
the (ensemble) Kalman filter can be used.
In more general settings, a Bayesian for-
mulation of the problem and application
of Markov Chain Monte Carlo methods
and Sequential Monte Carlo methods can
be exploited (see, e.g., [21, 22] ). Where-
as these methods are presently not yet
applicable to weather forecasting, they
have proved to be powerful in simpli-
fied problems of less demanding dimen-
sions and constitute a very active area of
research [8, 17].
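For a flavour of the filtering idea, the following sketch (Python/NumPy) runs a plain Kalman filter on a toy two-dimensional linear-Gaussian system, alternating a model forecast with a correction by the incoming observation. The model, noise levels and dimensions are made up for illustration and are of course nowhere near a real weather model, where ensemble variants are used instead.

import numpy as np

rng = np.random.default_rng(1)

dt = 0.1
F = np.array([[1.0, dt], [0.0, 1.0]])     # state transition (position, velocity)
H = np.array([[1.0, 0.0]])                # we only observe the position
Q = 0.01 * np.eye(2)                      # model (process) noise covariance
R = np.array([[0.25]])                    # observation noise covariance

# Simulate a 'truth' trajectory and noisy observations of it.
truth = np.array([0.0, 1.0])
truths, obs = [], []
for _ in range(100):
    truth = F @ truth + rng.multivariate_normal(np.zeros(2), Q)
    truths.append(truth.copy())
    obs.append(H @ truth + rng.multivariate_normal(np.zeros(1), R))

# Kalman filter: alternate a model forecast with a data correction.
x, P = np.array([0.0, 0.0]), np.eye(2)
for y in obs:
    x, P = F @ x, F @ P @ F.T + Q         # forecast step
    S = H @ P @ H.T + R                   # analysis (update) step
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ (y - H @ x)
    P = (np.eye(2) - K @ H) @ P

print("final estimate:", x, "final truth:", truths[-1])

The same forecast-and-correct structure underlies the ensemble Kalman filter and the sequential Monte Carlo methods mentioned above, only with the exact covariance propagation replaced by an ensemble of model runs.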
Web data analytics
Many companies collect large amounts of
customer data through their web services.
However, having these data does not mean
that we already know everything. Even
simple tasks like counting the number of
distinct records in a large customer da-
tabase (e.g., the number of distinct cus-
tomers that use a certain service) require
advanced mathematics. The exact counting
is computationally prohibitive mainly be-
cause we cannot keep all objects in the
restricted working memory of a computer.
However, we might not need that level of
accuracy — in such cases it is often suffi-
cient to work with approximate estimates.
HyperLogLog

Hash functions. Each digital object is converted to a sequence of zeros and ones using a hash function. On a set of different objects a good hash function appears as if randomly generated: zeros and ones each occur with probability 1/2, independently of each other.

Count zeros. The idea of LogLog-type algorithms is to sweep through the objects, keeping in memory only the largest number of zeros seen at the beginning of a hash value. For example, if we observed

00101, 10011, 01010,

we will remember 2, the largest number of leading zeros. Roughly, the probability to see 2 zeros followed by a one at the beginning of a hash value is (1/2)^3 = 1/8, so we conclude that we saw approximately 8 objects!

HyperLogLog. In this form, the estimate is obviously too rough, so it cannot be directly used in practice. A lot of mathematics went into making the result more precise. This includes dividing the hash values over registers, using different corrections for small and large samples, and harmonic averages. All these ideas are included in HyperLogLog, ensuring its applicability. Further improvements are possible; this was, e.g., the goal of the paper [15].

Why LogLog? Assume we have N objects. Then hash values have length of about log N bits (logarithms base 2). Hence, the number of leading zeros is a number between 0 and log N, so we need only about log log N bits of memory to remember this number.
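The crude single-register estimator described in the box can be written down in a few lines. The following Python sketch (using SHA-256 as a stand-in for a good hash function; names and parameters are illustrative) remembers only the maximum number of leading zeros seen and returns 2 raised to that maximum plus one; HyperLogLog itself adds registers, harmonic averaging and bias corrections on top of this idea.

import hashlib

def leading_zeros(bits):
    # Number of zeros before the first one in a bit string.
    return len(bits) - len(bits.lstrip("0"))

def rough_distinct_count(items, bits=32):
    # Single-register LogLog-style estimate: keep only the maximum number
    # of leading zeros among the hashed items and return 2^(max + 1).
    max_zeros = 0
    for item in items:
        h = int(hashlib.sha256(str(item).encode()).hexdigest(), 16)
        bit_string = format(h % (1 << bits), f"0{bits}b")
        max_zeros = max(max_zeros, leading_zeros(bit_string))
    return 2 ** (max_zeros + 1)

# Duplicates do not change the answer: only the set of hash values matters.
data = list(range(10000)) * 5
print(rough_distinct_count(data))   # a rough, high-variance estimate of 10000

Note that the memory used is a single small integer, no matter how long the stream is; the price is the large variance that HyperLogLog's registers are designed to tame.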
to certain traits or treatment effects. Net-
work analysis is getting more and more at-
tention (see, e.g., [20] ) as a means to bring
experimental results into the realm of the
things we already know about the biology
of the system — one of the main challenges
is to combine the different omics data lay-
ers into coherent models that explain the
behaviour of the system under study [14].
Precision farming
Agriculture is rapidly becoming a data-rich
environment, tractors currently being con-
nected to the Internet 24/7 and resembling
computers on (large) wheels rather than
the dusty and primitive muscle-machines
they were in the 20th century. As a result,
new questions can be addressed that were
unthinkable only ten, twenty years ago:
by combining several different information
sources (satellite images, plant growth
models, management data on plot level)
the farmer can, e.g., try to devise optimal
strategies to deliver the right amount of
water and nutrients to his land and in this
way obtain the highest possible yield (see,
e.g., [4, 10] and many others).
Here, the problems are the typical
big-data problems: even assuming one has
access to all databases and knows how to
read and use the data, it is not a trivial
question how to combine data with very
different characteristics, found in different
locations and measured for different pur-
poses. One thing is certain: mathematics
and statistics play a pivotal role.
Genomics and other omics sciences
Now that technology has become avail-
able (and affordable!) to rapidly obtain in-
formation about the genetic composition
of biological samples, huge quantities of
data are generated routinely. This is not
only true when looking at genetic informa-
tion (hence the term genomics) but also
when looking at proteins (proteomics) and
metabolites (metabolomics), to name just
two other members of the 'omics' fami-
ly. The Big Data aspect here refers to the
huge amount of information that we have
on a relatively small number of subjects. A
typical example is genetic information on
humans, animals or plants that consists of
millions of measurements (data points) for
each subject. The resulting 'high-dimen-
sional' data require the development of
new statistical techniques to draw correct
conclusions because traditional statistical
methods for such data lead to an unacceptably high number of false positives
(see, e.g., [7] ).
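One standard remedy, not spelled out in the text above but widely used in omics studies, is to control the false discovery rate rather than the per-test error. The sketch below (Python/NumPy, with simulated p-values; all names and numbers are illustrative) contrasts a naive per-test threshold with the Benjamini-Hochberg procedure.

import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    # Benjamini-Hochberg procedure: control the false discovery rate at
    # level q when testing many hypotheses at once. Returns a boolean mask
    # of rejected ('significant') hypotheses.
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest index meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

# Toy 'omics' setting: 10000 features, only the first 50 truly differ.
rng = np.random.default_rng(0)
p_signal = rng.uniform(size=50) * 1e-4
p_null = rng.uniform(size=9950)
p_all = np.concatenate([p_signal, p_null])

naive = p_all < 0.05        # per-test threshold: about 5% of the 9950 nulls slip through
bh = benjamini_hochberg(p_all, q=0.05)
print("naive rejections:", naive.sum(), " BH rejections:", bh.sum())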
Furthermore, advanced data processing methods are needed to convert the measured data into information; one example is the BLAST algorithm [2] (incidentally also the most highly cited paper of the nineties), used to align sequences of nucleotides or amino acids with database entries. In each case
we are confronted with the issue mentioned
before: we know an awful lot about very few
samples, which makes statistical analysis
extremely hard. Typical questions are find-
ing genes, proteins or metabolites related
the observed data onto a smaller sub-
space using sparsity in the data. Remark-
ably, random projections yield sampling
strategies and reconstruction algorithms
that outperform traditional signal process-
ing techniques. These methods are known
under the name compressed sensing. For
other applications of compressed sensing in healthcare, we refer to https://www.healthcare.siemens.nl/magnetic-resonance-imaging/clinical-specialities/compressed-sensing.
Compressed sensing has been applied
successfully in a wide range of other tasks
as well, including network tomography,
electron microscopy, and facial recognition.
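A toy version of the compressed-sensing idea fits in a few lines. The sketch below (Python/NumPy) uses Gaussian random measurements and a basic iterative soft-thresholding (ISTA) solver rather than the Fourier sampling and specialized solvers used in real MRI reconstruction, so it only illustrates the principle: a sparse vector can be recovered from far fewer linear measurements than its length by l1-regularized least squares.

import numpy as np

rng = np.random.default_rng(3)

n, m, k = 400, 100, 8                          # signal length, measurements, sparsity
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.standard_normal(k)

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
y = A @ x_true                                 # m << n measurements

def ista(A, y, lam=0.01, n_iter=2000):
    # Minimize 0.5*||Ax - y||^2 + lam*||x||_1 by proximal gradient descent.
    L = np.linalg.norm(A, 2) ** 2              # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)
        z = x - grad / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold
    return x

x_hat = ista(A, y)
print("relative error:", np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))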
Recommender systems
Webshops like Amazon analyze the buying
behaviour of their customers and present
visitors of the Amazon website with recom-
mendations of books and other items that
may be of interest. In a similar way Netflix
gives suggestions for movies to its custom-
ers. A way to provide such recommenda-
tions is to set up a matrix of user ratings
of movies (rows are users, columns are movies, and the entries are ratings). Of course, such a matrix has many
empty entries since there are many more
movies (Netflix has around 20,000) than
people can see and rate.
The idea behind recommender systems is that there are relatively
few 'latent' features that drive our pref-
erences (a sparsity principle). That is,
there are a few typical items (books or
movies) and a few typical users. Trans-
lated into matrices, this means looking
for a nonnegative matrix factorization of
the preference matrix. This means that a very large and sparse preference matrix is represented as a product of two matrices of much lower dimension. Although computers keep becoming faster, this is mainly an increase in CPU speed and much less an increase in memory speed. Factorizations of large
matrices, however, require a huge amount
of communication between working mem-
ory and storage memory. There is thus a
need for memory efficient factorization
algorithms that go far beyond traditional
factorization algorithms for singular value
decompositions (see, e.g., [16] for a technical account by the team that won the one million dollar Netflix competition). An
exciting new approach in this field is the
use of randomized methods like stochas-
tic gradient algorithms (see [1]).

Figure 2 Smart agriculture. Photo: Shutterstock, MONOPOLY919
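To make the factorization idea concrete, here is a minimal sketch (Python/NumPy) of plain, unconstrained matrix factorization trained with stochastic gradient descent on the observed ratings only; the toy data, hyperparameters and function name are illustrative, and a nonnegative variant would additionally clip the factors at zero.

import numpy as np

def sgd_matrix_factorization(ratings, n_factors=3, lr=0.01, reg=0.05, n_epochs=200, seed=0):
    # Factor a sparse ratings matrix as U @ V.T with stochastic gradient
    # descent. 'ratings' is a list of (user, item, rating) triples: only
    # observed entries are touched, which is what makes the approach
    # feasible for very large, very sparse preference matrices.
    rng = np.random.default_rng(seed)
    n_users = max(u for u, _, _ in ratings) + 1
    n_items = max(i for _, i, _ in ratings) + 1
    U = 0.1 * rng.standard_normal((n_users, n_factors))
    V = 0.1 * rng.standard_normal((n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * U[u] - reg * V[i])
    return U, V

# Toy data: 4 users, 5 movies, a handful of observed ratings (1-5 scale).
observed = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
            (2, 2, 5), (2, 3, 4), (3, 1, 2), (3, 4, 5)]
U, V = sgd_matrix_factorization(observed)
print("predicted rating of user 0 for movie 2:", U[0] @ V[2])

Because each update touches only one observed rating and two short factor vectors, the method streams through the data with very little memory traffic, which is exactly the bottleneck discussed above.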
Conclusion
Mathematics and statistics, being extremely generic tools, have played an important part in technological and scientific developments over the last centuries, and will continue to do so in this Big Data era. Not only will they contribute to solving problems faster and more efficiently, they will expand our horizon, exposing questions that we never thought about and maybe did not even expect to be solvable. It is important to realize that advances in this area have both a push and a pull component: without being confronted with real-life problems we might lack the incentive or the direction to pursue promising avenues, but without fundamental knowledge we simply lack the tools to tackle the problems successfully. This was expressed in a concise way by Bin Yu in her 2014 Institute of Mathematical Statistics presidential address:

"Work on real problems, relevant theory will follow."

(see http://bulletin.imstat.org/2014/10/ims-presidential-address-let-us-own-data-science). Hence the stress on the applications in this paper: mathematics needs them, just like the applications need mathematics.
References

1 C. C. Aggarwal, Recommender Systems, Springer, 2016.
2 S. Altschul, W. Gish, W. Miller, E. Myers and
D. Lipman, Basic local alignment search
tool, Journal of Molecular Biology 215(3)
(1990), 403–410.
3 L. Backstrom, P. Boldi, M. Rosa, J. Ugander
and S. Vigna, Four degrees of separation,
Proceedings of the 4th Annual ACM Web
Science Conference, 2012, pp. 33–42.
4 J. Behmann, A. Mahlein, T. Rumpf, C. Römer
and L. Plümer, A review of advanced ma-
chine learning methods for the detection of
biotic stress in precision crop protection,
Precision Agriculture 16 (2015), 239–260.
5 P. Berkhin, A Survey on PageRank Computing,
Internet Mathematics 2(1) (2005), 73–120.
6 R. E. Bixby, A brief history of linear and
mixed-integer programming computation,
Documenta Mathematica (2012), 107–121.
7 P. Bühlmann and S. A. Van De Geer, Statistics
for High-dimensional Data: Methods, Theo-
ry and Applications, Springer, 2013.
8 A. Cuzol and E. A. Memin, Stochastic filter-
ing technique for fluid flow velocity fields
tracking, IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence 31(7) (2009),
1278–1293.
9 N. Chen, N. Litvak and M. Olvera Cravioto,
Generalized PageRank on directed configu-
ration networks, Random Structures & Algo-
rithms 51(2) (2017), 237–274.
10 D. E. Clay, S. A. Clay and S. A. Bruggeman,
eds., Practical Mathematics for Precision
Farming, ASA, CSSA and SSSA, 2017.
11 J. Fan, F. Han and H. Liu, Challenges of Big
Data Analysis, National Science Review 1(2)
(2014), 293–314.
12 P. Flajolet, É. Fusy, O. Gandouet and F. Meunier, Hyperloglog: The analysis of a near-op-
timal cardinality estimation algorithm, in
AofA'07: Proceedings of the 2007 Interna-
tional Conference on Analysis of Algorithms,
2007.
13 Frontiers in Massive Data Analysis, National
Academies Press, 2013.
14 R. D. Hawkins, G. C. Hon and B. Ren, Next-
generation genomics: an integrative ap-
proach, Nature Reviews Genetics 11 (2010),
476–486.
15 S. Heule, M. Nunkesser and A. Hall, Hyper-
LogLog in practice: Algorithmic engineering
of a state of the art cardinality estimation
algorithm, Proceedings of the 16th Interna-
tional Conference on Extending Database
Technology, 2013, pp. 683–692.
16 Y. Koren, R. Bell and C. Volinsky, Matrix fac-
torization techniques for recommender sys-
tems, Computer 8 (2009), 42–49.
17 K. J. H. Law and A. M. Stuart, Evaluating data
assimilation algorithms, Monthly Weather
Review 140 (2012), 3757–3782.
18 K. J. H. Law, A. M. Stuart and K. C. Zygalakis,
Data Assimilation. A Mathematical Introduc-
tion, Springer Texts in Applied Mathematics,
Vol. 62, Springer, 2015.
19 B. G. Lindsay, J. Kettenring and D. O. Sieg-
mund, A Report on the Future of Statistics,
Statistical Science 19(3) (2004), 387–413.
20 K. Mitra, A. R. Carvunis, S. K. Ramesh and
T. Ideker, Integrative approaches for finding
modular structure in biological networks,
Nat. Rev. Genet. 14 (2013), 719–732.
21 C. P. Robert and G. Casella, Monte Carlo Sta-
tistical Methods, Springer Texts in Statistics,
Springer, 2004, 2nd edition.
22 S. Särkkä, Bayesian Filtering and Smooth-
ing, Cambridge University Press, 2013.
23 W. Schilders, Introduction to Model Order
Reduction, in Mathematics in Industry Mod-
el Order Reduction: Theory, Research As-
pects and Applications, Springer, 2008, pp.
3–32.