Title: | Clustering of Functional Data via Mixtures of t-Distributions |
---|---|
Description: | Extension of 'funHDDC' (Schmutz et al., 2018) <doi:10.1007/s00180-020-00958-4> to cases with outliers, fitting t-distributions for robust clustering. 'TFunHDDC' can cluster univariate or multivariate functional data produced by the 'fda' package using a B-spline or Fourier basis. |
Authors: | Cristina Anton [aut, cre], Iain Smith [aut], Malcolm Nielsen [aut], Jeffrey Andrews [ctb], Jaymeson Wickins [ctb], Nicholas Boers [ctb], Paul McNicholas [ctb], Amandine Schmutz [ctb], Julien Jacques [ctb], Charles Bouveyron [ctb] |
Maintainer: | Cristina Anton <[email protected]> |
License: | GPL-3 |
Version: | 1.0.1 |
Built: | 2025-02-24 04:18:19 UTC |
Source: | https://github.com/cran/TFunHDDC |
Simulated triangle data produced as an example of mildly contaminated data with behavioural outliers.
fd |
A functional data object representing the fitted triangle data. |
groupd |
Group classifications for each curve as an ordinary behavioural group (1, 2, 3, or 4) or an outlier group (5 for outliers of group 1 and 6 for outliers of group 3). |
Cristina Anton and Iain Smith
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
Extract NOx data from fda.usc
fitNOxBenchmark(nbasis=15)
nbasis |
The number of basis functions used to fit the NOx data. |
Load the NOx data from the poblenou data set of fda.usc, fit the data with the given number of basis functions, and merge the festive-day labels so that each curve is classified as either a weekday or a weekend/festive day.
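As a rough illustration of that relabelling, the following sketch derives the two classes directly from the poblenou metadata; the column names and codings are those documented in fda.usc and are assumptions here, not TFunHDDC code.
library(fda.usc)
data(poblenou)
wk <- as.integer(poblenou$df$day.week)     # Monday = 1, ..., Sunday = 7
festive <- poblenou$df$day.festive == "1"  # festive-day indicator
groupd <- ifelse(festive | wk >= 6, 2, 1)  # 1 = weekday, 2 = weekend/festive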
fd |
A functional data object representing the fitted NOx data. |
groupd |
Group classifications for each curve, indicating whether it represents a weekday or a weekend/festive day. |
Cristina Anton and Iain Smith
- Febrero-Bande M, Galeano P, González-Manteiga W (2008) Outlier detection in functional data by depth measures, with application to identify abnormal NOx levels. Environmetrics 19:331-345. <doi:10.1002/env.878>.
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
# Univariate contaminated data
data1 <- fitNOxBenchmark(15)
plotNOx(data1)
Generate functional data with coefficients distributed according to a finite mixture of contaminated normal distributions, such that for the k-th cluster we have the multivariate contaminated normal distribution with density

f(x; alpha_k, eta_k, mu_k, Sigma_k) = alpha_k * phi(x; mu_k, Sigma_k) + (1 - alpha_k) * phi(x; mu_k, eta_k * Sigma_k),

where alpha_k represents the proportion of uncontaminated data, eta_k > 1 is the inflation coefficient due to outliers, and phi(x; mu_k, Sigma_k) is the density of the multivariate normal distribution N(mu_k, Sigma_k).
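As a sanity check of the formula, the density can be written out directly; this is a sketch assuming the mvtnorm package, and dcontaminated is an illustrative name, not part of TFunHDDC.
library(mvtnorm)  # assumed, for the multivariate normal density phi
dcontaminated <- function(x, mu, Sigma, alpha, eta) {
  # alpha: proportion of uncontaminated data; eta > 1: inflation coefficient
  alpha * dmvnorm(x, mean = mu, sigma = Sigma) +
    (1 - alpha) * dmvnorm(x, mean = mu, sigma = eta * Sigma)
}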
genModelFD(ncurves=1000, nsplines=35, alpha=c(0.9,0.9,0.9), eta=c(10, 5, 15))
ncurves |
The number of curves total for the simulation. |
nsplines |
The number of basis functions used to represent the simulated curves. |
alpha |
The proportion of uncontaminated data in each group. |
eta |
The inflation coefficient that measures the increase in variability due to the outliers. |
The data are generated from the model described in Anton and Smith (see References). The number of clusters is fixed to K = 3 and the mixing proportions are equal (pi_1 = pi_2 = pi_3 = 1/3). Each group k is characterized by its intrinsic dimension d_k, its mean vector mu_k of size 70, the value a_k of the first d_k diagonal elements of the group covariance matrix, and the value b_k of the last 70 - d_k elements; the group-specific numerical values are those used in the simulation study of Anton and Smith. Curves are smoothed using 35 Fourier basis functions.
fd |
A functional data object representing the simulated data. |
groupd |
Group classifications for each curve. |
Cristina Anton and Iain Smith
- Amovin-Assagba M, Gannaz I, Jacques J (2022) Outlier detection in multivariate
functional data through a contaminated mixture model. Comput Stat
Data Anal 174.
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
# Univariate contaminated data
data <- genModelFD(ncurves=300, nsplines=35, alpha=c(0.9,0.9,0.9),
                   eta=c(10, 7, 17))
plot(data$fd, col = data$groupd)
clm <- data$groupd
Generate contaminated triangle data. Groups 1, 2, 3, and 4 are separable over the two dimensions of functional data. Groups 5 and 6 contain the contaminated curves of groups 1 and 3 respectively.
genTriangles()
Groups 1 through 4 are generated from two-dimensional curves built on the triangle-shaped mean functions H_1 and H_2; the contaminated curves in groups 1 and 3 (labelled 5 and 6) follow the same group models with modified coefficients and noise (the exact group-specific equations are those of the simulation study in Anton and Smith, see References). Here H_1(t) = (6 - |t - 7|)_+ and H_2(t) = (6 - |t - 15|)_+, with (.)_+ representing the positive part and t in [1, 21]; U is a uniform random variable, and the noise terms epsilon(t) are mutually independent white noises, independent of U. We simulate 100 curves for each group, with groups 1 and 3 consisting of 80 ordinary curves and 20 contaminated curves. Curves are smoothed using a basis of 25 cubic B-splines.
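For reference, a minimal sketch of the two triangle-shaped mean functions as reconstructed above (an illustration, not package code):
H1 <- function(t) pmax(6 - abs(t - 7), 0)   # peak at t = 7
H2 <- function(t) pmax(6 - abs(t - 15), 0)  # peak at t = 15
tgrid <- seq(1, 21, length.out = 201)
matplot(tgrid, cbind(H1(tgrid), H2(tgrid)), type = "l",
        xlab = "t", ylab = "H(t)")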
fd |
List of functional data objects representing the two dimensions of triangle data. |
groupd |
Group classification for each curve. |
Cristina Anton and Iain Smith
- C. Bouveyron and J. Jacques (2011) Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification 5(4):281-300. <doi:10.1007/s11634-011-0095-6>
- Schmutz A, Jacques J, Bouveyron C, et al (2020) Clustering multivariate functional data in group-specific functional subspaces. Comput Stat 35:1101-1131
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
# Multivariate contaminated triangles
conTrig <- genTriangles()
cls <- conTrig$groupd
plotTriangles(conTrig)
Plot data returned by fitNOxBenchmark
as lines coloured according to the assigned clusters.
plotNOx(fdn)
plotNOx(fdn)
fdn |
List with an element fd containing the functional data and an element groupd with the cluster assignments, as returned by fitNOxBenchmark. |
No return value, used for side effects.
Cristina Anton and Iain Smith
# Univariate contaminated data
data1 <- fitNOxBenchmark(15)
plotNOx(data1)
Plot data returned by genTriangles
as lines coloured according to the assigned clusters.
plotTriangles(fdt)
fdt |
List with an element fd containing the functional data and an element groupd with the group classifications, as returned by genTriangles. |
No return value, used for side effects.
Cristina Anton and Iain Smith
conTrig <- genTriangles()
plotTriangles(conTrig)
Provides the matrix of classification probabilities and the classification vector for input observations, assuming the model provided by the tfunHDDC object.
## S3 method for class 'tfunHDDC' predict(object, data=NULL, ...)
object |
An object of class tfunHDDC. |
data |
Data frame (univariate functional data) or list (multivariate functional data) of new observations on the same variables used in fitting the tfunHDDC object. |
... |
Arguments to be passed to other functions. |
t |
Matrix of classification probabilities |
class |
Vector of maximum a posteriori classifications |
Cristina Anton, Iain Smith
- Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions: the tEIGEN family. Stat Comput 22:1021-1029. <doi:10.1007/s11222-011-9272-x>
- Andrews JL, Wickins JR, Boers NM, et al (2018) teigen: An R package for model-based clustering and classification via the multivariate t distribution. Journal of Statistical Software 83(7):1-32.
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
set.seed(1027)
# Simulated univariate data
data <- genModelFD(ncurves=300, nsplines=35, alpha=c(0.9,0.9,0.9),
                   eta=c(10, 7, 17))
plot(data$fd, col = data$groupd)
clm <- data$groupd
model1 <- c("AkjBkQkDk", "AkjBQkDk", "AkBkQkDk", "ABkQkDk", "AkBQkDk",
            "ABQkDk")
#################### classification example with predictions
training <- c(1:50, 101:150, 201:250)
test <- c(51:100, 151:200, 251:300)
known1 <- clm[training]
t4 <- tfunHDDC(data$fd[training], K=3, threshold=0.2, init="kmeans",
               nb.rep=1, dfconstr="no", dfupdate="numeric", model=model1[1],
               known=known1, itermax=10)
if (!is.null(t4$class)) {
  table(clm[training], t4$class)
  p1 <- predict.tfunHDDC(t4, data$fd[test])
  if (!is.null(p1$class)) table(clm[test], p1$class)
}
########################### NOx data
data1 <- fitNOxBenchmark(15)
plotNOx(data1)
### example for prediction
training <- c(1:50)
test <- c(51:115)
known1 <- data1$groupd[training]
t1 <- tfunHDDC(data1$fd[training], K=2, threshold=0.6, init="kmeans",
               nb.rep=10, dfconstr="no",
               model=c("AkjBkQkDk", "AkjBQkDk", "AkBkQkDk", "ABkQkDk",
                       "AkBQkDk", "ABQkDk"), known=known1)
if (!is.null(t1$class)) {
  table(data1$groupd[training], t1$class)
  p1 <- predict.tfunHDDC(t1, data1$fd[test])
  if (!is.null(p1$class)) table(data1$groupd[test], p1$class)
}
tfunHDDC is an adaptation of funHDDC (Schmutz et al., 2018) that uses t-distributions for robust clustering in the presence of outliers.
tfunHDDC(data, K=1:10, model="AkjBkQkDk", known=NULL, dfstart=50,
         dfupdate="approx", dfconstr="no", threshold=0.1, itermax=200,
         eps=1e-6, init="random", criterion="bic", d_select="Cattell",
         init.vector=NULL, show=TRUE, mini.nb=c(5, 10), min.individuals=2,
         mc.cores=1, nb.rep=2, keepAllRes=TRUE, kmeans.control=list(),
         d_max=100, d_range=2, verbose=TRUE)
data |
In the univariate case: a functional data object produced by the fda package. In the multivariate case: a list of functional data objects. |
K |
The number of clusters, or a range of numbers of clusters to try, for example K=2:10. |
dfstart |
The degrees of freedom (df) at which the t-distributions are initialized. |
dfupdate |
Either "numeric", or "approx". The default is "approx" indicating a closed form approximation be used. Alternatively, "numeric" can be specified which makes use of |
dfconstr |
"yes" when df (degrees of freedom) for the t-distribution should be the same between all clusters; "no" when df may be different between clusters. |
model |
The chosen model among 'AkjBkQkDk', 'AkjBQkDk', 'AkBkQkDk', 'ABkQkDk', 'AkBQkDk', and 'ABQkDk'; 'AkjBkQkDk' is the default. Multiple models can be tested at the same time by passing a vector, for example c("AkjBkQkDk","AkjBQkDk"). |
threshold |
The threshold of Cattell's scree test used for selecting the group-specific intrinsic dimensions. |
known |
A vector of known classifications, numeric or character, with NA for observations whose classification is unknown; used for semi-supervised clustering. |
itermax |
The maximum number of iterations. |
eps |
The threshold of the convergence criterion. |
init |
A character string giving the way the EM algorithm is initialized. There are six ways of initialization: “kmeans”, “param”, “random” (the default in the usage above), “mini-em”, “vector”, and "tkmeans". See details for more information. The algorithm can also be initialized directly with a vector containing the prior classes of the observations; see init.vector. |
criterion |
The criterion used for model selection: bic (default) or icl. |
d_select |
“Cattell” (default), “BIC”, or "grid". Selects the method used to choose the intrinsic dimension of each subgroup. "grid" selects d based on the criterion value after running each combination of d1, d2, ..., dK for the groups, with the candidate d values for each group taken from d_range. "grid" only works for a single value of K (not a list). See details for more information. |
init.vector |
A vector of integers or factors giving a user-supplied initialization. It should be of the same length as the data. Only used when init="vector". |
show |
Use show = FALSE to turn off the information printed as the algorithm runs. |
mini.nb |
A vector of integers of length two, used by the “mini-em” initialization. The first integer sets how many times the algorithm is repeated; the second sets the maximum number of iterations each time. For example, if init=“mini-em” and mini.nb=c(5,10), the algorithm will be launched 5 times, running 10 iterations each time; the algorithm then begins from the initialization that maximizes the log-likelihood. |
min.individuals |
This parameter controls the minimum population of a class. If the population of a class becomes strictly smaller than min.individuals, the algorithm stops with the message 'pop<min.indiv.'. Here the "population of a class" is the sum of its posterior probabilities. The value of min.individuals cannot be lower than 2. |
mc.cores |
Positive integer, default is 1. If mc.cores>1, then parallel computing is used, using mc.cores cores. Warning for Windows users only: the parallel computing can sometimes be slower than using one single core (due to how parLapply works). |
nb.rep |
A positive integer (default is 1 for kmeans initialization and 20 for random initialization). Each estimation (i.e. combination of (model, K, threshold)) is repeated nb.rep times and only the estimation with the highest log-likelihood is kept. |
keepAllRes |
Logical. Should the results of all runs be kept? If so, an argument all_results is created in the results. Default is TRUE. |
kmeans.control |
A list. The elements of this list should match the parameters of the kmeans function. Used when init="kmeans" or init="tkmeans". |
d_max |
A positive integer, default 100. The maximum number of dimensions to be computed: the intrinsic dimension of any cluster cannot be larger than d_max. This greatly speeds up the algorithm for datasets with a large number of variables (e.g. thousands). |
d_range |
Vector of values to use for the intrinsic dimension for each group when d_select="grid". |
verbose |
Whether to print progress and approximate timing information as tfunHDDC executes. TRUE (the default when running serially) or FALSE (the default when running in parallel). |
If init="random" is chosen, the algorithm is run 20 times with the same model options and the solution which maximizes the log-likelihood is kept. This explains why this initialization sometimes runs a bit slower than "kmeans" initialization.
If the warning "In tfunHDDC(...) : All models diverged" is printed, the algorithm found fewer classes than the chosen number K. Because the EM algorithm is used, this may be due to a bad initialization, so restart the algorithm several times to check whether a new initialization lets the model converge, or whether there is simply no solution for the chosen K, as in the sketch below.
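A sketch of one way to react to this warning, rerunning with more random restarts (the object names are illustrative, following the examples below):
t_retry <- tfunHDDC(data$fd, K = 3, init = "random", nb.rep = 20,
                    dfconstr = "no", model = "AkjBkQkDk")
if (is.null(t_retry$class)) message("still diverging; consider a smaller K")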
The different initializations are:
“mini-em”: an initialization strategy in which the classes are randomly initialized and the EM algorithm is run for a few iterations. This is repeated several times (the default is 5 repetitions of 10 iterations each). The initialization kept is the one which maximizes the log-likelihood (see mini.nb for more information about its parameters).
“random”: the classes are randomly assigned using a multinomial distribution.
“kmeans”: the classes are initialized using the kmeans function (with algorithm="Hartigan-Wong", nstart=4, iter.max=50); the user can supply their own arguments for kmeans via kmeans.control.
“tkmeans”: the classes are initialized using the tkmeans function (with the same defaults as for kmeans); the user can likewise supply their own arguments via kmeans.control.
A prior class "vector": the algorithm can also be initialized directly with a vector containing the prior classes of the observations. To do so, use init="vector" and provide the vector in the argument init.vector; a sketch follows this list.
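A minimal sketch of the "vector" initialization, assuming labels in 1..K as produced by genModelFD:
data <- genModelFD(ncurves = 300, nsplines = 35,
                   alpha = c(0.9, 0.9, 0.9), eta = c(10, 7, 17))
t_vec <- tfunHDDC(data$fd, K = 3, init = "vector", init.vector = data$groupd,
                  dfconstr = "no", model = "AkjBkQkDk")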
Note that the BIC criterion used in this function is to be maximized and is defined as 2*LL-k*log(n) where LL is the log-likelihood, k is the number of parameters and n is the number of observations.
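That definition can be recomputed from the returned fields documented below (a sketch; fit and n are placeholders for a fitted object and its number of curves):
recompute_bic <- function(fit, n) 2 * fit$loglik - fit$complexity * log(n)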
There are three methods for selecting the intrinsic dimension using d_select:
"Cattell": Runs a Cattell's scree test to approximate the intrinsic dimension that yields the greatest improvement in clustering.
"BIC": At each iteration we tests each value for each group's intrinsic dimension and sets the intrinsic dimension that yields the best BIC.
"grid": Runs every combination of hyperparameters (eg. K=2, threshold = 0.05, model = ...) for every combination of intrinsic dimensions that can be set with the given d_range (with K = 2 and d_range = c(2, 10) it would set (2,2), (2, 10), (10, 2), and (10, 10)). Due to the sharp increase in test cases it is recommended that this mode is run in parallel if possible. Doing an intial short run to approximate the timing with verbose = TRUE is suggested as well.
d |
The number of dimensions for each cluster. |
a |
Values of parameter a for each cluster. |
b |
Values of parameter b for each cluster. |
mu |
The mean of each cluster in the original space. |
prop |
The proportion of individuals in each cluster. |
loglik |
The maximum of the log-likelihood. |
loglik_all |
The log-likelihood at each iteration. |
posterior |
The posterior probability for each individual to belong to each cluster. |
class |
The clustering partition. |
BIC |
The BIC value. |
ICL |
The ICL value. |
complexity |
The number of estimated parameters. |
all_results |
If multiple numbers of clusters or models are considered, the results for each model are stored here. |
nux |
Values for the degrees of freedom of the t-distributions for each group. |
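A sketch of inspecting these fields after a multi-model run (keepAllRes=TRUE); that each element of all_results carries the same fields is an assumption here:
fit <- tfunHDDC(data$fd, K = 2:4, model = "AkjBkQkDk", dfconstr = "no")
fit$BIC                                       # criterion of the retained model
# sapply(fit$all_results, function(r) r$BIC)  # per-run criteria (assumed structure)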
Cristina Anton, Iain Smith, and Malcolm Nielsen
- Andrews JL, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate t-distributions: the tEIGEN family. Statistics and Computing 22(5):1021-1029.
- Andrews JL, McNicholas PD, and Subedi S (2011) “Model-based classification via mixtures of multivariate t-distributions” Computational Statistics and Data Analysis 55, 520–529.
- C. Bouveyron and J. Jacques (2011) Model-based clustering of time series in group-specific functional subspaces. Advances in Data Analysis and Classification 5(4):281-300. <doi:10.1007/s11634-011-0095-6>
- Schmutz A, Jacques J, Bouveyron C, et al (2020) Clustering multivariate functional data in group-specific functional subspaces. Comput Stat 35:1101-1131
- Cristina Anton, Iain Smith. Model-based clustering of functional data via mixtures of t-distributions. Advances in Data Analysis and Classification (to appear).
teigen, kmeans, tkmeans, predict.tfunHDDC
set.seed(1027)
# Simulated univariate data
data <- genModelFD(ncurves=300, nsplines=35, alpha=c(0.9,0.9,0.9),
                   eta=c(10, 7, 17))
plot(data$fd, col = data$groupd)
clm <- data$groupd
model1 <- c("AkjBkQkDk", "AkjBQkDk", "AkBkQkDk", "ABkQkDk", "AkBQkDk",
            "ABQkDk")
t1 <- tfunHDDC(data$fd, K=3, threshold=0.2, init="kmeans", nb.rep=1,
               dfconstr="no", dfupdate="numeric", model=model1[1],
               itermax=10)
if (!is.null(t1$class)) table(clm, t1$class)
############### example when some classifications are known
known1 <- rep(NA, 300)
known1[1] <- clm[1]
known1[103] <- clm[103]
known1[250] <- clm[250]
t2 <- tfunHDDC(data$fd, K=3, threshold=0.2, init="kmeans", nb.rep=1,
               dfconstr="no", dfupdate="numeric", model=model1[1],
               known=known1, itermax=10)
if (!is.null(t2$class)) table(clm, t2$class)
####### example when some classifications are known
known1 <- rep(NA, 300)
known1[1:100] <- 3
t3 <- tfunHDDC(data$fd, K=3, threshold=0.2, init="kmeans", nb.rep=1,
               dfconstr="no", dfupdate="numeric", model=model1[1],
               known=known1, itermax=10)
if (!is.null(t3$class)) table(clm, t3$class)
############################ multivariate simulated data
set.seed(1027)
conTrig <- genTriangles()
cls <- conTrig$groupd  # groups 5 and 6 (contaminated) fold into 1 and 3 respectively
res_s <- tfunHDDC(conTrig$fd, K=4, dfconstr="no", dfupdate="numeric",
                  model="ABKQKDK", init="kmeans", threshold=0.2, nb.rep=1,
                  itermax=10)
if (!is.null(res_s$class)) table(cls, res_s$class)