Article

k-Center Clustering with Outliers in Sliding Windows

Department of Information Engineering, University of Padova, Via Gradenigo 6/B, 35131 Padova, Italy
* Author to whom correspondence should be addressed.
Algorithms 2022, 15(2), 52; https://doi.org/10.3390/a15020052
Submission received: 11 January 2022 / Revised: 28 January 2022 / Accepted: 29 January 2022 / Published: 31 January 2022
(This article belongs to the Special Issue Discrete Optimization Theory, Algorithms, and Applications)

Abstract:
Metric k-center clustering is a fundamental unsupervised learning primitive. Although widely used, this primitive is heavily affected by noise in the data, so a more sensible variant seeks the best solution that disregards a given number z of points of the dataset, which are called outliers. We provide efficient algorithms for this important variant in the streaming model under the sliding window setting, where, at each time step, the dataset to be clustered is the window W of the most recent data items. For general metric spaces, our algorithms achieve O(1) approximation and, remarkably, require a working memory linear in k+z and only logarithmic in |W|. For spaces of bounded doubling dimension, the approximation can be made arbitrarily close to 3. For these latter spaces, we show, as a by-product, how to estimate the effective diameter of the window W, which is a measure of the spread of the window points disregarding a given fraction of noisy distances. We also provide experimental evidence of the practical viability of the improved clustering and diameter estimation algorithms.

1. Introduction

In a number of modern scenarios (e.g., social network analysis, online finance, online transaction processing, etc.), data are produced as a continuous stream, and at such a high rate that on-the-fly processing can only afford to maintain a small portion of the data in memory, together with a limited amount of working space. This computational scenario is captured by the well-established streaming model [1]. The sliding window setting [2,3] introduces the additional desirable constraint that the input for the problem of interest consists of the window W of the most recent data items, whereas older data are considered “stale” and disregarded by the computation.
The k-center clustering problem (k-center, for short) is a fundamental unsupervised learning primitive with ubiquitous applications [4,5,6]. Given a set W of points from a metric space, the k-center problem requires determining a subset C ⊆ W of k centers that minimizes the maximum distance of any point of W from its closest center. However, since the objective function involves a maximum, the solution is at risk of being severely influenced by a few “distant” points, which are called outliers. In fact, the presence of outliers is inherent in many large datasets, since these points are often artifacts of data collection, either representing noisy measurements or simply erroneous information. To cope with this limitation, the k-center problem admits a heavily studied robust formulation that takes outliers into account [7,8,9]: when computing the objective function for a set of k centers, the z largest distances from the centers are discarded, where z < |W| is an additional input parameter representing the tolerable level of noise. This formulation is known as the k-center problem with z outliers. We remark that the values of k and z are provided by the user and are typically chosen using either domain knowledge or some internal quality measure (see, e.g., [10,11]).
The decision versions of the k-center problem and its variant with z outliers are NP-complete [12]; hence, only approximate solutions may be returned within reasonable time bounds. In this paper, we present approximation algorithms for the k-center problem with z outliers in the sliding window setting. Moreover, as a by-product, we also derive an algorithm to estimate the α-effective diameter of the window, which is an interesting measure of the spread of a noisy dataset.

1.1. Related Work

In the sequential setting, the k-center problem (without outliers) admits simple 2-approximation algorithms [12,13]. Recently, a sequential, randomized, fully dynamic (i.e., admitting arbitrary insertions and deletions of points) (2+ε)-approximation algorithm for the problem was presented in [14]. This algorithm features update times linear in k and 1/ε, and polylogarithmic in the aspect ratio Δ = d_max/d_min of the pointset, that is, the ratio between the maximum distance d_max and the minimum distance d_min of any two points in the set. For the k-center problem with z outliers, a simple, fully combinatorial 3-approximation algorithm running in O(k|W|² log(|W|)) time was devised in [7]. In [8,9], an adaptation of this latter algorithm was proposed for weighted pointsets (where z represents the aggregate weight of the outliers). Recently, sequential 2-approximation algorithms were developed in [15,16] that, however, are based on a more complex LP-based approach that is less amenable to practical implementation. A bi-criteria randomized algorithm, which is only suitable for (small) constant k, that returns a 2-approximation as long as a slightly larger number of outliers is excluded from the objective function, was presented in [17].
For the classical streaming setting, which seeks, at any step, a solution for the full stream seen so far, a (4+ε)-approximation algorithm for the k-center problem with z outliers was given in [18]. The approximation was later improved to 3+ε in [9]. Whereas the former algorithm requires O(kz/ε) working memory space, the latter requires space O((k+z)(c/ε)^D), where c is a constant and D is the doubling dimension of the input stream, a widely used generalization of the notion of Euclidean dimension to arbitrary metrics [19,20], which is formally defined in Section 2.2.
For the stricter sliding window setting, the authors of [21] devised an algorithm able to compute a (6+ε)-approximation to the k-center problem for the current window, while keeping O(k log(Δ)/ε) points stored in the working memory. Based on the same techniques, the authors also developed a (3+ε)-approximation algorithm for the diameter of the current window. In [22], we presented a (2+ε)-approximation algorithm for the k-center problem in the sliding window setting, where the improved approximation is obtained at the expense of a blow-up of a factor O((c/ε)^D) in the working memory, with c constant.
Concerning the k-center problem with z outliers in the sliding window setting, the only known algorithm was devised very recently in [23]. At every time step, the algorithm maintains an ε-coreset for the problem on the current window, namely, a subset of the window points that, if used as input of any c-approximation sequential algorithm for k-center with z outliers, yields a (c+ε)-approximate solution to the problem on the entire window. The algorithm requires the knowledge of the aspect ratio of the stream, and uses O(log(Δ)·kz·(1/ε)^D) working memory. Moreover, since it relies on the execution of a sequential algorithm for k-center with z outliers on the coreset every time a new point is added, it has an O((log(Δ)·kz·(1/ε)^D)³) update time per point, which makes it very impractical for streams with high arrival rates and when the values of k and z are not too small. In [23], the authors also prove that any sliding window algorithm for k-center with z outliers featuring a (1+ε) approximation ratio must use Ω(log(Δ)·kz/ε) working memory.
In [24,25], sliding window algorithms have been proposed for the k-median and k-means clustering problems, whose objective is to minimize the average distance and squared distance of all window points from the closest centers, respectively. For distributed solutions to the k-center problem (with and without outliers) targeting the volume rather than the velocity of the data, see [6,9] and the references therein.
The notion of the α-effective diameter was introduced in [26] in the context of graph analytics to characterize the growth rate of the neighborhood function, but it naturally extends to general metrics, providing a robust substitute of the diameter in the presence of noise. For α ∈ (0,1), the α-effective diameter of a metric dataset W is the minimum threshold such that the distances of at least α|W|² pairs of window points fall below the threshold.

1.2. Our Contribution

We present approximation algorithms for the k-center problem with z outliers in the sliding window setting, which feature constant approximation ratios, small working memory requirements, and fast update times. As is customary for clustering in the big data realm, our algorithms hinge on the maintenance of a coreset, that is, a small subset of representative points of the current window from which an accurate solution can be extracted. Crucial to the effectiveness of our approach is the introduction of weights for coreset points, where the weight of a point is (an estimate of) the number of window points it represents. To efficiently maintain the weights, we employ a succinct data structure inspired by the smooth histograms of [27], which enables considerable space savings if some slackness in the count of outliers is permitted. As an interesting by-product, we also devise an algorithm that employs our coresets to approximate the α-effective diameter of the current window W.
The analysis of our algorithms is carried out as a function of two design parameters, δ and λ, which control, respectively, the level of accuracy and the slackness in the count of outliers, and as a function of a number of characteristics of the stream S, namely, the values d_min and d_max, representing the minimum and maximum distance between two distinct points of S, and the doubling dimension D of S.
Our main results are listed below.
  • A sliding window algorithm for general metric spaces that, at any time, is able to return a set of centers covering all but at most (1+λ)z points of the current window W, within a radius that is an O(1) factor larger than the optimal radius for z outliers. The algorithm requires a working memory of size O(log(d_max/d_min)·(k+z)·log_{1+λ}(|W|)) and processes each point in time linear in the working memory size. By setting λ = 1/(2z), the number of uncovered points becomes, at most, z;
  • An improved algorithm with the same coverage guarantee as above, featuring a radius that is only a factor (3+O(δ)) larger than the optimal radius, at the expense of an extra O((c/δ)^D) factor in both the working memory size and update time, for a suitable constant c;
  • A sliding window algorithm for streams of bounded doubling dimension that, starting from a (possibly crude) lower bound on the ratio between the α-effective and the full diameter of the window W, returns lower and upper bounds to the α-effective diameter of W. The algorithm features accuracy–space tradeoffs akin to those of the improved algorithm for 1-center with z = 0 outliers;
  • Experimental evidence that both the improved k-center algorithm and the effective diameter algorithm feature good performance and provide accurate solutions.
It is important to remark that our algorithms are fully oblivious to the metric parameters d_min, d_max, and D, in the sense that the actual values of these parameters only influence the analysis but are not needed for the algorithms to run. This is a very desirable feature, since, in practice, these values are difficult to estimate.
Compared to the algorithm of [23] for k-center with z outliers, our algorithms feature a considerably lower update time, which makes them practically viable. Moreover, in the case of noisy streams for which z = Ω(log|W|), our algorithms require considerably less working memory, as long as some slackness in the number of outliers can be tolerated. Finally, whereas the algorithm of [23] requires the knowledge of d_min and d_max, our algorithms are oblivious to these values.

1.3. Organization of the Paper

The rest of the paper is structured as follows. Section 2 provides preliminary definitions. Section 3 and Section 4 present, respectively, the algorithms for the k-center problem with z outliers and for the effective diameter. Section 5 reports on the experimental results. Section 6 concludes the paper with some final remarks and pointers to relevant open problems.

2. Preliminaries

Consider a (possibly unbounded) stream S of points from some metric space with distance function dist(·,·). At any time t, let W denote the set of the last N = |W| points that have arrived, for a fixed window length N. In the sliding window model, for a given computational problem, we aim to develop algorithms that, at any time t, are able to solve the instance represented by the current window W, using working memory considerably smaller than N (possibly constant or logarithmic in N).

2.1. Definition of the Problems

For any point p ∈ W and any subset C ⊆ W, we use the notation dist(p,C) = min_{q∈C} dist(p,q), and define the radius of C with respect to W as

r_C(W) = max_{p∈W} dist(p,C).
For a positive integer k < |W|, the k-center problem requires finding a subset C ⊆ W of k centers that minimizes r_C(W). For a given W and k, we denote the radius of the optimal solution of this problem by r_k(W). Given any radius value r, a subset C ⊆ W with r_C(W) ≤ 2r can be incrementally built using the greedy strategy of [13]: starting from an arbitrary center, a new center selected among the points of W at a distance > 2r from the current centers is iteratively added to C until all points of W are at a distance of at most 2r from C. An easy argument shows that, if r ≥ r_k(W), the set C obtained in this fashion has a size of, at most, k. By combining this strategy with a suitable guessing protocol, a 2-approximate solution to the k-center problem for W is obtained.
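For concreteness, the following minimal Python sketch implements the greedy covering step for a fixed radius guess r; the names greedy_centers and dist are ours, not taken from the paper.

import math

def dist(p, q):
    """Euclidean distance between points given as coordinate tuples;
    any other metric can be plugged in here."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def greedy_centers(W, r):
    """Greedy strategy of [13] for a fixed radius guess r: scan W and
    promote to center every point farther than 2r from all current
    centers. If r >= r_k(W), at most k centers are returned, and every
    point of W ends up within distance 2r of some center."""
    centers = []
    for p in W:
        if all(dist(p, c) > 2 * r for c in centers):
            centers.append(p)
    return centers

Running this routine over a geometric sequence of guesses r and keeping the smallest guess that yields at most k centers gives the 2-approximation mentioned above.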
Note that any subset C ⊆ W induces a partition of W into |C| clusters by assigning each point to its closest center (with ties broken arbitrarily).
In this paper, we focus on the following important extension of the k-center problem. For positive k, z < |W|, the k-center problem with z outliers requires finding a subset C ⊆ W of size k minimizing r_C(W ∖ Z_C), where Z_C is the set of the z points in W with the largest distances from C, which are regarded as outliers to be discarded from the clustering. We denote the radius of the optimal solution of this problem by r_{k,z}(W). Observe that the k-center problem with z outliers reduces to the k-center problem for z = 0. In addition, it is straightforward to argue that the optimal solution of the k-center problem (without outliers) with k+z centers has a radius not larger than that of the optimal solution of the problem with k centers and z outliers, that is,

r_{k+z}(W) ≤ r_{k,z}(W).
In a more general formulation of the k-center problem with z outliers, each point p ∈ W carries a positive integer weight w(p), and the desired set C of k centers must minimize r_C(W ∖ Z_C), where Z_C is the set of maximum cardinality that consists of points with the largest distances from C and has an aggregate weight of, at most, z. We will refer to this weighted formulation as k-center with z weighted outliers.
The algorithms presented in this paper for k-center with z outliers crucially rely on the extraction of a succinct coreset T from the (possibly large) input W, so that a solution to the problem can be efficiently computed by running a sequential algorithm on T rather than on W. The quality of a coreset T is captured by the following definition.
Definition 1.
Given a pointset W and a value ε > 0, a subset T ⊆ W is an ε-coreset for W w.r.t. the k-center problem with z outliers if max_{p∈W} dist(p,T) ≤ ε·r_{k,z}(W).
An ε-coreset T of W ensures that each point in W is “represented” by a close enough point in T, where closeness is defined w.r.t. ε and r_{k,z}(W). In fact, our algorithms will make use of weighted coresets, where, additionally, each coreset point p ∈ T features a weight that is (an approximation of) the number of points of W represented by p.
An important characteristic of a pointset W is its diameter, defined as Δ_W = max_{p,q∈W} dist(p,q), which can be computed exactly in quadratic time. The diameter is very sensitive to noise in the dataset and, in the presence of outliers, its value might turn out to be scarcely representative of most pairwise distances in W. Thus, the more robust notion of the effective diameter was introduced in [26]. Let d_{1,W}, d_{2,W}, … be an enumeration of the |W|² distances between all pairs of points of W in non-decreasing order. For a given parameter α ∈ (0,1), the α-effective diameter of W is defined as Δ_W^α = d_{⌈α|W|²⌉,W}, namely, the smallest value such that at least α|W|² pairs of points in W are within distance Δ_W^α.
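On small inputs, the α-effective diameter can be computed directly from this definition, as in the following illustrative sketch (quadratic time and space, so viable only offline; it reuses the dist helper from the earlier sketch and assumes α|W|² ≥ 1):

import math

def effective_diameter(W, alpha):
    """alpha-effective diameter of W: the smallest d such that at least
    ceil(alpha * |W|^2) of the |W|^2 ordered point pairs (p, q),
    self-pairs included, are within distance d of each other."""
    d = sorted(dist(p, q) for p in W for q in W)  # the |W|^2 distances
    return d[math.ceil(alpha * len(W) ** 2) - 1]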

2.2. Doubling Dimension

The analysis of our algorithms will be carried out as a function of a number of relevant parameters, including the dimensionality of the data. To deal with arbitrary metric spaces, we resort to the following well-established general notion of dimensionality. For any x ∈ W and r > 0, the ball of radius r centered at x, denoted as B(x,r), is the subset of all points of W at a distance of, at most, r from x. The doubling dimension of W is the minimum value D such that, for all x ∈ W, any ball B(x,r) is contained in the union of, at most, 2^D balls of radius r/2 centered at points of W. The notion of doubling dimension has been used extensively in previous works (see [22,28] and the references therein).

3. k-Center with z Outliers

Let S be a (possibly unbounded) stream of points from some metric space, and let N be the selected window length. For any point p ∈ S, its Time-To-Live TTL(p) is N when p arrives, and it decreases by 1 at each subsequent step. We say that p is active when TTL(p) > 0, and that it expires when TTL(p) becomes 0. For convenience, the analysis will also consider expired points with negative TTLs. At any time t, the current window W consists of all arrived points with positive TTL; hence, |W| = N.
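In code, this bookkeeping reduces to simple arithmetic on arrival times; the following small helper (our own, for illustration) mirrors the definition above.

def ttl(arrival_time, t, N):
    """Time-to-live at time t of a point that arrived at arrival_time:
    N upon arrival (t == arrival_time), 0 when the point expires at
    time arrival_time + N, negative afterwards."""
    return N - (t - arrival_time)

In particular, a point belongs to the current window exactly when ttl(arrival_time, t, N) > 0, i.e., when it arrived after time t − N.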
In this section, we present coreset-based algorithms that, at any time t, are able to return accurate approximate solutions to the k-center problem with z outliers for W. The section is structured as follows. Section 3.1 describes and analyzes the weighted coreset construction. Section 3.2 discusses how to extract from the weighted coreset a final solution whose radius is, at most, a constant factor away from r_{k,z}(W), as long as a slightly larger number of outliers is tolerated. Section 3.3 shows how to remove an assumption made to simplify the presentation. Finally, Section 3.4 shows that, for spaces of bounded doubling dimension, the approximation factor can be lowered to a mere 3+ε for any fixed ε, at the expense of larger working memory requirements.

3.1. Weighted Coreset Construction

3.1.1. Algorithm

The proposed coreset construction hinges upon the approach of [21] for k-center without outliers, with major extensions introduced to maintain weights. Let d_min and d_max denote, respectively, the minimum and maximum distances between any two distinct points of the stream. For a user-defined constant β ∈ (0,1], let

Γ = { (1+β)^i : log_{1+β} d_min ≤ i ≤ log_{1+β} d_max }.
The values in Γ will be used as guesses of the optimal radius r_{k+z}(W) of a (k+z)-center clustering without outliers of the current window (recall that r_{k+z}(W) is a lower bound to the optimal radius r_{k,z}(W) for the problem with z outliers), and the algorithm will maintain suitable data structures capable of identifying the right guess. For ease of presentation, we assume for now that d_min and d_max are known to the algorithm. In Section 3.3, we will show how this assumption can be removed by maintaining estimates of the two values. For each guess γ, the algorithm maintains three sets of active points, namely A_γ, R_γ, and O_γ, and the coreset is extracted from these sets. A_γ is a small set of active points, called attraction points, such that, for any two distinct a_1, a_2 ∈ A_γ, dist(a_1, a_2) > 2γ.
At every time t, the arrival of a new point p is handled as follows, for every guess γ. If there exist attraction points a such that dist(p,a) ≤ 2γ, we define a_γ(p) as the one with minimum TTL, and say that p is attracted by a_γ(p). Otherwise, p becomes a new attraction point in A_γ, and we let a_γ(p) = p (i.e., p is attracted by itself). Set R_γ maintains, for each a ∈ A_γ, one representative r_γ(a), defined as the most recent point attracted by a. Note that, whereas a_γ(p) is fixed at p’s arrival, the representative r_γ(a) may change with time. When an attraction point a expires, its representative r_γ(a) becomes an orphan and is moved to the set O_γ.
Since the pairwise distance between the points of A_γ is > 2γ, if |A_γ| ≥ k+z+1, it is clear that γ < r_{k+z}(W), and, as will be seen below, the points in A_γ ∪ R_γ ∪ O_γ will not be used to extract the coreset. Therefore, to save memory, we set k+z+1 as a threshold for |A_γ|: when |A_γ| = k+z+1 and the newly arrived point qualifies to be an attraction point, the algorithm discards the point a ∈ A_γ with minimum TTL, and moves its representative r_γ(a) to O_γ. As a further space saving, all points in O_γ older than a are discarded, since, throughout their residual lifespan, |A_γ| ≥ k+z+1, and hence they cannot contribute to a valid coreset.
At any time t, the coreset for the k-center problem with z outliers, w.r.t. the current window W, is obtained as T = R_γ̂ ∪ O_γ̂, where γ̂ is the smallest guess such that: (i) |A_γ̂| ≤ k+z; and (ii) by running the simple greedy strategy of [13], reviewed in Section 2.1, a set C of k+z points can be selected from A_γ̂ ∪ R_γ̂ ∪ O_γ̂, such that any other point in this set is at a distance of, at most, 2γ̂ from a selected point. To ensure that an accurate solution to the k-center problem with z outliers can be extracted from the coreset T, we need to weigh each point p ∈ T with (a suitably accurate estimate of) the number of window points for which p can act as a proxy. This requires maintaining additional information with the points of the various sets R_γ and O_γ, as explained below.
For each guess γ and each active point p ∈ W, we define its proxy π_γ(p) ∈ R_γ ∪ O_γ as the most recent active point r such that both p and r are attracted by a_γ(p). (Note that r may be an orphan if a_γ(p) was discarded from A_γ.) Thus, π_γ(p) = r_γ(a_γ(p)). Therefore, at any time t, the proxy function π_γ(·) defines a mapping between active points and points of O_γ ∪ R_γ, and, for every r ∈ O_γ ∪ R_γ, we define its weight w_γ(r) = |{p ∈ W : π_γ(p) = r}|. For each r ∈ R_γ ∪ O_γ, our algorithm maintains a histogram L_{r,γ} = {(t_{r,1}, c_{r,1}), (t_{r,2}, c_{r,2}), …}, which is a list of pairs (timestamp, weight) such that there are c_{r,j} points p assigned to r (i.e., for which π_γ(p) = r) that arrived at or after time t_{r,j}. When a point p arrives at time t, the histograms are updated as follows (for the sake of readability, from now on, we drop the subscript γ from histograms and weights, when clear from the context). If p becomes a new attraction point (hence, p = a_γ(p) = r_γ(p)), a new histogram L_p = {(t,1)} is created and assigned to p. If, instead, p is attracted by some a ∈ A_γ, then p becomes the representative r_γ(a) and inherits the histogram from the previous representative, modified by increasing all weights by 1 and adding the new pair (t,1). In addition, all histogram entries with timestamp t − |W| are discarded, since they refer to the point that expired at time t. The pairs in each histogram are naturally sorted by increasing order of timestamps and decreasing order of weights. Observe that, when a representative r becomes an orphan, and hence is moved from R_γ to O_γ, its histogram L_r does not acquire new entries until r expires.
At any time t, for a histogram L_r, we denote by c_{L_r} the weight of the pair in L_r with the smallest timestamp (which is greater than or equal to t − |W| + 1, by virtue of the elimination of old entries described above). It is easy to see that c_{L_r} = w(r), that is, the number of points p for which r is the proxy. Unfortunately, keeping the full histogram L_r for each r ∈ R_γ ∪ O_γ requires a working memory of size Θ(|W|), which is far beyond the space bound targeted by sliding window algorithms. Therefore, taking inspiration from the smooth histograms of [27], we maintain in L_r only a trimmed version of the full list, which, however, ensures that c_{L_r} is an estimate of w(r) with a controlled level of accuracy. Specifically, let λ > 0 be a user-defined accuracy parameter. Every time a histogram L_r is updated, a scan of the pairs is performed that implements the following trimming (a code sketch follows the two rules):
  • The first pair (t,c) is kept in the histogram;
  • If a pair (t,c) is kept in the histogram, all subsequent pairs (t′,c′) with t′ > t and c ≤ (1+λ)c′ are deleted, except for the last such pair, if any.
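The following Python sketch implements one possible reading of these two rules; the paper's Algorithm 3 below is the authoritative version, and the sketch is only meant to convey how a single scan keeps the retained weights geometrically spaced (so that the trimmed list has O(log_{1+λ}|W|) pairs, as Lemma 2 shows):

def trim_histogram(L, lam):
    """Trim a histogram L = [(timestamp, weight), ...], sorted by
    increasing timestamp and decreasing weight: keep the first pair;
    after each kept pair of weight c, delete the following pairs whose
    weight c2 still satisfies c <= (1 + lam) * c2, except the last one."""
    out, i, n = [], 0, len(L)
    while i < n:
        out.append(L[i])                  # current pair is kept
        c = L[i][1]
        j = i + 1                         # scan the run within a (1+lam) factor
        while j < n and c <= (1 + lam) * L[j][1]:
            j += 1
        if j - 1 > i:
            out.append(L[j - 1])          # keep only the last pair of the run
        i = j                             # the next pair (if any) is kept too
    return out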
At any time t, the weighted coreset that will be used to solve the k-center problem with z outliers for the current window W consists of the set T = R_γ̂ ∪ O_γ̂, where the guess γ̂ is computed as described above, and each r ∈ T is assigned weight w̃(r) = c_{L_r}. As will be shown in the next subsection, the w̃(r)'s are good approximations of the true weights w(r)'s, and this will provide a good bi-criteria approximation quality for the returned solution.
The pseudocode detailing the algorithm is provided below. The arrival of a new point p at time t is handled by the main Procedure update(p,t) (Algorithm 1), which, for every guess γ, invokes, in turn, Procedure insertAttraction(p,γ) (Algorithm 2) when p must be added to A_γ, or Procedure updateHistograms(L_{p,γ}) (Algorithm 3) when p becomes a new representative of some existing point of A_γ. To extract the weighted coreset, Procedure extractCoreset() (Algorithm 4) is executed. (A condensed reconstruction of the per-guess update logic, in Python, follows the listings.)
Algorithm 1: update(p, t) (pseudocode figure not reproduced here)
Algorithm 2: insertAttraction(p, γ) (pseudocode figure not reproduced here)
Algorithm 3: updateHistograms(L) (pseudocode figure not reproduced here)
Algorithm 4: extractCoreset( ) (pseudocode figure not reproduced here)
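To fix ideas, here is our own condensed reconstruction of how a single guess γ might process an arrival, combining the roles of Algorithms 1–3. It reuses dist and trim_histogram from the earlier sketches, assumes points are coordinate tuples, and simplifies some bookkeeping; the listings above remain the authoritative description.

def process_arrival(p, t, gamma, state, k, z, lam, N):
    """Process the arrival of point p at time t for one radius guess gamma.
    state holds: A (attraction point -> arrival time), R (attraction
    point -> representative), O (set of orphaned representatives), and
    hist (representative -> trimmed (timestamp, weight) histogram)."""
    A, R, O, hist = state["A"], state["R"], state["O"], state["hist"]
    arrival = lambda r: hist[r][-1][0]     # a representative's own arrival time

    # expired attraction points orphan their representatives;
    # expired orphans are dropped altogether
    for a in [a for a, ta in list(A.items()) if ta <= t - N]:
        O.add(R.pop(a)); del A[a]
    for r in [r for r in O if arrival(r) <= t - N]:
        O.discard(r); del hist[r]

    near = [a for a in A if dist(p, a) <= 2 * gamma]
    if near:                               # p is attracted: it becomes the new
        a = min(near, key=A.get)           # representative of the min-TTL attractor
        L = [(ts, w + 1)                   # inherit and refresh the histogram
             for ts, w in hist.pop(R[a]) if ts > t - N]
        L.append((t, 1))
        hist[p] = trim_histogram(L, lam)
        R[a] = p
    else:                                  # p becomes an attraction point
        if len(A) == k + z + 1:            # enforce the threshold on |A|
            a0 = min(A, key=A.get)         # evict the min-TTL attraction point
            cutoff = A.pop(a0)
            O.add(R.pop(a0))
            for r in [r for r in O if arrival(r) < cutoff]:
                O.discard(r); del hist[r]  # older orphans cannot help anymore
        A[p] = t; R[p] = p
        hist[p] = [(t, 1)]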

3.1.2. Analysis

The following two technical lemmas state important properties of the sets A_γ, O_γ, and R_γ, and of the histograms maintained by the algorithm.
Lemma 1.
At any time t, the following properties hold for every γ ∈ Γ:
1. 
If |A_γ| ≤ k+z, then max_{q∈W} dist(q, π_γ(q)) ≤ 4γ.
2. 
|A_γ|, |R_γ|, |O_γ| ≤ k+z+1.
Proof. 
In order to prove the lemma, it suffices to show that, if the properties hold after the processing of the (t−1)-th point, they are inductively maintained after the invocation of update(p,t). The proof makes use of essentially the same arguments employed in [21] [Lemmas 7, 8], straightforwardly adapted to account for the fact that, in our algorithm, each new point that does not become an attraction point is made the representative of a single attraction point, whereas, in [21], it would be made the representative of all attraction points at a distance of, at most, 2γ. □
Recall that, at any time t and for any guess γ, the histogram L_r associated with each point r ∈ R_γ ∪ O_γ is a list of pairs (t_{r,i}, c_{r,i}) indicating that there are currently c_{r,i} points, arrived at or after time t_{r,i}, whose proxy is r, for i = 1, 2, …. We have:
Lemma 2.
For any time t, guess γ, and r ∈ R_γ ∪ O_γ, the following properties hold for L_r.
1. 
For every 1 ≤ i ≤ |L_r|, c_{r,i} ≤ |W|;
2. 
For every 1 ≤ i ≤ |L_r| − 1, either c_{r,i} ≤ (1+λ)c_{r,i+1} or c_{r,i} = 1 + c_{r,i+1} > (1+λ)c_{r,i+1};
3. 
For every 1 ≤ i ≤ |L_r| − 2, c_{r,i} > (1+λ)c_{r,i+2};
4. 
|L_r| = O(log_{1+λ}|W|).
Proof. 
First, observe that Property 4 is an immediate consequence of Properties 1 and 3. Hence, we are left with proving Properties 1, 2, and 3. The properties clearly hold when the histogram L_r is first created (Line 12 of update); thus, we only need to show that, if they hold prior to an invocation of updateHistograms(L_r), they continue to hold for the histogram M created by updateHistograms(L_r), which becomes the new L_r at the end of the procedure. Property 1 holds since the weights are updated only as long as a_γ(r), the oldest point accounted for in the histogram, is active; hence, weights always represent sizes of subsets of points of the same window, and thus they never exceed |W|. Consider now Properties 2 and 3. Let L′_r denote the histogram L_r after the execution of Line 3 of updateHistograms(L_r) (i.e., after the increment of the weights and the addition of (t,1) at the end of the list). It is easy to argue that Properties 2 and 3 continue to hold for L′_r. Let us now show that they also hold for the histogram M at the end of the procedure. As for Property 2, consider two adjacent pairs (t_{r,i}, c_{r,i}) and (t_{r,i+1}, c_{r,i+1}) in M, with c_{r,i} > (1+λ)c_{r,i+1}. Then, two pairs with the same timestamps and weights must exist in L′_r, and we can argue that these two pairs must be adjacent in L′_r, and hence their weights must differ by 1, since Property 2 holds for L′_r. Indeed, if the two pairs were not adjacent in L′_r, then at least one pair with a timestamp between t_{r,i} and t_{r,i+1} must have been removed in the for loop of updateHistograms(L_r). However, this is not possible, because, due to the way the loop operates, this would ensure that c_{r,i} ≤ (1+λ)c_{r,i+1}. Finally, Property 3 is enforced by the for loop of updateHistograms(L_r). □
The following theorem states the main properties of the weighted coreset T computed by our algorithm.
Theorem 1.
At any time t, the weighted coreset T returned by extractCoreset() is a 4(1+β)-coreset of size O(k+z) for the current window W w.r.t. the k-center problem with z outliers. Moreover, for each point r ∈ T, we have w(r)/(1+λ) ≤ w̃(r) ≤ w(r).
Proof. 
Recall that T = R_γ̂ ∪ O_γ̂, where γ̂ is the minimum guess for which the following two conditions are verified: |A_γ̂| ≤ k+z, and the greedy selection strategy of [13] applied to A_γ̂ ∪ R_γ̂ ∪ O_γ̂ returns a set C of k+z points such that dist(p,C) ≤ 2γ̂, for every p ∈ A_γ̂ ∪ R_γ̂ ∪ O_γ̂. It is easy to see that these two conditions are surely verified for any γ ≥ r_{k+z}(W). Therefore, considering the density of the guesses in Γ, we must have γ̂ ≤ (1+β)r_{k+z}(W). By Lemma 1, we have |T| = O(k+z) and max_{p∈W} dist(p,T) ≤ 4γ̂, which implies max_{p∈W} dist(p,T) ≤ 4(1+β)r_{k+z}(W) ≤ 4(1+β)r_{k,z}(W).
We now show the relation concerning the approximate weights w̃(r). Recall that, for each r ∈ T, the value w̃(r) is set equal to c_{L_r}, which is the weight component of the pair in L_r with the smallest timestamp ≥ t − |W| + 1. Let (t_{r,i}, c_{r,i}) be the pair of L_r such that c_{r,i} = c_{L_r}. If i > 1, then the relation c_{r,i} ≤ w(r) ≤ c_{r,i−1} must hold, and Property 2 of Lemma 2 ensures that c_{r,i−1} ≤ c_{r,i}(1+λ) or c_{r,i−1} = c_{r,i} + 1. In the former case, we have w(r)/(1+λ) ≤ c_{r,i} = c_{L_r} ≤ w(r), whereas, in the latter case, it is easy to see that c_{L_r} = w(r). If, instead, i = 1, it is easy to see that c_{L_r} = w(r), since the first pair of the histogram is always relative to the arrival of the attraction point a_γ̂(r), and hence the weight of such pair accounts for all points assigned to it. □
The following theorem analyzes the space and time performance of our coreset construction strategy.
Theorem 2.
The data structures used by our coreset construction strategy require a working memory of size O(log_{1+β}(d_max/d_min) · (k+z) · log_{1+λ}(|W|)). Moreover, Procedure update can be implemented to run in time

O(log_{1+β}(d_max/d_min) · ((k+z) + log_{1+λ}(|W|))),

whereas Procedure extractCoreset can be implemented to run in time

O(log_{1+β}(d_max/d_min) · (k+z)²).

If binary search is used to find the value γ̂, the running time of extractCoreset decreases to

O(log(log_{1+β}(d_max/d_min)) · (k+z)²).
Proof. 
The bound on the working memory is an immediate consequence of Lemmas 1 and 2, and of the fact that |Γ| = O(log_{1+β}(d_max/d_min)). Procedure update is easily implemented through a constant number of linear scans of A_γ, R_γ, and O_γ, for every γ ∈ Γ, and a linear scan of, at most, one histogram. For what concerns Procedure extractCoreset, for each γ ∈ Γ, the computation of C requires time quadratic in |A_γ| + |R_γ| + |O_γ|, and the number of guesses γ ∈ Γ to be checked is, at most, |Γ| if a linear search is used, and O(log|Γ|) if binary search is used. In addition, time O(|T|) is needed to compute w̃(r) for all r ∈ T. Based on these considerations, the complexity bounds follow again from Lemmas 1 and 2 and the size of Γ. □

3.2. Computation of the Solution from the Coreset

At any time t, to compute a solution to the k-center problem with z outliers on the window W, we first determine a weighted coreset T, as explained in the previous subsection, and then, on T, we run a sequential strategy for the weighted variant of the problem. To this purpose, we make use of the algorithm developed in [8,9], which generalizes the sequential algorithm of [7] to the weighted case. Given as input a weighted set T, a value k, a precision parameter ε, and a radius ρ, the algorithm computes a set X of k centers incrementally in, at most, k iterations. Initially, all points are considered uncovered. At each iteration, the next center is selected as the point that maximizes the aggregate weight of the yet uncovered points within distance (1+2ε)ρ, and such a center covers all points in T within the larger distance (3+4ε)ρ. The algorithm terminates when either |X| = k or no uncovered points remain, returning X and the subset T′ ⊆ T of uncovered points. This algorithm is implemented by Procedure outliersCluster(T,k,ρ,ε); a Python sketch follows the listing below. The next lemma states an important property related to the use of outliersCluster (Algorithm 5) in our context.
Algorithm 5: outliersCluster(T, k, ρ, ε) (pseudocode figure not reproduced here)
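The following sketch renders the weighted greedy strategy just described in Python, under our own data layout (a list of (point, weight) pairs; dist as in the earlier sketches); it is an illustration of Algorithm 5, not the paper's exact pseudocode.

def outliers_cluster(T, k, rho, eps):
    """Weighted strategy of [8,9]: repeatedly pick the point of T whose
    (1+2*eps)*rho-ball covers the largest aggregate weight of still
    uncovered points, then mark everything within (3+4*eps)*rho of it
    as covered. Returns the centers X and the uncovered pairs T_prime."""
    uncovered = list(T)
    X = []
    while len(X) < k and uncovered:
        x = max((p for p, _ in T),
                key=lambda c: sum(w for q, w in uncovered
                                  if dist(q, c) <= (1 + 2 * eps) * rho))
        X.append(x)
        uncovered = [(q, w) for q, w in uncovered
                     if dist(q, x) > (3 + 4 * eps) * rho]
    return X, uncovered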
Lemma 3.
For any time t, let T = R_γ̂ ∪ O_γ̂ be the weighted coreset returned by extractCoreset(). For any ρ ≥ r_{k,z}(W), the invocation outliersCluster(T,k,ρ,ε) with ε = 4(1+β) returns a set of centers X and a set of uncovered points T′ ⊆ T, such that
  • dist(r,X) ≤ (3+4ε)ρ for every r ∈ T ∖ T′;
  • ∑_{r∈T′} w̃(r) ≤ z.
Proof. 
The proof can be obtained as a simple technical adaptation of the one of [9] [Lemma 5] to the case of approximate weights. For clarity, we detail the entire proof rather than highlighting the changes only. The bound on dist(r,X) for every r ∈ T ∖ T′ is directly enforced by the algorithm. We are left to show that ∑_{r∈T′} w̃(r) ≤ z. If |X| < k, then T′ = ∅ and the claim holds trivially. We now concentrate on the case |X| = k. Consider the i-th iteration of the while loop of outliersCluster(T,k,ρ,ε) and define x_i as the center of X selected in the iteration, and T_i as the set of uncovered points at the beginning of the iteration. Recall that x_i is the point of T that maximizes the cumulative approximate weight of the set B_{x_i} of uncovered points in T_i at a distance of, at most, (1+2ε)·ρ from x_i, and that the set E_{x_i} of all uncovered points at a distance of, at most, (3+4ε)·ρ from x_i is removed from T_i at the end of the iteration. Let

σ_T = ∑_{r∈T} w̃(r).

We now show that

∑_{i=1}^{k} ∑_{r∈E_{x_i}} w̃(r) ≥ σ_T − z,      (2)

which will immediately imply that ∑_{r∈T′} w̃(r) ≤ z. To this purpose, let O be an optimal set of k centers for the current window W, and let Z be the set of, at most, z outliers at a distance greater than r_{k,z}(W) from O. Let also W̃ be a subset of the current window that, for every r ∈ T, contains exactly w̃(r) points q, including r, for which r is a proxy, that is, points q ∈ W such that π_γ̂(q) = r. Note that W̃ is well defined since, by Theorem 1, w̃(r) is always less than or equal to the actual weight of r. For each o ∈ O, define C_o ⊆ W̃ ∖ Z as the set of nonoutlier points in W̃ that are closer to o than to any other center of O, with ties broken arbitrarily. It is important to remark that, while there may be some optimal center o ∈ O that is not in W̃, its proxy π_γ̂(o) is in T; hence, it is guaranteed to be in W̃. To prove Equation (2), it is sufficient to exhibit an ordering o_1, o_2, …, o_k of the centers in O so that, for every 1 ≤ i ≤ k, it holds that

∑_{j=1}^{i} ∑_{r∈E_{x_j}} w̃(r) ≥ |C_{o_1} ∪ ⋯ ∪ C_{o_i}|.
The proof uses an inductive charging argument to assign each point in ⋃_{j=1}^{i} C_{o_j} to a point in ⋃_{j=1}^{i} E_{x_j}, where each r in the latter set will be in charge of, at most, w̃(r) points. We define two charging rules. A point can be either charged to its own proxy (Rule 1) or to another point of T (Rule 2).
Fix some arbitrary i, with 1 ≤ i ≤ k, and assume, inductively, that the points in C_{o_1} ∪ ⋯ ∪ C_{o_{i−1}} have been charged to points in ⋃_{j=1}^{i−1} E_{x_j} for some choice of distinct optimal centers o_1, o_2, …, o_{i−1}. We have two cases.
Case 1.
There exists an optimal center o still unchosen such that there is a point v ∈ C_o with π_γ̂(v) ∈ B_{x_j}, for some 1 ≤ j ≤ i. We choose o_i as one such center. Hence, d(x_j, π_γ̂(v)) ≤ (1+2ε)·ρ. By repeatedly applying the triangle inequality, we have that, for each u ∈ C_{o_i},

d(x_j, π_γ̂(u)) ≤ d(x_j, π_γ̂(v)) + d(π_γ̂(v), v) + d(v, o_i) + d(o_i, u) + d(u, π_γ̂(u)) ≤ (3+4ε)·ρ;

hence, π_γ̂(u) ∈ E_{x_j}. Therefore, we can charge each point u ∈ C_{o_i} to its proxy by Rule 1.
Case 2.
For each unchosen optimal center o and each v ∈ C_o, π_γ̂(v) ∉ ⋃_{j=1}^{i} B_{x_j}. We choose o_i to be the unchosen optimal center that maximizes the cardinality of {π_γ̂(u) : u ∈ C_{o_i}} ∩ T_i. We distinguish between points u ∈ C_{o_i} with π_γ̂(u) ∉ T_i, hence with π_γ̂(u) ∈ ⋃_{j=1}^{i−1} E_{x_j}, and those with π_γ̂(u) ∈ T_i. We charge each u ∈ C_{o_i} with π_γ̂(u) ∉ T_i to its own proxy by Rule 1. As for the other points, we now show that we can charge them to the points of B_{x_i}. To this purpose, we first observe that B_{π_γ̂(o_i)} contains {π_γ̂(u) : u ∈ C_{o_i}} ∩ T_i, since, for each u ∈ C_{o_i},

d(π_γ̂(o_i), π_γ̂(u)) ≤ d(π_γ̂(o_i), o_i) + d(o_i, u) + d(u, π_γ̂(u)) ≤ (1+2ε)·r_{k,z}(W) ≤ (1+2ε)·ρ.

Therefore, the aggregate approximate weight of B_{π_γ̂(o_i)} is at least |{u ∈ C_{o_i} : π_γ̂(u) ∈ T_i}|. Since Iteration i selects x_i as the center such that B_{x_i} has maximum aggregate approximate weight, we have that

∑_{r∈B_{x_i}} w̃(r) ≥ ∑_{r∈B_{π_γ̂(o_i)}} w̃(r) ≥ |{u ∈ C_{o_i} : π_γ̂(u) ∈ T_i}|;

hence, the points u ∈ C_{o_i} with π_γ̂(u) ∈ T_i can be charged to the points in B_{x_i}, since B_{x_i} has enough aggregate approximate weight.
Note that the points of B_{x_i} did not receive any charging by Rule 1 in previous iterations, since they are uncovered at the beginning of Iteration i, and will not receive chargings by Rule 1 in subsequent iterations, since B_{x_i} does not intersect the set C_o of any optimal center o yet to be chosen. In addition, no further charging to the points of B_{x_i} by Rule 2 will happen in subsequent iterations, since Rule 2 will only target sets B_{x_h} with h > i. These observations ensure that any point of T receives charges through either Rule 1 or Rule 2, but not both, and never in excess of its weight, and the proof follows. □
At any time t, to obtain the desired solution, we invoke Procedure computeSolution, which works as follows (see Algorithm 6 for the pseudocode, and the Python sketch after it). The procedure first extracts the coreset T = R_γ̂ ∪ O_γ̂ by calling extractCoreset. Then, it sets ε = 4(1+β) and runs outliersCluster(T,k,ρ,ε) for a geometric sequence of values of ρ between d_min and d_max, with step 1+β, stopping at the minimum value ρ_min for which the pair (X,T′) returned by outliersCluster(T,k,ρ_min,ε) is such that the aggregate approximate weight of the set T′ is, at most, z. (Note that the parameter β in the definitions of ε and of the step used in the geometric search for ρ_min is the same one that appears in the definition of Γ; hence, β ∈ (0,1].) At this point, the set of centers X is returned as a solution to the k-center problem with z outliers on the current window W.
Algorithm 6: computeSolution( ) (pseudocode figure not reproduced here)
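A minimal sketch of this geometric search, under our own naming and reusing outliers_cluster from above (d_min_est and d_max_est stand for whatever bounds on d_min and d_max are available):

def compute_solution(coreset, k, z, beta, d_min_est, d_max_est):
    """Geometric search for rho_min: try radii d_min_est * (1+beta)^j in
    increasing order and stop as soon as outliers_cluster leaves an
    uncovered aggregate weight of at most z. coreset is a list of
    (point, weight) pairs, as returned by the coreset extraction."""
    eps = 4 * (1 + beta)
    rho = d_min_est
    while rho <= d_max_est * (1 + beta):
        X, T_prime = outliers_cluster(coreset, k, rho, eps)
        if sum(w for _, w in T_prime) <= z:
            return X
        rho *= 1 + beta
    return None  # unreachable when d_max_est >= r_{k,z}(W), by Lemma 3

Replacing the linear scan over ρ with a binary search over the exponent j yields the improved running time stated in Theorem 3 below.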
The following theorem highlights the tradeoff between the accuracy (in terms of both the radius and excess number of outliers) and the performance exhibited by computeSolution.
Theorem 3.
At any time t, computeSolution returns a set X ⊆ W of, at most, k centers, such that at least |W| − (1+λ)z points of W are at a distance of, at most, (23+55β)·r_{k,z}(W) from X. The procedure requires a working memory of size O((k+z)·log_{1+β}(d_max/d_min)·log_{1+λ}(|W|)) and runs in time

O(log_{1+β}(d_max/d_min) · k(k+z)²).

If the while loop is substituted by a binary search for ρ_min, the running time decreases to

O(log(log_{1+β}(d_max/d_min)) · k(k+z)²).
Proof. 
Let ρ_min be such that outliersCluster(T,k,ρ_min,ε) yields (X,T′), where X is the returned solution, and let W′ be the set of points of W whose proxies are in T′. By Lemma 3 and the choice of the step of the geometric search for ρ_min, we have that ρ_min ≤ (1+β)r_{k,z}(W); hence, for each p ∈ W ∖ W′, dist(π_γ̂(p),X) ≤ (3+4ε)ρ_min ≤ (3+4ε)(1+β)r_{k,z}(W). Since T is an ε-coreset (Theorem 1) and β ∈ (0,1], we conclude that, for every p ∈ W ∖ W′,

dist(p,X) ≤ dist(p,π_γ̂(p)) + dist(π_γ̂(p),X) ≤ ε·r_{k,z}(W) + (3+4ε)(1+β)·r_{k,z}(W) < (23+55β)·r_{k,z}(W).

Moreover, by Theorem 1, we also have that |W′| ≤ (1+λ)∑_{r∈T′} w̃(r) ≤ (1+λ)z. The working memory bound is an immediate consequence of Theorem 2, since the working memory is dominated by the data structures from which the coreset is extracted. For what concerns the running time, observe that outliersCluster can be easily implemented to run in time O(k·|T|²), which is O(k·(k+z)²), since |T| = O(k+z) by Theorem 1. Therefore, the bound on the running time follows, since extractCoreset requires time O(log(log_{1+β}(d_max/d_min))·(k+z)²) (by Theorem 2), and the while loop performs, at most, O(log_{1+β}(d_max/d_min)) executions of outliersCluster, which can be lowered to O(log(log_{1+β}(d_max/d_min))) using binary search. □
The above result shows that allowing for a slight excess in the number of outliers, governed by the parameter λ, results in improved space and time complexities. Note that, if the upper bound z on the number of outliers must be rigidly enforced, it is sufficient to set λ = 1/(2z). In this case, (1+λ)z = z + 1/2 and, since the number of outliers must be an integer, it cannot be larger than z. The following corollary is an immediate consequence of Theorem 3, of this observation, and of the fact that log_{1+λ} x = Θ((1/λ)·log x) for λ ∈ (0,1).
Corollary 1.
Let λ = 1/(2z). At any time t, computeSolution returns a set X ⊆ W of, at most, k centers, such that at least |W| − z points of W are at a distance of, at most, (23+55β)·r_{k,z}(W) from X. The procedure requires a working memory of size O(z(k+z)·log_{1+β}(d_max/d_min)·log(|W|)) and runs in time

O(log_{1+β}(d_max/d_min) · k(k+z)²).

If the while loop is substituted by a binary search for ρ_min, the running time decreases to

O(log(log_{1+β}(d_max/d_min)) · k(k+z)²).

3.3. Obliviousness to d_min and d_max

The algorithm described in Section 3.1 and Section 3.2 requires the knowledge of the values d min and d max . In this subsection, we show how to remove this requirement by employing the techniques developed in [22], suitably extended to cope with histograms, which were not used in that work.
Let p_1, p_2, … be an enumeration of all points of the stream S based on their arrival times. For t > k+z, let d_t be the minimum pairwise distance between the last k+z+1 points of the stream (p_{t−k−z}, …, p_{t−1}, p_t). Let also D_t be the maximum distance between p_1 and any p_i with i ≤ t, and note that, by the triangle inequality, the maximum pairwise distance among the first t points of S is upper bounded by 2D_t. It is easy to argue that d_t/2 ≤ r_{k+z}(W) ≤ r_{k,z}(W) ≤ 2D_t. By storing p_1 and the last k+z+1 points of the stream, the values d_t and D_t can be straightforwardly maintained by the algorithm with O((k+z)²) operations per step, as sketched below. We define

Γ_t = { (1+β)^i : log_{1+β}(d_t/2) ≤ i ≤ log_{1+β}(2D_t) }.
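A small sketch of this maintenance step (our own naming; it reuses dist and assumes at least two points have already arrived):

import itertools, math

def update_estimates(recent, p1, p, D_prev, k, z):
    """Maintain d_t (minimum pairwise distance among the last k+z+1
    points, p included) and D_t (maximum distance of any point seen so
    far from the first point p1). O((k+z)^2) work per arrival."""
    recent.append(p)
    if len(recent) > k + z + 1:
        recent.pop(0)
    d_t = min(dist(u, v) for u, v in itertools.combinations(recent, 2))
    D_t = max(D_prev, dist(p1, p))
    return d_t, D_t

def guesses(d_t, D_t, beta):
    """Gamma_t: the powers of (1+beta) whose exponent lies between
    log_{1+beta}(d_t/2) and log_{1+beta}(2*D_t)."""
    lo = math.ceil(math.log(d_t / 2) / math.log(1 + beta))
    hi = math.floor(math.log(2 * D_t) / math.log(1 + beta))
    return [(1 + beta) ** i for i in range(lo, hi + 1)]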
Suppose that, at any time t, the algorithm maintains the sets A_γ, R_γ, and O_γ, and the histograms for the points in R_γ ∪ O_γ, only for γ ∈ Γ_t, and assume that the properties stated in Lemmas 1 and 2 hold for every γ ∈ Γ_t and r ∈ R_γ ∪ O_γ. Then, by running Procedure extractCoreset and limiting the search for γ̂ to the set Γ_t, we still obtain a 4(1+β)-coreset for the current window. To see this, we first note that, based on the previous observation, Γ_t definitely includes a value γ with r_{k+z}(W) ≤ γ ≤ (1+β)r_{k+z}(W). By repeating the same argument used in the proof of Theorem 1, we can show that, for such a value of γ, we have that |A_γ| ≤ k+z, and that the inner for loop of extractCoreset computes a set C of, at most, k+z points. This immediately implies that extractCoreset determines a guess γ̂ ≤ (1+β)r_{k+z}(W) and that the returned coreset T = R_γ̂ ∪ O_γ̂ is a 4(1+β)-coreset.
We now show how to modify the algorithm described in the previous subsections (referred to as the full algorithm in what follows) to maintain, without the knowledge of d_min and d_max, the sets A_γ, R_γ, and O_γ and the required histograms, for every guess γ ∈ Γ_t. Suppose that this is the case up to some time t−1 > k+z, and consider the arrival of p_t. Before invoking update(p_t,t), the algorithm executes the operations described below.
First, the new values d_t and D_t are computed, and all sets relative to values of γ ∈ Γ_{t−1} ∖ Γ_t are removed. If d_t < d_{t−1}, then, for each γ ∈ Γ_t with γ < min{γ′ ∈ Γ_{t−1}} (hence, γ < d_{t−1}/2), the algorithm sets A_γ = {p_{t−k−z−1}, …, p_{t−1}} = R_γ and O_γ = ∅. Moreover, for each p_τ ∈ R_γ, it sets L_{p_τ} = {(τ,1)}, as each point represents itself only. Since any two points in A_γ are at a distance of at least d_{t−1} > 2γ, it is easy to see that these newly created data structures coincide with the ones that the full algorithm would store at time t−1 (for the same γ's) if the stream started at time t−(k+z+1); hence, they satisfy the properties of Lemmas 1 and 2.
If D_t > D_{t−1}, then, for each γ ∈ Γ_t with γ > max{γ′ ∈ Γ_{t−1}}, the algorithm sets A_γ = {p_{t−|W|}}, R_γ = {p_{t−1}}, and O_γ = ∅. It is easy to see that these newly created sets coincide with the ones that the full algorithm would store at time t−1 (for the same γ's) if the stream started at time t−|W|; hence, they satisfy the properties of Lemma 1. It has to be remarked that, although point p_{t−|W|} is not available at time t−1, this is not a problem, since the point immediately expires at time t and is removed from the data structures without even being used in the processing of p_t. Hence, in this context, p_{t−|W|} acts as a mere placeholder. For what concerns the histogram L_{p_{t−1}} to associate with p_{t−1} (for every γ > max{γ′ ∈ Γ_{t−1}}), its exact version represents the entire active window, and hence it would be the list {(t−|W|, |W|), …, (t−2, 2), (t−1, 1)}. In order to satisfy the properties of Lemma 2, L_{p_{t−1}} can be trimmed as follows. Let f(x) = ⌈x/(1+λ)⌉. Then, L_{p_{t−1}} contains the set of pairs (t−c_i, c_i), with i ≥ 0, where the c_i's form a decreasing sequence of values ≥ 1 such that c_0 = |W| and, for each i ≥ 0 with c_i > 1, c_{i+1} = min{c_i − 1, f(c_i)}.
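A direct rendering of this construction (our own helper name; the ceiling in f follows the reconstruction above):

import math

def full_window_histogram(t, W_len, lam):
    """Trimmed histogram representing the whole active window at time t:
    weights c_0 = W_len and c_{i+1} = min(c_i - 1, ceil(c_i / (1+lam)))
    while c_i > 1, paired with timestamps t - c_i. The list has
    O(log_{1+lam}(W_len)) pairs and satisfies the properties of Lemma 2."""
    cs = [W_len]
    while cs[-1] > 1:
        c = cs[-1]
        cs.append(min(c - 1, math.ceil(c / (1 + lam))))
    return [(t - c, c) for c in cs]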
Lemma 4.
The list L_{p_{t−1}} defined above satisfies the properties stated in Lemma 2.
Proof. 
Property 1 clearly holds. As for Property 2, consider two consecutive pairs (t−c_i, c_i) and (t−c_{i+1}, c_{i+1}). If c_{i+1} = f(c_i), then it is certain that c_i ≤ (1+λ)c_{i+1}. Hence, when c_i > (1+λ)c_{i+1}, we must have c_{i+1} = c_i − 1, and the property follows. Property 3 holds since, for 1 ≤ i ≤ |L_r| − 2, c_{i+2} ≤ c_{i+1} − 1 < (c_i/(1+λ) + 1) − 1, and hence (1+λ)c_{i+2} < c_i. Finally, the bound stated by Property 4 follows as a direct consequence of the first three properties. □
Once the correct configurations of the data structures for all guesses γ ∈ Γ_t are obtained, Procedure update(p_t,t) is invoked to complete step t, thus enforcing the properties of Lemmas 1 and 2 for all of these data structures.

3.4. Improved Approximation under Bounded Doubling Dimension

Consider a stream S of doubling dimension D. We now outline an improved coreset construction that is able to provide a δ-coreset T for the k-center problem with z outliers, for any given δ > 0, at the expense of a blow-up in the working memory size, which is analyzed as a function of D and is tolerable for small (e.g., constant) D. This improved construction allows us to obtain a much tighter approximation for the k-center problem with z outliers in the sliding window setting.
Fix any given δ > 0. In [22], a refinement of the k-center strategy of [21] is presented that, for every guess γ, maintains two families of attraction, representative, and orphan points. The first family, referred to as validation points, features three O(k)-sized sets of attraction, representative, and orphan points, equivalent to those described in Section 3.1. Validation points are employed to identify a constant approximation γ̂ to the optimal radius r_k(W). The second family, referred to as coreset points, contains, for any guess γ, three “expanded” sets of attraction, representative, and orphan points, which refine the coverage provided by the corresponding sets of validation points, in the sense that the coreset points relative to the guess γ̂ yield a coreset T such that max_{p∈W} d(p,T) ≤ δ·r_k(W). For each γ, these larger sets contain O(k(c/δ)^D) points, for a suitable constant c.
We can augment the algorithm of [22] in the same fashion as we augmented the algorithm of [21], by endowing the representative and orphan coreset points with the histograms described in Section 3.1. Then, by running the resulting algorithm for k+z (instead of k) centers, the result stated in the following lemma is immediately obtained, where the parameter λ and the functions w(·) and w̃(·) have the same meanings as before.
Lemma 5.
Let δ, λ > 0 be two design parameters. For a stream S of doubling dimension D, suitable data structures can be maintained from which, at any time t, a weighted δ-coreset T of size O((c/δ)^D (k+z)), for a fixed constant c, can be extracted such that, for each p ∈ W, there exists a proxy π(p) ∈ T with

dist(p, π(p)) ≤ δ·r_{k+z}(W) ≤ δ·r_{k,z}(W).

Moreover, for each r ∈ T, an approximate weight w̃(r) can be computed, with w(r)/(1+λ) ≤ w̃(r) ≤ w(r), where w(r) = |{p ∈ W : π(p) = r}|. The data structures require a working memory of size O((c/δ)^D (k+z) log(d_max/d_min) log_{1+λ}(|W|)).
The analysis in [22] implies that the constant c can be fixed arbitrarily close to 32. We remark that the construction described in Section 3.1 cannot return δ-coresets with δ ≤ 4, whereas the result of Lemma 5 yields δ-coresets for any δ > 0.
To obtain the desired solution at any time t, we first compute a δ-coreset T satisfying the properties stated in the above lemma. Then, analogously to computeSolution, we run outliersCluster(T,k,ρ,δ) for a geometric sequence of values of ρ of step 1+β between d_min and d_max, stopping at the minimum value ρ_min for which, if (X,T′) is the output of outliersCluster(T,k,ρ_min,δ), the aggregate approximate weight of the set T′ is, at most, z. The algorithm returns X as the final set of (at most) k centers. By choosing β = δ/(3+4δ), we obtain the following result:
Theorem 4.
Let δ, λ > 0 be two design parameters and consider a stream of doubling dimension D. There exists a sliding window algorithm that, at any time t, returns a set of, at most, k centers X ⊆ W, such that at least |W| − (1+λ)z points of W are at a distance of, at most, (3+6δ)·r_{k,z}(W) from X. For a suitable constant c > 0, the algorithm makes use of a working memory of size O((c/δ)^D (k+z) log(d_max/d_min) log_{1+λ}(|W|)). In addition, the algorithm requires time

O(log_{1+β}(d_max/d_min) · ((c/δ)^D (k+z) + log_{1+λ}(|W|)))

to update the data structures after each point arrival, and time

O(log(log_{1+β}(d_max/d_min)) · (c/δ)^{2D} · k(k+z)²)

to compute the final solution, with β = δ/(3+4δ).
Proof. 
By reasoning as in the proof of Lemma 3, we can show that, by executing outliersCluster(T,k,ρ,δ) with any ρ ≥ r_{k,z}(W), the set of uncovered points at the end of the execution has an aggregate approximate weight of, at most, z. This immediately implies that ρ_min ≤ (1+β)r_{k,z}(W). Let (X,T′) be the output of outliersCluster(T,k,ρ_min,δ), and let W′ be the set of points of W whose proxies are in T′. We have that, for each p ∈ W ∖ W′, dist(π(p),X) ≤ (3+4δ)ρ_min ≤ (3+4δ)(1+β)r_{k,z}(W). Since T is a δ-coreset, we conclude that, for every p ∈ W ∖ W′,

dist(p,X) ≤ dist(p,π(p)) + dist(π(p),X) ≤ δ·r_{k,z}(W) + (3+4δ)(1+β)·r_{k,z}(W) ≤ (3+6δ)·r_{k,z}(W),

where the last inequality uses the fact that β = δ/(3+4δ). Moreover, we also have that |W′| ≤ (1+λ)∑_{r∈T′} w̃(r) ≤ (1+λ)z. Finally, the bound on the working memory follows from Lemma 5, whereas the time bounds for updating the data structures and extracting the solution from the coreset are obtained by adapting the arguments used to prove Theorems 2 and 3. □
The following corollary is the counterpart of Corollary 1 for the dimension-sensitive algorithm developed in this section.
Corollary 2.
Let λ = 1/(2z). At any time t, the algorithm is able to compute a set X ⊆ W of, at most, k centers, such that at least |W| − z points of W are at a distance of, at most, (3+6δ)·r_{k,z}(W) from X. For a fixed constant c > 0, the algorithm requires a working memory of size O((c/δ)^D z(k+z) log(d_max/d_min) log(|W|)) and runs in time

O(log(log_{1+β}(d_max/d_min)) · (c/δ)^{2D} · k(k+z)²),

with β = δ/(3+4δ).
It is important to remark that the algorithm in [22] that we have built upon is fully oblivious to D, d_max, and d_min. Its augmentation discussed above inherits the obliviousness to D straightforwardly, whereas the obliviousness to d_max and d_min is inherited through the technique described in Section 3.3.

4. Effective Diameter Estimation

Consider a stream S of doubling dimension D. Building on the improved coreset construction of Lemma 5, we now outline an algorithm that, at any time t, is able to compute lower and upper estimates of the α-effective diameter Δ_W^α of the current window W. The algorithm requires the knowledge of a (possibly crude) lower bound η ∈ (0,1) on the ratio between Δ_W^α and the diameter Δ_W (i.e., Δ_W^α ≥ η·Δ_W).
For a given ε > 0, let T be a weighted δ-coreset for W computed with the properties stated in Lemma 5, with δ = εη/2, k = 1, and z = 0. Hence, |T| = O((c′/(εη))^D), with c′ = 2c, where c is the same constant appearing in the statement of the lemma. Assume for now that, for each r ∈ T, the true weight w(r) = |{p ∈ W : π(p) = r}| is known. (Later, we will discuss the distortion introduced by using the approximate weights w̃(r).) An approximation to Δ_W^α can be computed on the coreset through the following quantity:
Δ T , W α = argmin d dist ( r 1 , r 2 ) d r 1 , r 2 T : w ( r 1 ) w ( r 2 ) α | W | 2 .
Lemma 6.
$(1-\varepsilon)\,\Delta_W^{\alpha} \;\le\; \Delta_{T,W}^{\alpha} \;\le\; (1+\varepsilon)\,\Delta_W^{\alpha}$.
Proof. 
Let $\hat{d} = \max\{\mathrm{dist}(p, \pi(p)) : p \in W\}$. Recall that the $\delta$-coreset $T$ was computed with $k = 1$ and $z = 0$; hence, using the properties of $T$ and the fact that $\Delta_W^{\alpha} \ge \eta\,\Delta_W$, we have that
$$\hat{d} \;\le\; \delta\,r_1(W) \;\le\; (\varepsilon\eta/2)\,\Delta_W \;\le\; (\varepsilon/2)\,\Delta_W^{\alpha}.$$
Using the triangle inequality, for any $d$ and any pair $(p,q) \in W \times W$, we have that, if $\mathrm{dist}(\pi(p), \pi(q)) \le d - 2\hat{d}$, then $\mathrm{dist}(p,q) \le d$. Thus, when $|\{(p,q) \in W \times W : \mathrm{dist}(\pi(p), \pi(q)) \le d - 2\hat{d}\}| \ge \alpha|W|^2$, we must also have $|\{(p,q) \in W \times W : \mathrm{dist}(p,q) \le d\}| \ge \alpha|W|^2$. Consequently,
$$\begin{aligned}
\Delta_W^{\alpha} &= \operatorname*{argmin}_{d} \left\{ |\{(p,q) \in W \times W : \mathrm{dist}(p,q) \le d\}| \ge \alpha|W|^2 \right\} \\
&\le \operatorname*{argmin}_{d} \left\{ |\{(p,q) \in W \times W : \mathrm{dist}(\pi(p),\pi(q)) \le d - 2\hat{d}\}| \ge \alpha|W|^2 \right\} \\
&= \operatorname*{argmin}_{d} \Bigg\{ \sum_{\substack{r_1,r_2 \in T:\\ \mathrm{dist}(r_1,r_2) \le d - 2\hat{d}}} w(r_1)\,w(r_2) \ge \alpha|W|^2 \Bigg\}
= \operatorname*{argmin}_{d} \Bigg\{ \sum_{\substack{r_1,r_2 \in T:\\ \mathrm{dist}(r_1,r_2) \le d}} w(r_1)\,w(r_2) \ge \alpha|W|^2 \Bigg\} + 2\hat{d} \\
&= \Delta_{T,W}^{\alpha} + 2\hat{d} \;\le\; \Delta_{T,W}^{\alpha} + \varepsilon\,\Delta_W^{\alpha},
\end{aligned}$$
thus proving the first stated inequality. The proof of the other inequality is accomplished with a symmetrical argument. The triangle inequality ensures that, for every pair $(p,q) \in W \times W$, if $\mathrm{dist}(p,q) \le d$, then $\mathrm{dist}(\pi(p), \pi(q)) \le d + 2\hat{d}$. Thus, when $|\{(p,q) \in W \times W : \mathrm{dist}(p,q) \le d\}| \ge \alpha|W|^2$, we must also have $|\{(p,q) \in W \times W : \mathrm{dist}(\pi(p),\pi(q)) \le d + 2\hat{d}\}| \ge \alpha|W|^2$. Consequently,
$$\begin{aligned}
\Delta_W^{\alpha} &= \operatorname*{argmin}_{d} \left\{ |\{(p,q) \in W \times W : \mathrm{dist}(p,q) \le d\}| \ge \alpha|W|^2 \right\} \\
&\ge \operatorname*{argmin}_{d} \left\{ |\{(p,q) \in W \times W : \mathrm{dist}(\pi(p),\pi(q)) \le d + 2\hat{d}\}| \ge \alpha|W|^2 \right\} \\
&= \operatorname*{argmin}_{d} \Bigg\{ \sum_{\substack{r_1,r_2 \in T:\\ \mathrm{dist}(r_1,r_2) \le d + 2\hat{d}}} w(r_1)\,w(r_2) \ge \alpha|W|^2 \Bigg\}
= \operatorname*{argmin}_{d} \Bigg\{ \sum_{\substack{r_1,r_2 \in T:\\ \mathrm{dist}(r_1,r_2) \le d}} w(r_1)\,w(r_2) \ge \alpha|W|^2 \Bigg\} - 2\hat{d} \\
&= \Delta_{T,W}^{\alpha} - 2\hat{d} \;\ge\; \Delta_{T,W}^{\alpha} - \varepsilon\,\Delta_W^{\alpha}. \qquad\square
\end{aligned}$$
Recall now that, for every coreset point $r \in T$, only an approximation $\tilde{w}(r)$ of the actual weight $w(r)$ is available, with $w(r)/(1+\lambda) \le \tilde{w}(r) \le w(r)$. We define the approximate counterpart of $\Delta_{T,W}^{\alpha}$ as
$$\tilde{\Delta}_{T,W}^{\alpha} \;=\; \operatorname*{argmin}_{d} \Bigg\{ \sum_{\substack{r_1, r_2 \in T:\\ \mathrm{dist}(r_1, r_2) \le d}} \tilde{w}(r_1)\,\tilde{w}(r_2) \;\ge\; \alpha\,|W|^2 \Bigg\}.$$
Our approximation algorithm returns $(1/(1+\varepsilon))\,\tilde{\Delta}_{T,W}^{\alpha/(1+\lambda)^2}$ and $(1/(1-\varepsilon))\,\tilde{\Delta}_{T,W}^{\alpha}$ as the lower and upper estimates, respectively, of the true effective diameter $\Delta_W^{\alpha}$. The following theorem establishes the tightness of these estimates and the space and time performance of the algorithm.
Theorem 5.
Consider a stream $S$ of doubling dimension $D$ and a value $\alpha \in (0,1)$. Suppose that a value $\eta < 1$ is known such that, for every window $W$, $\Delta_W^{\alpha} \ge \eta\,\Delta_W$. For any $\varepsilon, \lambda > 0$, there exists a sliding window algorithm that, at any time $t$, is able to compute a weighted coreset $T$ of size $O\left((c'/(\varepsilon\eta))^D\right)$, such that
$$\frac{1}{1+\varepsilon}\,\tilde{\Delta}_{T,W}^{\alpha/(1+\lambda)^2} \;\le\; \Delta_W^{\alpha} \;\le\; \frac{1}{1-\varepsilon}\,\tilde{\Delta}_{T,W}^{\alpha},$$
where $W$ is the current window and $c' > 0$ is a suitable constant. The algorithm makes use of a working memory of size $O\left((c'/(\varepsilon\eta))^D \log(d_{\max}/d_{\min}) \log_{1+\lambda}(|W|)\right)$. In addition, the algorithm requires time
$$O\left(\log_{1+\beta}(d_{\max}/d_{\min}) \cdot (c'/(\varepsilon\eta))^D + \log_{1+\lambda}(|W|)\right)$$
to update the data structures after each point arrival, where $\beta = \delta/(3+4\delta)$. The lower and upper estimates of $\Delta_W^{\alpha}$ can be computed from $T$ in time $O\left(|T|^2\right) = O\left((c'/(\varepsilon\eta))^{2D}\right)$.
Proof. 
We first prove that
$$\tilde{\Delta}_{T,W}^{\alpha/(1+\lambda)^2} \;\le\; \Delta_{T,W}^{\alpha} \;\le\; \tilde{\Delta}_{T,W}^{\alpha}.$$
Then, the stated approximation interval immediately follows from Lemma 6. Let us first prove the leftmost inequality. From Lemma 5, we have that, for every $r \in T$, $\tilde{w}(r) \ge w(r)/(1+\lambda)$. Hence, for any $d$ such that
$$\sum_{\substack{r_1, r_2 \in T:\\ \mathrm{dist}(r_1, r_2) \le d}} w(r_1)\,w(r_2) \;\ge\; \alpha\,|W|^2,$$
we have that
$$\sum_{\substack{r_1, r_2 \in T:\\ \mathrm{dist}(r_1, r_2) \le d}} \tilde{w}(r_1)\,\tilde{w}(r_2) \;\ge\; \frac{\alpha}{(1+\lambda)^2}\,|W|^2,$$
which implies $\tilde{\Delta}_{T,W}^{\alpha/(1+\lambda)^2} \le \Delta_{T,W}^{\alpha}$. The rightmost inequality is proved in a symmetrical fashion. Again, from Lemma 5, we have that, for every $r \in T$, $w(r) \ge \tilde{w}(r)$. Hence, for any $d$ such that
$$\sum_{\substack{r_1, r_2 \in T:\\ \mathrm{dist}(r_1, r_2) \le d}} \tilde{w}(r_1)\,\tilde{w}(r_2) \;\ge\; \alpha\,|W|^2,$$
we have that
$$\sum_{\substack{r_1, r_2 \in T:\\ \mathrm{dist}(r_1, r_2) \le d}} w(r_1)\,w(r_2) \;\ge\; \alpha\,|W|^2,$$
which implies $\Delta_{T,W}^{\alpha} \le \tilde{\Delta}_{T,W}^{\alpha}$. The bounds on the working memory and on the update time follow directly from Theorem 4 and from the choice $k = 1$ and $z = 0$. Finally, the estimates $\tilde{\Delta}_{T,W}^{\alpha/(1+\lambda)^2}$ and $\tilde{\Delta}_{T,W}^{\alpha}$ can be computed from $T$ in time $O\left(|T|^2\right) = O\left((c'/(\varepsilon\eta))^{2D}\right)$ using a simple strategy based on binary search. □
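For concreteness, the following minimal Java sketch computes $\tilde{\Delta}_{T,W}^{\alpha}$ from the weighted coreset; it is illustrative, not the authors' code, and it sorts the $|T|^2$ pair distances rather than binary-searching them, a simpler strategy with the same quadratic cost up to a logarithmic factor. Invoking it with level $\alpha/(1+\lambda)^2$ yields the quantity needed for the lower estimate.

```java
import java.util.Arrays;
import java.util.Comparator;

public class EffectiveDiameterOnCoreset {
    // Returns the smallest pairwise coreset distance d such that the ordered
    // pairs (r1, r2) with dist(r1, r2) <= d account, with weight w(r1)*w(r2),
    // for at least a fraction 'level' of |W|^2 pairs.
    // pts[i] holds the coordinates of coreset point i, w[i] its (approximate) weight.
    static double estimate(double[][] pts, long[] w, double level, long windowSize) {
        int m = pts.length;
        double[][] pairs = new double[m * m][2]; // {distance, weight product}
        int idx = 0;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++) {
                double s = 0;
                for (int t = 0; t < pts[i].length; t++) {
                    double d = pts[i][t] - pts[j][t];
                    s += d * d;
                }
                pairs[idx][0] = Math.sqrt(s);
                pairs[idx][1] = (double) w[i] * w[j];
                idx++;
            }
        Arrays.sort(pairs, Comparator.comparingDouble(p -> p[0]));
        double threshold = level * (double) windowSize * windowSize;
        double cum = 0;
        for (double[] pair : pairs) {
            cum += pair[1];
            if (cum >= threshold) return pair[0]; // smallest feasible d
        }
        return pairs[pairs.length - 1][0]; // level = 1 degenerates to the diameter
    }
}
```

The lower and upper estimates are then obtained by dividing estimate(pts, w, alpha / ((1 + lambda) * (1 + lambda)), N) by $1+\varepsilon$ and estimate(pts, w, alpha, N) by $1-\varepsilon$, respectively.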
The theorem implies that, by setting $\varepsilon$ and $\lambda$ sufficiently small, we can obtain tight estimates of $\Delta_W^{\alpha}$ for all windows in which the $\alpha$-effective diameter behaves smoothly in an interval to the left of $\alpha$. Finally, we remark that the algorithm is fully oblivious to $D$, $d_{\min}$, and $d_{\max}$. Moreover, while the theoretical space bound exhibits a dependency on $(1/\eta)^D$, in the next section, we provide experimental evidence that, for datasets where outliers represent true noise, the working space requirements exhibit a much milder dependence on the crudeness of the lower bound $\eta$.

5. Experiments

We implemented the algorithms for $k$-center with $z$ outliers and for the estimation of the effective diameter presented in Section 3 and Section 4. For $k$-center with $z$ outliers, we implemented the dimensionality-sensitive algorithm of Section 3.4, which offers a wider spectrum of performance–accuracy tradeoffs. We ran proof-of-concept experiments aimed at testing the algorithms' behavior against relevant competitors in terms of approximation, memory usage, and running time, both for processing each point arrival (update time) and for computing a solution for the current window whenever needed (query time). All tests were executed using Java 13 on a Windows machine with an AMD FX8320 processor and 12 GB of RAM; running times were measured with System.nanoTime, and points were fed to the algorithms through a file input stream.

5.1. k-Center with Outliers

Along with our algorithm (dubbed our-sliding), we implemented the sequential 3-approximation of [7] (dubbed charikar), to be run on the entire window $W$: a search for the minimum $\rho$ such that outliersCluster$(W, k, \rho, 0)$, run with unit weights, ends with at most $z$ uncovered points. We chose this sequential benchmark over the existing LP-based 2-approximation algorithms, since the latter do not seem to admit practical implementations. Since charikar itself has superquadratic complexity, which makes it infeasible for larger windows, we also devised a sampled version (dubbed samp-charikar), where the center selection in each call to outliersCluster examines only a fixed number of random candidates rather than all window points. We deemed it unnecessary to compare our algorithm with the one in [23] since, as mentioned in the introduction, the latter needs to run an instance of outliersCluster for each update operation and would thus be a poor competitor of our strategy, which confines the execution of this expensive sequential procedure to the query operation.
Also, to assess the importance of using a specialized algorithm to handle outliers, we compared the quality of our solution against the one returned by the algorithm of [12] for k-center without outliers (dubbed gon), where the radius is computed excluding the z largest distances from the centers.
The algorithms were tested on the following datasets, often used in previous works [8]: the Higgs dataset (http://archive.ics.uci.edu/ml/datasets/HIGGS, accessed on 10 January 2022), which contains 11 million seven-dimensional points representing high-energy particle features generated through Monte Carlo simulations; and the Cover dataset (https://archive.ics.uci.edu/ml/datasets/covertype, accessed on 10 January 2022), which contains 581,012 55-dimensional points from geological observations of US forest biomes and was employed as a stress test for our dimensionality-sensitive algorithm. We also generated inflated versions of the original datasets, dubbed Higgs+ and Cover+, by artificially injecting a new true outlier after each original point with probability $p$, where the new point has a norm 100 times the diameter of the original dataset (e.g., as if produced by a malfunctioning sensor). The probability $p$ was chosen to yield $z/2$ true outliers per window, in expectation. We performed tests for $k = 10$, $z \in \{10, 50\}$, and window sizes $|W| = N \in \{10^4, 10^5, 10^6\}$, using the Euclidean distance.
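As an illustration, the inflation step can be sketched as follows. The direction of the injected outlier is an assumption (the construction above only fixes its norm and injection rate), and this is not the actual generator used in the experiments.

```java
import java.util.Random;

public class StreamInflater {
    static final Random RNG = new Random(42);

    // After each original point, with probability p = z / (2.0 * N), emit an
    // artificial outlier in a uniformly random direction whose norm equals
    // 100 times the original dataset diameter, so that a window of size N
    // contains z/2 true outliers in expectation. Returns null when no
    // outlier is injected after the current point.
    static double[] maybeInjectOutlier(int dim, double datasetDiameter, double p) {
        if (RNG.nextDouble() >= p) return null;
        double[] q = new double[dim];
        double norm = 0;
        for (int i = 0; i < dim; i++) {
            q[i] = RNG.nextGaussian(); // random direction via normalized Gaussian
            norm += q[i] * q[i];
        }
        norm = Math.sqrt(norm);
        double target = 100 * datasetDiameter; // e.g., a broken-sensor reading
        for (int i = 0; i < dim; i++) q[i] *= target / norm;
        return q;
    }
}
```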
For our-sliding, we set $\delta = 2/3$, $\beta = 0.5$, and $\lambda = 0.5$; moreover, we set $d_{\min} = 0.01$ and $d_{\max} = 10^4$, which are conservative lower and upper estimates of the clustering radii for all windows and all datasets. The implementations of charikar and samp-charikar execute, for a window $W$, a search for the minimum $\rho$ such that outliersCluster$(W, k, \rho, 0)$ with unit weights ends with at most $z$ uncovered points. For samp-charikar, outliersCluster has been modified so that each new center is selected from a set of random window points of expected size 1000, as sketched below.
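The sampling modification admits a very simple form; the Bernoulli scheme below is an assumption consistent with the "expected size 1000" requirement above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class CandidateSampler {
    // Instead of evaluating every window point as a potential center, keep
    // each point as a candidate independently with probability q, chosen so
    // that the expected number of candidates equals expectedSize.
    static <P> List<P> sampleCandidates(List<P> window, int expectedSize, Random rng) {
        double q = Math.min(1.0, expectedSize / (double) window.size());
        List<P> candidates = new ArrayList<>();
        for (P p : window)
            if (rng.nextDouble() < q) candidates.add(p);
        return candidates;
    }
}
```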
Table 1 and Table 2 detail the full results of the experiments on $k$-center clustering with $z$ outliers. All quantities are provided as one-sigma confidence intervals computed over 10 windows, sampled every $10^4$ time steps after the first $N$ insertions. Starred results are based on a single sample, due to the excessively high running time. Table 1 reports the average ratio between the clustering radius obtained by each tested algorithm and the one obtained by our-sliding, as well as the average number of floats maintained in memory by the algorithms. (All radii have been computed with respect to the entire window, excluding the $z$ largest distances from the centers.) As shown in the table, our-sliding is always within a few percentage points of the radius of the solution of charikar (which did not finish in reasonable time for $N = 10^6$). On the other hand, the quality of the solution of samp-charikar degrades as the window size grows, since the fraction of sampled center candidates decreases. Moreover, as expected, gon yields poorer solutions, especially in the presence of true outliers, as these are mistakenly selected as centers instead of being disregarded; hence, gon is not considered in the subsequent experiments. Figure 1 plots the memory usage (in floats) of the algorithms for Higgs, confirming that the working memory required by our-sliding grows sublinearly with $N$ and is much smaller than the memory required by charikar and samp-charikar, which is linear in $N$.
Table 2 reports update times (in milliseconds) and query times (in seconds). For our-sliding, the update time is the time required to process a newly arrived point, whereas the query time includes the time to extract the coreset and compute the final solution on it. For charikar and samp-charikar, the update time is virtually zero (a new point is simply appended to the window), whereas the query time is the time taken to extract the solution from the whole window. The running times reveal that, by virtue of the coreset-based approach, our-sliding features a query time much smaller than those of charikar and samp-charikar. The update times of our-sliding, although not negligible, are three orders of magnitude smaller than the query times. The experiments for $z = 50$ show that, whereas the working memory requirements and running times increase with $z$, for reasonably large values of $N$, our-sliding still exhibits a much lower memory footprint than charikar and still returns solutions of comparable quality.
We also tested the sensitivity of our algorithm's performance to the parameter $\lambda$, letting it range in $[0, 1]$. The results for Higgs are shown in Figure 2 and in Table 3 and Table 4: Table 3 reports the sensitivity of the clustering radius and of the memory requirements, whereas Table 4 reports the sensitivity of the update and query times. The experiments were run on Higgs and Cover with $k = z = 10$, setting, as before, $\delta = 2/3$, $\beta = 0.5$, $d_{\min} = 0.01$, and $d_{\max} = 10^4$. For $\lambda$, we used the values 0, 0.1, 0.5, and 1. As shown in Figure 2, setting $\lambda = 0$ (i.e., maintaining the full histograms) leads to an unbearable increase in memory usage and, hence, in execution times. With approximate histograms (i.e., $\lambda > 0$), the memory usage decreases as $\lambda$ increases, with a significant drop already for $\lambda = 0.1$. While $\lambda = 0$ yields the solution with the best approximation, in our tests, we often obtained the same solution using $\lambda = 0.1$. Most importantly, the degradation of the clustering radius never exceeded 1%, even for $\lambda = 1$.
Overall, the experiments confirm that our-sliding achieves an accuracy comparable to that of the sequential algorithms at a fraction of their memory and time requirements.

5.2. Effective Diameter

We compared our algorithm for estimating the effective diameter, described in Section 4 (dubbed eff-sliding), against the following sequential baseline (dubbed eff-sequential). eff-sequential computes all $N^2$ distances in the window $W$ and, to avoid storing all of them, only keeps track of how many distances lie in each interval $[d_{\min}(1+\rho)^i, d_{\min}(1+\rho)^{i+1})$, for $i \ge 0$, by maintaining the appropriate counters. We set $\rho = 0.01$, so that the error due to this discretization is minimal. After all distances have been computed, the algorithm sweeps the counters and returns the minimum value $d_{\min}(1+\rho)^i$ for which at least $\alpha|W|^2$ distances fall below that value. This same procedure, adapted to account for weights, is also used in eff-sliding to compute the solution on the weighted coreset.
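A minimal Java sketch of eff-sequential follows. It is illustrative rather than the benchmarked implementation, and the bucket count is an assumption that must be large enough to cover $d_{\max}/d_{\min}$ at resolution $\rho$ (with $d_{\min} = 0.01$, $d_{\max} = 10^4$, and $\rho = 0.01$, roughly 1400 buckets suffice).

```java
public class EffSequentialSketch {
    // Buckets all N^2 pairwise distances into geometric intervals
    // [dMin*(1+rho)^i, dMin*(1+rho)^(i+1)) and sweeps the counters, returning
    // the first boundary below which at least alpha*N^2 distances fall.
    static double effectiveDiameter(double[][] window, double alpha,
                                    double dMin, double rho) {
        int buckets = 4096; // assumption: dMin*(1+rho)^buckets >= dMax
        long[] counters = new long[buckets];
        int n = window.length;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0;
                for (int t = 0; t < window[i].length; t++) {
                    double d = window[i][t] - window[j][t];
                    s += d * d;
                }
                double dist = Math.sqrt(s);
                int b = dist <= dMin ? 0
                        : (int) Math.ceil(Math.log(dist / dMin) / Math.log(1 + rho));
                counters[Math.min(b, buckets - 1)]++;
            }
        long threshold = (long) Math.ceil(alpha * (double) n * n);
        long cum = 0;
        for (int i = 0; i < buckets; i++) {
            cum += counters[i]; // cum = # distances <= dMin*(1+rho)^i
            if (cum >= threshold) return dMin * Math.pow(1 + rho, i);
        }
        return dMin * Math.pow(1 + rho, buckets - 1);
    }
}
```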
We experimented on the Higgs-eff dataset, another artificially inflated version of the Higgs dataset, where a true outlier (i.e., a random point whose norm is 100 times the diameter of the original dataset) is injected, on average, every 1000 points. In the experiment, we set $\alpha = 0.9$. Since, for every tested window size $N$, $\alpha N^2 \le (N - N/1000)^2$, we expect, in this controlled experiment, the $\alpha$-effective diameter of each window of Higgs-eff to be close to the diameter of the non-outlier points in the window. For our algorithm, we set $\varepsilon = 5/3$ and $\lambda = \beta = 0.5$. Moreover, we set $d_{\min} = 0.01$ and $d_{\max} = 10^4$. Finally, we set $\eta = 1/1000$, which is a very conservative lower bound on the ratio between the effective diameter and the diameter of the dataset for any window $W$.
The results of these experiments are reported in Table 5 and Table 6. Table 5 reports the ratio between the effective diameter computed by eff-sequential and the (conservative) upper estimate $\tilde{\Delta}_{T,W}^{\alpha}$ computed by eff-sliding, as well as the average number of floats maintained in memory by the algorithms. As shown in the table, the solution returned by eff-sliding is almost indistinguishable from the one returned by eff-sequential for those window sizes for which eff-sequential, whose complexity grows quadratically, could be executed within reasonable time. On the other hand, the memory usage of eff-sliding grows very slowly with $N$ and thus, for large enough window sizes, becomes lower than that of eff-sequential. Table 6 reports the update and query times of the two algorithms. Thanks to the reduced coreset size, the query times of eff-sliding are orders of magnitude lower than those of eff-sequential, and the update times of eff-sliding, although not negligible, are significantly smaller than the query times.
Finally, we tested, on tailor-made artificial datasets, the impact of the parameter $\eta$, the (possibly crude) lower bound on the ratio between the effective diameter and the diameter. The theoretical bound on the coreset size embodies a factor proportional to $(1/\eta)^D$, which could lead to a severe deterioration of the performance indicators for low (i.e., conservative) values of $\eta$. In practice, however, for datasets where the discrepancy between the diameter and the effective diameter is caused by a few distant outliers (noisy points), the actual number of points maintained in the coreset should not really depend on $\eta$: the balls centered on the coreset points have radius $O(\varepsilon\eta\Delta_W) = O(\varepsilon\Delta_W^{\alpha})$, most of the points are contained in a ball of radius $O(\Delta_W^{\alpha})$, and only a few outliers lie at distance $O(\Delta_W)$. To test this intuition, we created artificial datasets by generating random points in a ball of unit radius, with a few outliers (one every 1000 points, on average) on the surface of a ball of radius $R$, for $R \in \{10, 100, 1000\}$. We set $\alpha = 0.9$, as before, and ran our algorithm with $\varepsilon = 10/3$ and $\lambda = 0.5$. Moreover, we set $\eta = 1/(2R)$, since this value lower bounds the ratio between the effective diameter and the diameter. We report the results for window size $N = 10^5$, since a similar pattern emerges for the other values of $N$. Indeed, as Table 7 shows, the memory usage is, in practice, almost constant across all values of $\eta$.
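For reference, the generator just described can be sketched as follows; the ambient dimension and the seed handling are assumptions, since the construction above specifies only the radii and the outlier rate.

```java
import java.util.Random;

public class BallStreamGenerator {
    // Emits points drawn uniformly at random from the unit ball, with an
    // outlier on the sphere of radius R injected once every 1000 points on
    // average (R in {10, 100, 1000} in the experiments above).
    static double[] nextPoint(int dim, double R, Random rng) {
        double[] p = new double[dim];
        double norm = 0;
        for (int i = 0; i < dim; i++) {
            p[i] = rng.nextGaussian(); // random direction via normalized Gaussian
            norm += p[i] * p[i];
        }
        norm = Math.sqrt(norm);
        boolean outlier = rng.nextDouble() < 1.0 / 1000;
        // uniform sampling in the ball requires the 1/dim power on the radius;
        // outliers sit exactly on the sphere of radius R
        double radius = outlier ? R : Math.pow(rng.nextDouble(), 1.0 / dim);
        for (int i = 0; i < dim; i++) p[i] *= radius / norm;
        return p;
    }
}
```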

6. Conclusions

In this paper, we have presented coreset-based streaming algorithms for the k-center problem with z outliers and for the estimation of the α -effective diameter under the sliding window setting. Our algorithms require a working memory that is considerably smaller than the window size, and, with respect to the state-of-the-art sequential algorithms executed on the entire window, they are up to orders of magnitude faster while achieving a comparable accuracy. The effectiveness of our approach has been confirmed by a set of proof-of-concept experiments on both real-world and synthetic datasets.
Based on the theoretical analysis conducted in the paper, the space and time required by our algorithms to attain a high accuracy seem to grow steeply (in fact, exponentially) with the doubling dimension of the stream. An interesting, yet challenging, research avenue is to investigate whether this steep dependence can be ameliorated by means of alternative techniques (e.g., the use of randomization). In addition, it would be interesting to incorporate some notion of local density in the clustering quality measure; for example, by using a robust distance measure, such as the one proposed in [29].

Author Contributions

Conceptualization: P.P., A.P. and G.P.; formal analysis and investigation: P.P., A.P. and G.P.; writing—original draft preparation: P.P.; writing—review and editing: A.P. and G.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by MIUR, the Italian Ministry of Education, University and Research, under PRIN Project n. 20174LF3T8 AHeAD (Efficient Algorithms for HArnessing Networked Data), and grant L. 232 (Dipartimenti di Eccellenza), and by the University of Padova under project “SID 2020: Resource-Allocation Tradeoffs for Dynamic and Extreme Data”.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code and the datasets used in our experiments are provided on GitHub at github.com/PaoloPellizzoni/OutliersSlidingWindows, (accessed on 10 January 2022).

Acknowledgments

The authors wish to thank the anonymous referees for their constructive criticism that helped to improve the quality of the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Henzinger, M.; Raghavan, P.; Rajagopalan, S. Computing on Data Streams. In Proceedings of the DIMACS Workshop on External Memory Algorithms, New Brunswick, NJ, USA, 20–22 May 1998; pp. 107–118.
2. Datar, M.; Motwani, R. The Sliding-Window Computation Model and Results. In Data Stream Management; Garofalakis, M., Gehrke, J., Rastogi, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2016; pp. 149–165.
3. Braverman, V. Sliding Window Algorithms. In Encyclopedia of Algorithms; Cao, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2016; pp. 2006–2011.
4. Snyder, L. Introduction to facility location. In Wiley Encyclopedia of Operations Research and Management Science; Wiley: Hoboken, NJ, USA, 2011.
5. Hennig, C.; Meila, M.; Murtagh, F.; Rocci, R. Handbook of Cluster Analysis; CRC Press: Boca Raton, FL, USA, 2015.
6. Bateni, M.; Esfandiari, H.; Fischer, M.; Mirrokni, V. Extreme k-Center Clustering. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 2–9 February 2021; pp. 3941–3949.
7. Charikar, M.; Khuller, S.; Mount, D.; Narasimhan, G. Algorithms for Facility Location Problems with Outliers. In Proceedings of the 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington, DC, USA, 7–9 January 2001; pp. 642–651.
8. Malkomes, G.; Kusner, M.; Chen, W.; Weinberger, K.; Moseley, B. Fast Distributed k-Center Clustering with Outliers on Massive Data. In Proceedings of the 28th Conference on Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 1063–1071.
9. Ceccarello, M.; Pietracaprina, A.; Pucci, G. Solving k-center Clustering (with Outliers) in MapReduce and Streaming, almost as Accurately as Sequentially. arXiv 2018, arXiv:1802.09205.
10. Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
11. Altieri, F.; Pietracaprina, A.; Pucci, G.; Vandin, F. Scalable Distributed Approximation of Internal Measures for Clustering Evaluation. In Proceedings of the SIAM International Conference on Data Mining (SDM), Online, 29 April–1 May 2021; pp. 648–656.
12. Gonzalez, T.F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 1985, 38, 293–306.
13. Hochbaum, D.; Shmoys, D. A Best Possible Heuristic for the k-Center Problem. Math. Oper. Res. 1985, 10, 180–184.
14. Chan, T.H.H.; Guerqin, A.; Sozio, M. Fully Dynamic k-Center Clustering. In Proceedings of the World Wide Web Conference (WWW), Lyon, France, 23–27 April 2018; pp. 579–587.
15. Harris, D.; Pensyl, T.; Srinivasan, A.; Trinh, K. A Lottery Model for Center-Type Problems with Outliers. ACM Trans. Algorithms 2019, 15, 36.
16. Chakrabarty, D.; Goyal, P.; Krishnaswamy, R. The Non-Uniform k-Center Problem. ACM Trans. Algorithms 2020, 16, 46.
17. Ding, H.; Yu, H.; Wang, Z. Greedy Strategy Works for k-Center Clustering with Outliers and Coreset Construction. In Proceedings of the 27th Annual European Symposium on Algorithms (ESA), Munich/Garching, Germany, 9–11 September 2019; pp. 40:1–40:16.
18. McCutchen, R.; Khuller, S. Streaming Algorithms for k-Center Clustering with Outliers and with Anonymity. In Approximation, Randomization and Combinatorial Optimization. Algorithms and Techniques; Springer: Berlin/Heidelberg, Germany, 2008; pp. 165–178.
19. Feldmann, A.; Marx, D. The Parameterized Hardness of the k-Center Problem in Transportation Networks. Algorithmica 2020, 82, 1989–2005.
20. Cohen-Addad, V.; Feldmann, A.; Saulpic, D. Near-Linear Time Approximation Schemes for Clustering in Doubling Metrics. J. ACM 2021, 68, 1–34.
21. Cohen-Addad, V.; Schwiegelshohn, C.; Sohler, C. Diameter and k-Center in Sliding Windows. In Proceedings of the 43rd International Colloquium on Automata, Languages and Programming (ICALP), Rome, Italy, 11–15 July 2016; pp. 19:1–19:12.
22. Pellizzoni, P.; Pietracaprina, A.; Pucci, G. Dimensionality-adaptive k-center in sliding windows. In Proceedings of the 7th IEEE International Conference on Data Science and Advanced Analytics (DSAA), Sydney, Australia, 6–9 October 2020; pp. 197–206.
23. de Berg, M.; Monemizadeh, M.; Zhong, Y. k-Center Clustering with Outliers in the Sliding-Window Model. In Proceedings of the 29th Annual European Symposium on Algorithms (ESA), Lisbon, Portugal, 29–30 September 2021; pp. 13:1–13:13.
24. Braverman, V.; Lang, H.; Levin, K.; Monemizadeh, M. Clustering Problems on Sliding Windows. In Proceedings of the 27th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), Arlington, VA, USA, 10–12 January 2016; pp. 1374–1390.
25. Borassi, M.; Epasto, A.; Lattanzi, S.; Vassilvitskii, S.; Zadimoghaddam, M. Sliding Window Algorithms for k-Clustering Problems. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Online, 6–12 December 2020.
26. Palmer, C.; Gibbons, P.; Faloutsos, C. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Edmonton, AB, Canada, 23–26 July 2002; pp. 81–90.
27. Braverman, V.; Ostrovsky, R. Smooth Histograms for Sliding Windows. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), Providence, RI, USA, 20–23 October 2007; pp. 283–293.
28. Gottlieb, L.A.; Kontorovich, A.; Krauthgamer, R. Efficient Classification for Metric Data. IEEE Trans. Inf. Theory 2014, 60, 5750–5759.
29. Hu, L.; Zhong, C. An Internal Validity Index Based on Density-Involved Distance. IEEE Access 2019, 7, 40038–40051.
Figure 1. Working memory (Higgs, z = 10).
Figure 2. our-sliding: sensitivity to λ (Higgs, z = 10).
Table 1. Comparison between clustering radii and between working memory requirements. Obj. ratio is the ratio between each algorithm's clustering radius and that of our-sliding; memory is in ×10^6 floats; columns refer to window sizes N = 10^4, 10^5, 10^6. "–" marks runs that did not complete in reasonable time, "*" marks single-sample results, and empty cells correspond to window sizes exceeding the dataset size.

| Dataset | Algorithm | Obj. ratio 10^4 | 10^5 | 10^6 | Memory 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs (z = 10) | our-sliding | 1 ± 0 | 1 ± 0 | 1 ± 0 | 0.13 ± 0.01 | 0.27 ± 0.02 | 0.42 ± 0.02 |
| | charikar | 1.02 ± 0.05 | 0.99 ± 0.04 | – | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | samp-charikar | 1.07 ± 0.05 | 1.43 ± 0.23 | 2.74 ± 0.7 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | gon | 1.18 ± 0.12 | 1.17 ± 0.08 | 1.05 ± 0.03 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| Higgs (z = 50) | our-sliding | 1 ± 0 | 1 ± 0 | 1 ± 0 | 0.26 ± 0.01 | 0.65 ± 0.03 | 1.25 ± 0.02 |
| | charikar | 1.02 ± 0.04 | – | – | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | samp-charikar | 1.04 ± 0.06 | 1.14 ± 0.08 | 1.59 ± 0.15 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | gon | 1.54 ± 0.19 | 1.5 ± 0.15 | 1.19 ± 0.06 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| Cover (z = 10) | our-sliding | 1 ± 0 | 1 ± 0 | | 0.72 ± 0.18 | 2.06 ± 0.13 | |
| | charikar | 0.97 ± 0.17 | 1.02 * | | 0.55 ± 0 | 5.5 ± 0 | |
| | samp-charikar | 0.96 ± 0.17 | 0.98 ± 0.18 | | 0.55 ± 0 | 5.5 ± 0 | |
| | gon | 1.1 ± 0.17 | 0.94 ± 0.15 | | 0.55 ± 0 | 5.5 ± 0 | |
| Cover (z = 50) | our-sliding | 1 ± 0 | 1 ± 0 | | 1.83 ± 0.22 | 5.38 ± 0.22 | |
| | charikar | 0.98 ± 0.09 | – | | 0.55 ± 0 | 5.5 ± 0 | |
| | samp-charikar | 0.99 ± 0.1 | 1.04 ± 0.2 | | 0.55 ± 0 | 5.5 ± 0 | |
| | gon | 1.13 ± 0.11 | 1.02 ± 0.17 | | 0.55 ± 0 | 5.5 ± 0 | |
| Higgs+ (z = 10) | our-sliding | 1 ± 0 | 1 ± 0 | 1 ± 0 | 0.12 ± 0.01 | 0.26 ± 0.02 | 0.43 ± 0.02 |
| | charikar | 0.98 ± 0.18 | 0.97 ± 0.04 | – | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | samp-charikar | 1.09 ± 0.17 | 1.61 ± 0.2 | 3.03 ± 0.51 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| | gon | 1.63 ± 0.29 | 1.3 ± 0.16 | 1.4 ± 0.02 | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
| Cover+ (z = 10) | our-sliding | 1 ± 0 | 1 ± 0 | | 0.65 ± 0.17 | 1.93 ± 0.21 | |
| | charikar | 0.99 ± 0.15 | 1.02 * | | 0.55 ± 0 | 5.5 ± 0 | |
| | samp-charikar | 0.97 ± 0.14 | 0.96 ± 0.15 | | 0.55 ± 0 | 5.5 ± 0 | |
| | gon | 3.07 ± 2.02 | 1.44 ± 0.22 | | 0.55 ± 0 | 5.5 ± 0 | |
Table 2. Comparison between update times and between query times. Update times are in ms, query times in s; columns refer to window sizes N = 10^4, 10^5, 10^6. "–" marks runs that did not complete in reasonable time, "*" marks single-sample results, and empty cells correspond to window sizes exceeding the dataset size.

| Dataset | Algorithm | Update 10^4 | 10^5 | 10^6 | Query 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs (z = 10) | our-sliding | 0.66 ± 0.17 | 1 ± 0.44 | 1.98 ± 0.7 | 2.52 ± 1.22 | 5.5 ± 0.87 | 8.2 ± 0.42 |
| | charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 24.88 ± 2.52 | 3080.8 ± 160.8 | – |
| | samp-charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 2.51 ± 0.26 | 31.5 ± 1.34 | 297.8 ± 30.9 |
| Higgs (z = 50) | our-sliding | 1.24 ± 0.66 | 4.03 ± 0.53 | 5.39 ± 1.14 | 10.51 ± 0.76 | 93.22 ± 24.76 | 75.99 ± 10.38 |
| | charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 27.97 ± 2.68 | – | – |
| | samp-charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 2.87 ± 0.24 | 47.95 ± 2.11 | 384.4 ± 90.92 |
| Cover (z = 10) | our-sliding | 1.05 ± 0.46 | 3.74 ± 0.57 | | 12.85 ± 2.59 | 87.82 ± 46.08 | |
| | charikar | 0 ± 0 | 0 ± 0 | | 314.93 ± 55.21 | 5·10^4 * | |
| | samp-charikar | 0 ± 0 | 0 ± 0 | | 32.1 ± 5.25 | 400.81 ± 66.89 | |
| Cover (z = 50) | our-sliding | 3.31 ± 0.8 | 11.22 ± 1.28 | | 44.06 ± 6.88 | 892.82 ± 654.3 | |
| | charikar | 0 ± 0 | 0 ± 0 | | 310.87 ± 43.6 | – | |
| | samp-charikar | 0 ± 0 | 0 ± 0 | | 31.06 ± 3.05 | 431.18 ± 28.42 | |
| Higgs+ (z = 10) | our-sliding | 0.65 ± 0.26 | 1.22 ± 0.37 | 1.89 ± 0.42 | 0.92 ± 0.93 | 5.28 ± 1.82 | 8.37 ± 0.31 |
| | charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 27.53 ± 3.53 | 3224.2 ± 252.4 | – |
| | samp-charikar | 0 ± 0 | 0 ± 0 | 0 ± 0 | 2.7 ± 0.22 | 35.34 ± 3.14 | 328.1 ± 36.08 |
| Cover+ (z = 10) | our-sliding | 1.21 ± 0.5 | 3.14 ± 0.6 | | 7.08 ± 6.54 | 51.22 ± 39.68 | |
| | charikar | 0 ± 0 | 0 ± 0 | | 402.32 ± 71.95 | 4·10^4 * | |
| | samp-charikar | 0 ± 0 | 0 ± 0 | | 41.47 ± 9.44 | 418.03 ± 47.58 | |
Table 3. Sensitivity of our-sliding to λ: clustering radii and working memory requirements. Memory is in ×10^6 floats; columns refer to window sizes N = 10^4, 10^5, 10^6; "–" marks runs that did not complete, and empty cells correspond to window sizes exceeding the dataset size.

| Dataset | λ | Radius 10^4 | 10^5 | 10^6 | Memory 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs (z = 10) | 0 | 2.826 ± 0.178 | 4.289 ± 0.162 | – | 0.53 ± 0.04 | 4.85 ± 0.96 | – |
| | 0.1 | 2.826 ± 0.178 | 4.289 ± 0.162 | 5.981 ± 0.124 | 0.15 ± 0.01 | 0.36 ± 0.03 | 0.6 ± 0.02 |
| | 0.5 | 2.808 ± 0.195 | 4.297 ± 0.165 | 6.031 ± 0.027 | 0.13 ± 0.01 | 0.27 ± 0.02 | 0.42 ± 0 |
| | 1 | 2.812 ± 0.178 | 4.281 ± 0.176 | 6.031 ± 0.021 | 0.12 ± 0.01 | 0.24 ± 0.02 | 0.37 ± 0 |
| Cover (z = 10) | 0 | 1177.56 ± 400.03 | 1976.02 ± 150.66 | | 0.75 ± 0.17 | 2.3 ± 0.15 | |
| | 0.1 | 1177.56 ± 400.03 | 1973.31 ± 149.53 | | 0.74 ± 0.17 | 2.23 ± 0.14 | |
| | 0.5 | 1177.56 ± 400.03 | 1954.08 ± 145.71 | | 0.72 ± 0.18 | 2.06 ± 0.13 | |
| | 1 | 1180.44 ± 405.86 | 1967.02 ± 150.61 | | 0.71 ± 0.18 | 2 ± 0.12 | |
Table 4. Sensitivity of our-sliding to λ: update and query times. Update times are in ms, query times in s; columns refer to window sizes N = 10^4, 10^5, 10^6; "–" marks runs that did not complete, and empty cells correspond to window sizes exceeding the dataset size.

| Dataset | λ | Update 10^4 | 10^5 | 10^6 | Query 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs (z = 10) | 0 | 3.49 ± 0.69 | 35.18 ± 21.11 | – | 2.49 ± 1.16 | 5.91 ± 0.74 | – |
| | 0.1 | 0.72 ± 0.38 | 1.56 ± 0.53 | 2.58 ± 0.41 | 2.47 ± 1.18 | 5.88 ± 0.89 | 8.88 ± 0.47 |
| | 0.2 | 0.57 ± 0.12 | 1.39 ± 0.42 | 2.06 ± 0.67 | 2.46 ± 1.19 | 5.82 ± 0.91 | 7.74 ± 0.25 |
| | 1 | 0.57 ± 0.1 | 1.32 ± 0.36 | 1.99 ± 0.32 | 2.47 ± 1.2 | 5.81 ± 0.92 | 7.77 ± 0.15 |
| Cover (z = 10) | 0 | 1.12 ± 0.38 | 3.75 ± 0.32 | | 12.44 ± 2.38 | 90.65 ± 51.26 | |
| | 0.1 | 1.33 ± 0.69 | 3.72 ± 0.53 | | 12.31 ± 1.92 | 90.36 ± 51.44 | |
| | 0.2 | 1.27 ± 0.5 | 3.57 ± 0.25 | | 12.41 ± 2.19 | 90 ± 51.14 | |
| | 1 | 1.21 ± 0.5 | 4.29 ± 2.28 | | 12.32 ± 2.17 | 90.34 ± 50.61 | |
Table 5. Effective diameter: comparison between estimates and between working memory requirements. Memory is in ×10^6 floats; columns refer to window sizes N = 10^4, 10^5, 10^6; "–" marks runs that did not complete in reasonable time.

| Dataset | Algorithm | Diameter ratio 10^4 | 10^5 | 10^6 | Memory 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs-eff | eff-sliding | 1 ± 0 | 1 ± 0 | 1 ± 0 | 0.52 ± 0.36 | 0.41 ± 0.23 | 0.53 ± 0.1 |
| | eff-sequential | 0.991 ± 0.019 | 0.992 ± 0.009 | – | 0.07 ± 0 | 0.7 ± 0 | 7 ± 0 |
Table 6. Effective diameter: comparison between update times and between query times. Update times are in ms, query times in s; columns refer to window sizes N = 10^4, 10^5, 10^6; "–" marks runs that did not complete in reasonable time.

| Dataset | Algorithm | Update 10^4 | 10^5 | 10^6 | Query 10^4 | 10^5 | 10^6 |
|---|---|---|---|---|---|---|---|
| Higgs-eff | eff-sliding | 3.63 ± 3.76 | 2.53 ± 1.57 | 3.33 ± 1.16 | 0.05 ± 0.01 | 0.77 ± 0.16 | 7.41 ± 0.57 |
| | eff-sequential | 0 ± 0 | 0 ± 0 | 0 ± 0 | 9.26 ± 1.23 | 994.81 ± 38.63 | – |
Table 7. Effective diameter: sensitivity to η (window size N = 10^5; memory in ×10^6 floats).

| Dataset | Eff. diameter | Memory | Update time (ms) | Query time (s) |
|---|---|---|---|---|
| R = 10 | 1.175 ± 0 | 0.22 ± 0.05 | 0.91 ± 0.31 | 1.69 ± 0.5 |
| R = 100 | 1.175 ± 0.006 | 0.24 ± 0.11 | 1.79 ± 1.01 | 0.84 ± 0.25 |
| R = 1000 | 1.178 ± 0.006 | 0.35 ± 0.14 | 1.9 ± 1.39 | 4.2 ± 1.34 |
