ARTICLE
TITLE

Efficient Group K Nearest-Neighbor Spatial Query Processing in Apache Spark

Panagiotis Moutafis    
George Mavrommatis    
Michael Vassilakopoulos and Antonio Corral    

Abstract

Designing and implementing new distributed algorithms for spatial query processing in distributed computing systems is a current challenge. Apache Spark is a memory-based framework suitable for real-time and batch processing. Spark-based systems allow users to work on distributed in-memory data without worrying about the data distribution mechanism or fault tolerance. Given two datasets of points (called Query and Training), the group K nearest-neighbor (GKNN) query retrieves the K points of the Training dataset with the smallest sum of distances to every point of the Query dataset. This spatial query has been actively studied in centralized environments, where several performance-improving techniques and pruning heuristics have been proposed, and our team recently proposed a distributed algorithm for it in Apache Hadoop. Since Apache Hadoop generally exhibits lower performance than Spark, in this paper we present the first distributed GKNN query algorithm in Apache Spark and compare it against the Apache Hadoop one. The algorithm incorporates programming features and facilities specific to Apache Spark, together with performance-improving techniques applicable in Spark. The results of an extensive set of experiments with real-world spatial datasets are presented, demonstrating that our Apache Spark GKNN solution, with its improvements, is efficient and clearly outperforms processing this query in Apache Hadoop.
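
To make the GKNN definition concrete, the following is a minimal, single-machine brute-force sketch in Python. It is not the distributed Spark algorithm of the paper and includes none of its pruning heuristics; it only illustrates what the query returns. The function name `gknn_brute_force`, the use of Euclidean distance, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def gknn_brute_force(query, training, k):
    """Return the k Training points with the smallest sum of distances
    to every Query point (Euclidean distance is an assumption here)."""

    def sum_of_distances(point):
        # Sum of distances from one Training point to all Query points.
        return sum(math.dist(point, q) for q in query)

    # Scan every Training point and keep the k with the smallest sums.
    return heapq.nsmallest(k, training, key=sum_of_distances)


# Hypothetical sample data: two Query points and four Training points.
query = [(0.0, 0.0), (2.0, 0.0)]
training = [(1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.0, 3.0)]

result = gknn_brute_force(query, training, 2)
print(result)  # [(1.0, 0.0), (1.0, 1.0)]
```

This sketch computes one distance for every pair of a Query point and a Training point, which is the kind of cost that pruning heuristics and distributed processing aim to reduce on large datasets.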
