A Framework for Executing Complex Querying for Relational and NoSQL Databases (CQNS)

— The rapid growth of data on the web poses major challenges. The volume of stored data and the ability to query diverse data sources have become essential features of big data systems. A large number of platforms are used to handle NoSQL database models, such as Spark, H2O and Hadoop HDFS/MapReduce, which are suitable for controlling and managing large volumes of big data. Heterogeneous data stores impose difficult tasks on application developers, who must interact with mixed data models through different APIs and query languages. In this paper, a Complex Querying framework for relational and NoSQL databases (CQNS) is proposed that acts as an interpreter, sending complex queries received over any data store to its corresponding execution engine. The proposed framework supports application queries and database transformation at the same time, which in turn speeds up the process. Moreover, CQNS handles several NoSQL databases, such as MongoDB and Cassandra. This paper provides a Spark-based framework that can handle both SQL and NoSQL databases. This work also examines the importance of MongoDB sharding and chunk composition. For the Cassandra database, two types of partitioning are considered: vertex and edge partitioning. Four benchmark dataset scenarios are used to evaluate the proposed CQNS when querying the various NoSQL databases in terms of optimization performance and query execution time. The results show that, among the compared systems, CQNS achieves optimal latency and throughput in less time.

I. INTRODUCTION
Published on September 22, 2020. Eman A. Khashan, Mansoura University, Egypt. Ali I. El-Desouky, Mansoura University, Egypt.
The popularity of NoSQL systems stems from their efficiency in handling unstructured data and their support for flexible design schemes, which give system users great flexibility and scalability. This paper considers a relational database and several categories of NoSQL databases with different structural features: key-value, graph, column and document databases. Each NoSQL database has its own query language and does not support the syntax of other systems. The main problem that many studies have focused on is that there is no standard way of expressing, executing and optimizing complex queries across NoSQL databases [1], [4]. Currently, data stores expose several diversified APIs, and programmers of applications based on multiple data stores must be familiar with these APIs while coding those applications. As a result of the variety of and changes in the data models of various databases, there is no standard way to implement queries over various NoSQL data stores, owing to the lack of a unified access model for diversified data stores. Programmers must therefore cope with the execution of these queries, which are hard to optimize. Moreover, optimization must take certain criteria into consideration, such as data transformation and movement costs, which can be expensive for big data. Figure 1 shows a diagram integrating heterogeneous relational and NoSQL datasets into an example scientific social network. All of these reasons encourage interoperability between two or more varied and powerful frameworks. This paper focuses on MongoDB and Cassandra [37], which are among the most popular stores used to help companies make business decisions. Several researchers and developers have focused on this problem. The variety of relational and NoSQL data models (relational, key-value, ordered key-value, document, semi-structured and graph databases) and query languages (SQL, Cassandra Query Language (CQL), MapReduce querying languages, etc.) is the main difficulty. Salami et al. [1] identify a common data model and use an algebra to address complex declarative queries. In this technique, queries are handled over multiple data stores through so-called VDS (virtual data stores). The optimization stage is carried out by a two-step broker. First, selection and projection operations are pushed down to the local data stores, which reduces the amount of data exchanged. Second, an optimal distributed plan is built with a dynamic programming method. The distributed plan seeks to reduce I/O and CPU costs as well as data loading and conversion costs. However, this technique is limited to rewriting an ODBAPI query and supports only some query operators. Another method, called DIMS, was developed by P. Sangat et al. [2]. In DIMS, most data generated by ubiquitous sensing applications have the character of time series, such as power station monitoring data, while other data exhibit patterns of interrelationship, for instance the correlation between patients, diseases and symptoms.
Further, high sampling frequency and high data generation rates are also characteristic. To satisfy these various requirements, a data storage system should have several abilities, such as supporting different schemas and profiles for different applications. Song et al. [15] present the design, implementation and evaluation of Haery, a column-oriented store dedicated to big data. Haery is built on Hadoop HDFS and a distributed computing framework relying on MapReduce. Haery's loading and query performance results are among the most stable and effective, but its time cost grows as the data volume increases. Haery proposes the following models and algorithms: Key-Cube, an improved Z-order-based linearization algorithm and an address tree; Accumulation, a key-cube expansion approach; query algorithms to execute queries on key-cubes and physical storage; and the system architecture, components and implementation of Haery. The rest of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the proposed CQNS framework, which has three layers. Section 4 discusses the implementation and evaluation of CQNS. Section 5 provides conclusions and future work.

II. RELATED WORK
Several researchers and developers have focused on the problem of querying NoSQL databases [4], [5], [20]. The variety of relational and NoSQL data models (relational, key-value, ordered key-value, document, semi-structured and graph databases) and query languages (SQL, Cassandra Query Language (CQL), MapReduce querying languages, etc.) is the main difficulty. G. Baruffa et al. [3] described a spectrum sensing platform that provides a service allowing end users to easily access and process wireless spectrum data; to reduce the latency of the services provided by the platform, they adjust the data processing chain. They considered the MongoDB and Cassandra databases but did not consider other databases. Khan et al. [4] and Duggan et al. [5] offer frameworks called PolyWeb and BigDAWG, respectively. PolyWeb and BigDAWG retain data sources in their native format, that is, without serializing them into a common data format. In PolyWeb, SPARQL queries are translated into the original query language of these sources. PolyWeb indexes each data source for query prediction and creates left-deep plans. Despite their efficiency, these methods are not able to exploit knowledge about the main features of the integrated data sources and produce custom query plans for selected sources to collect data from the data lake. In contrast, the QODM [6] approach produces a distinct schema using the data model and data schema of an application for NoSQL databases; this approach does not prevent programmers from using any NoSQL database. Document and relational data stores are integrated in a hybrid mediation approach proposed by Roijackers et al. [7]; however, this approach does not consider other NoSQL data stores. Sharma et al. [8] studied the performance of an RDBMS, a document-based NoSQL database (MongoDB) and a graph-based NoSQL database (Neo4j). They obtained unexpected results in the case of Neo4j, as it took the longest time compared with MongoDB and PostgreSQL.
However, only a few NoSQL databases are supported by these frameworks, so the programmer has to design the data models of an application and choose a proper data mapping strategy. CQNS is a proposed framework for optimizing and estimating complex queries over relational databases and several types of NoSQL data stores. For this purpose, a unified data model is proposed that uses a suitable environment, such as Apache Spark with MongoDB [18], [24], to optimize the data ingestion process. The CQNS framework routes each query received over any dataset to the matching engine, using Hadoop/HDFS and Hadoop/MapReduce with parallel k-means clustering to process data without physical data transformation.

III. PROPOSED CQNS FRAMEWORK
This section introduces the proposed CQNS approach, which is capable of executing complex queries across heterogeneous data stores. This framework consists of three layers, as shown in Fig. 2: Matching Selector layer, processing layer, and query execution layer. In the following sections, this paper discusses the different layers of the proposed CQNS framework.

A. Matching Selector Layer
This layer receives any SQL or NoSQL query and matches the statements of the query given by the user against stored libraries that hold a number of statement patterns for each database type, whether SQL or NoSQL. The query is compared with the stored libraries to determine the required database engine. A set of libraries was prepared for each of the databases studied: SQL Server as an example of a relational database, and MongoDB and Cassandra as examples of NoSQL databases. This approach captures the common parts among all deployed data stores and delivers a unified model to the following layer of the framework. This model contains the particular operations of every database. Notably, the user has to add a particular implementation of a data store if he/she needs to integrate an extra database. The following figures illustrate the testing of query statements for the databases used: SQL Server (as an example of a relational database), and MongoDB and Cassandra (as examples of NoSQL databases). Algorithm 1 illustrates the method of discovering the database type of the query to be executed, based on the libraries stored in the application, in order to select the target database engine.
According to Algorithm 1, based on the results of matching the patterns against the input values, one of the following decisions is made: if the patterns match the SQL database, the application follows the SQL database path; if the patterns match the MongoDB database, the application follows the MongoDB path; and if the patterns match the Cassandra database, the Cassandra path is followed.
To apply the patterns to other databases, their corresponding libraries must be added.
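The dispatch step of Algorithm 1 can be sketched as follows. The paper does not list the exact contents of the stored libraries, so the regular expressions below are illustrative placeholders for each engine's pattern library, not the actual patterns used by CQNS:

```python
import re

# Hypothetical pattern libraries, one per engine. CQL is checked before
# generic SQL because CQL statements are a syntactic superset of SQL here.
PATTERNS = {
    "mongodb": re.compile(r"^\s*db\.\w+\.(find|insertOne|updateOne|aggregate)\s*\(", re.I),
    "cassandra": re.compile(
        r"^\s*(SELECT|INSERT|UPDATE)\b.*\b(USING TTL|ALLOW FILTERING|FROM\s+\w+\.\w+)",
        re.I | re.S),
    "sql": re.compile(r"^\s*(SELECT|INSERT|UPDATE|DELETE)\b.*\b(FROM|SET|INTO)\b",
                      re.I | re.S),
}

def match_engine(query: str) -> str:
    """Return the engine whose stored pattern library matches the query."""
    for engine in ("mongodb", "cassandra", "sql"):
        if PATTERNS[engine].search(query):
            return engine
    return "unknown"
```

For example, `match_engine("UPDATE customers SET position='USA' WHERE id=25")` follows the SQL path, while a `db.collection.find(...)` statement follows the MongoDB path.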

Update Operation

UPDATE customers SET position='USA' WHERE id=25

B. CQNS Processing Layer
Information on the technologies used and the setup environment for this experiment is provided briefly in this section. CQNS deploys Hadoop/HDFS [1] to store the incoming data.
Hadoop is an open-source distributed computing platform that mainly consists of the distributed computing framework MapReduce and the distributed file system HDFS [3]. MapReduce [9], [22] is a software platform for parallel processing of large-scale data. The MapReduce strategy is applied to the k-means clustering algorithm to cluster the data. The k-means algorithm [19] can be successfully parallelized on commodity hardware, and MapReduce can be utilized for k-means clustering; the results show that the clusters formed using MapReduce are similar to those produced by a sequential algorithm. Once HDFS receives data, it breaks the information down into separate blocks and distributes those blocks to different nodes in the cluster, thus enabling high-efficiency parallel processing. The data from HDFS are accessed by a Spark Streaming program for processing before being stored in MongoDB on the database server. Resilient distributed datasets (RDDs) are an abstraction introduced by Spark [13]. RDDs represent a read-only multiset of data objects partitioned across a group of machines that continues operating as designed despite node failures (fault tolerance). Spark offers general-purpose programming-language bindings and is used as an interactive way to handle big data [36] sets for clustering. A Complex Querying over NoSQL databases Algorithm (CQNSA) is proposed using MongoDB and the MongoDB Connector for Spark, an open-source NoSQL database designed for high scalability, effectiveness and availability. This CQNSA is shown in Algorithm 2.
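The MapReduce formulation of k-means described above can be sketched in a single process: the map phase assigns each point to its nearest centroid, and the reduce phase averages each group to recompute the centroid. This is a minimal stdlib-only simulation of the two phases, not the distributed Hadoop implementation used in CQNS:

```python
from collections import defaultdict

def kmeans_mapreduce(points, centroids, iterations=10):
    """Simulate MapReduce-style k-means on lists of numeric tuples."""
    for _ in range(iterations):
        # Map phase: emit (nearest-centroid-index, point) pairs.
        groups = defaultdict(list)
        for p in points:
            idx = min(range(len(centroids)),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            groups[idx].append(p)
        # Reduce phase: recompute each centroid as the mean of its group.
        for idx, pts in groups.items():
            centroids[idx] = tuple(sum(dim) / len(pts) for dim in zip(*pts))
    return centroids
```

As the text notes, the clusters produced this way match those of the sequential algorithm, since only the grouping of the assignment and averaging steps changes, not the arithmetic.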

C. Query Execution Layer
Instead of storing the data as tables with columns and rows, the data are stored as documents. Every document can contain numerical values, nested arrays or interrelated sub-documents. These documents are serialized as JSON objects and stored internally in MongoDB using the binary JSON encoding known as BSON; the data are partitioned and stored on several servers, called shard servers, for simultaneous access and effective read/write operations. MongoDB and Apache Spark are integrated seamlessly by this connector. MongoDB aggregation pipelines address the problem of assigning a set of objects to groups, called clusters, so that the objects within the same group are similar. Partitioning uses a cluster assignment function C : X → {1, 2, ..., k}, where X is a set of objects, the number of clusters k ∈ Z+, and a distance function d(x, x') ∈ R0+ is defined between all pairs of objects in X. The goal is to partition X into k disjoint sets x1, x2, ..., xk such that the sum of within-cluster pairwise distances, Σj Σx,x'∈xj d(x, x'), is minimized. With N = |X|, the number of distinct cluster assignments possible is given as follows [33]:
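The number of distinct ways to assign N objects to k non-empty clusters is conventionally the Stirling number of the second kind, S(N, k); assuming that is the quantity intended by [33], it can be computed with the standard recurrence S(n, k) = k·S(n-1, k) + S(n-1, k-1):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n distinct objects into k non-empty
    clusters (Stirling number of the second kind, via its recurrence)."""
    if k == 0:
        return 1 if n == 0 else 0
    if n == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)
```

For instance, 4 objects can be split into 2 non-empty clusters in 7 distinct ways, which illustrates why exhaustive search over assignments is infeasible for large N.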

D. MongoDB Engine
Sharding is a way to distribute data across multiple machines. MongoDB uses sharding to support deployments with very large datasets and high-throughput workloads. Database systems that host large datasets or high-throughput applications can challenge the capacity of a single server: for example, high query rates can exhaust the CPU capacity of the server, and working set sizes larger than the system's RAM stress the I/O capacity of the disk drives. A database can have a mixture of sharded and unsharded collections. Sharded collections are partitioned and distributed across the shards in a cluster, while unsharded collections are stored on a primary shard; each database has its own primary shard, as shown in Fig. 4. In addition to JSON schema validation, MongoDB supports validation with query filter expressions using query operators, with the exceptions of $near, $nearSphere, $text and $where. Figure 5 shows a JSON example of specifying validator rules using a query expression:
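A validator of this kind can be expressed as a plain query-filter document attached to a collection via MongoDB's `collMod` command. The sketch below only builds and prints the command document; the field names (`position`, `id`) and the allowed values are hypothetical, echoing the update example earlier in this section, and do not reproduce the exact rules of Figure 5:

```python
import json

# Hypothetical validator: a query-filter expression (no $near, $nearSphere,
# $text or $where, which MongoDB excludes from validators).
validator = {
    "$and": [
        {"position": {"$in": ["USA", "UK", "EG"]}},
        {"id": {"$gte": 1}},
    ]
}

# Shape of the collMod command a driver would send to attach the validator.
command = {"collMod": "customers", "validator": validator}
print(json.dumps(command, indent=2))
```

Documents that fail the filter (for example, a negative `id`) would then be rejected on insert or update, depending on the configured validation level.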

E. Cassandra Engine
The Apache Cassandra database offers linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure, and these attributes make it an ideal platform for important data. Cassandra's best-in-class replication across multiple data centers provides lower downtime for users and the peace of mind of knowing that regional outages can be survived. This paper considers two kinds of partitioning methods that can work with the Cassandra database, vertex partitioning and edge partitioning, and investigates both to show the differences in their results.

F. Data Partitioning
Cassandra divides the database into smaller, partially overlapping datasets that are stored locally on each node. Thus, unlike other NoSQL databases such as HBase, Cassandra does not require a shared file system (for example, HDFS). A hash function is used to distribute the primary record keys across the nodes. This is performed by dividing the range of the hash key into sub-domains called partitions (also called token ranges). Without replication (RF = 1), each node can be configured to store unique partitions locally. In this section, the necessary background is provided along with the targeted data and cost models. Table 1 contains the partition variables used in this paper. Formula (2), taken from [35], is used to calculate the size of data partitions.
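The token-range scheme described above can be sketched as follows. Cassandra itself uses the Murmur3 partitioner over a 64-bit ring; here MD5 and a small ring size stand in as assumptions so the sketch stays stdlib-only, and the node names are hypothetical:

```python
import hashlib

RING_SIZE = 2 ** 16  # illustrative; Cassandra's Murmur3 ring is 64-bit

def token(key: str) -> int:
    """Hash a partition key onto the token ring (MD5 as a stand-in)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING_SIZE

def owner(key: str, ranges):
    """ranges: list of (start, end, node) token sub-domains covering the
    ring. With RF = 1, exactly one node owns each key."""
    t = token(key)
    for start, end, node in ranges:
        if start <= t < end:
            return node
    raise ValueError("token ranges do not cover the ring")
```

Splitting the ring into two sub-domains, `[(0, 2**15, "n1"), (2**15, 2**16, "n2")]`, deterministically maps every key to exactly one of the two nodes.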

IV. CQNS EVALUATION
CQNS is utilized to store, manage and execute queries over big data and greatly facilitates the developer's task. In the proposed model, each query is rewritten into the particular query language of the integrated data store. The processing layer in CQNS converts results into a suitable format, such as JSON, before responding to system users; the overhead is therefore considered reasonable. Because of memory-management issues in the driver, there is a probability that the performance of CQNS will degrade beyond 50,000 entities. The results of the experiments testing MongoDB and Cassandra are shown in the following sections.

A. Datasets
To assess the suggested framework strategy for querying NoSQL databases, experimental tests were performed using four benchmark dataset scenarios over hybrid datasets. The datasets used in this paper can be downloaded from http://snap.stanford.edu/data/. All of the datasets are relevant to complex querying. They have a large number of features (thousands of rows), which makes them suitable for demonstrating the efficiency of the proposed strategy on high-dimensional data. The ego-Facebook, ego-Twitter and soc-LiveJournal1 datasets were stored across different types of database engines.

B. System Analysis Details
This section provides brief information on the techniques used and the preparation of the experimental environment. This study deploys Hadoop, HDFS, Apache Spark, MongoDB, the MongoDB Connector for Spark, the Cassandra DB Connector for Spark, Cassandra, and SQL Server in an environment with the following specifications.
3. Database Connector for Spark: this connector provides seamless integration between the matched database and Apache Spark. It effectively uses database aggregation pipelines and secondary indexes to extract, filter and process the subset of data required by the Spark job. Additionally, to maximize performance over large distributed datasets, it co-locates the RDDs with the source database nodes and reduces data transfer across the cluster.
4. Hardware: the Spark application server has 16 dedicated cores, 64 GB of memory, 459 GB of disk and 64-bit Ubuntu GNU/Linux. The MongoDB database server and each of the 4 shard servers have dedicated cores, 16 GB of memory and a 130 GB HDD, while each of the configuration servers contains 1 hard disk, 4 GB of memory and 30 GB of storage. The Cassandra database server has 16 GB of memory and 130 GB of disk, and Microsoft SQL Server 2017 has been installed with the same infrastructure specifications previously mentioned.

C. Cost Model
The cost of an execution plan is the sum of the costs of the operations that compose it. It should be noted that the cost does not directly represent time; of course, more cost generally means more time. The cost is used to compare two query execution plans, not to directly estimate response time. To evaluate the cost formula, the matrix multiplication is calculated between a row vector containing the coefficients α, β and γ and a column vector containing the values of the parameters defined in the catalogue, scaled by a fixed variable called const, which is a scalar and can be a cardinality, selectivity, etc. If a parameter does not apply to a specific metric (CPU cost, I/O cost or connection cost), the corresponding entry in the column vector takes an empty value. The matrix multiplication is calculated as in [1]. Performance estimation is an important point for a new framework, and this estimation is reflected in the outcomes of total cost, average time and ingestion rate. Table 2 displays sample catalogue variables. These outcomes are utilized to estimate the efficiency of the proposed framework and are calculated by the following equations:
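The cost evaluation just described reduces to a scaled dot product per operator, summed over the plan. A minimal sketch, assuming the coefficient vector is (α, β, γ) for CPU, I/O and connection costs and representing an inapplicable metric by `None` (the "empty value" in the text):

```python
def operation_cost(coeffs, params, const=1.0):
    """Cost of one operator: row vector (alpha, beta, gamma) times the
    column vector of catalogue parameters (CPU, I/O, connection), scaled
    by a constant such as a cardinality or selectivity. A None parameter
    (metric not applicable) contributes nothing."""
    return const * sum(c * p for c, p in zip(coeffs, params) if p is not None)

def plan_cost(operations):
    """Total plan cost is the sum of its operators' costs."""
    return sum(operation_cost(*op) for op in operations)
```

With coefficients (2, 6, 6), parameters (1, 1, None) and a cardinality constant of 10, the operator cost is 10 · (2 + 6) = 80; two plans are then compared by their `plan_cost` totals, not by predicted response time.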

D. CQNS Join Queries
The purpose of these experiments is to estimate the impact of, and validate, the proposed optimization process. Thus, various designs utilized to optimize the process are tested to ensure their efficiency and integration. The purpose is to contrast various strategies, measuring both the response time and the latency precisely and accurately. This is possible through experiments performed using two variants of join queries. In the first variant, every entity set joins at least one and at most two other entity sets, creating a linear shape, which is commonly used in many applications. In the second variant, all entity sets in a query join the same entity set, creating a star shape, which is widely used in data warehouse applications. The two query variants are executed under four different possibilities, which differ in whether the sub-queries are executed in parallel or sequentially and whether external join execution is enabled or disabled.
• Possibility 1: this possibility occurs when a join query runs sequentially while utilizing only the VDS, with the integrated data store used for push-down operations (the simplest possibility).
• Possibility 2: this possibility occurs when a join query runs sequentially utilizing both the VDS and the integrated data store for external joins and the push-down process.
• Possibility 3: this possibility occurs when a join query runs in parallel while utilizing only the VDS; in this case, push-down operations are parallelized in the integrated data store.
• Possibility 4: this possibility occurs when a join query runs in parallel while utilizing both the VDS and the integrated data store; the execution of the external joins and the push-down process is parallelized.
Furthermore, the experiments are performed with two datasets of different sizes: the first is medium-sized, while the second is large-sized, the sizes differing by a factor of one hundred. Ten entity sets of different sizes, ranging from a few megabytes up to hundreds of megabytes, are used in the experiments and stored on five different types of data stores. The experiments use four querying processes, such as linear and star joins, increasing the join count for both from three to ten. Moreover, the following constraints on the catalogue variables are considered. The variables a, b and c shown in Table 2 represent CPU timing, input/output timing and connection timing, respectively, assigned to the five data stores and the VDS.

Data store | a, b, c values | Cardinality | Init_Join | Join_No
D1 | 2 6 6 2 2 20 | EntityA = 4000000 | 5 | 0; 2
D2 | 2 8 8 2 2 20 | EntityB = 1000 | 20 | 4
D3 | 2 2 2 2 2 2 | EntityC = 3000 | 2 | 2
D5 | 2 6 6 2 2 2 | Entity = 10000 | 6 | 16
VDS | 2 2 2 2 2 8 | — | 10 | 2
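The two join shapes described above can be sketched as plan generators. The helper names are hypothetical; each function returns the list of pairwise joins implied by the shape for a given ordering of entity sets:

```python
def linear_join(entities):
    """Linear variant: each entity set joins the next, forming a chain of
    length n-1 (every set joins at most two others)."""
    return [(entities[i], entities[i + 1]) for i in range(len(entities) - 1)]

def star_join(entities):
    """Star variant: every entity set joins the same central entity set
    (the first one), as in data warehouse fact/dimension queries."""
    center, rest = entities[0], entities[1:]
    return [(center, e) for e in rest]
```

For three entity sets A, B, C, the linear variant yields the chain A–B, B–C, while the star variant with center F over A, B yields F–A, F–B; increasing the join count from three to ten, as in the experiments, simply lengthens the chain or widens the star.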

E. CQNS Linear Join Experiments
According to possibilities 1 to 4 and the number of joins, the overall time estimates from the time model are presented in Figures 7a and 7b. Figures 8a and 8b show the results for accessing very large data, as in a big data context; every cardinality is multiplied by 100. Based on these results, the main claims of this proposal are confirmed: an optimal execution plan is created regardless of the join query variant. The following conclusions are drawn from the results shown in Figures 7c, 7d, 8c and 8d. The assumption in this paper is that only data stores D1, D2 and D3 support join query execution; the integrated data store is further assumed to be able to store one or more entity sets, e.g., in data store D2. Note that the cardinality shown in the chart is the number of entities in the proposed entity set for a medium-sized dataset. Regarding the external joins, both possibility 2 and possibility 4 indicate the importance of utilizing the integrated data store for the executed join queries; a better aggregate cost is obtained in comparison with the possibilities executed using only the VDS. The gains of possibilities 1 and 2 and of possibilities 3 and 4 are averaged, giving an average gain of 92.23%. Regarding parallelism, the parallelized execution of external joins and push-down operations proved advantageous, yielding a significant gain compared with the two other possibilities (in which execution is sequential); compared with the other possibilities, the average gain is 86.67%.

F. CQNS Star Joins Experiments
When possibilities 3 and 4 are applied using star joins, the gain becomes large; this result is due to the parallel execution of all push-down operations at the same time. Regarding the integration of different optimization strategies: in possibility 4, the work of the integrated data store is maximized and queries are executed in parallel, yielding the best total cost compared with the baseline possibility 1; the average gain was found to be 99.98%.

G. CQNS Optimization Time in MongoDB
An adaptive schema is proposed to improve query execution performance and time, taking the size of the search area into consideration. Experiments that maximize the work of the integrated data store are shown in Figs. 9 and 10 and are very significant, particularly when this approach is used to process very large quantities of data in addition to parallelism. These experiments also demonstrate the significance of combining the previous layers to obtain cost optimization. As the number of records increases, the efficiency of MongoDB without sharding decreases compared with the sharded configuration, as shown in Fig. 9, and this influences the execution time. Comparing Spark Core and Spark SQL, the experiments found that the former differs from the latter because Spark SQL is loaded first, which increases the time, as shown in Fig. 10.

H. An Experiments on Cassandra DB
The latencies of read and write operations for vertex partitioning and edge partitioning, when varying the number of nodes, are compared in the same testing environment for possibilities 1, 2 and 3. The results for workloads A and C are shown in Figs. 11 and 12, respectively. From these two experiments, it is evident that, among the compared systems, the CQNS framework dealing with the Cassandra dataset with partitioning achieves the best latency and throughput values compared with the configuration without partitioning.

I. Performance and cost over CQNS and Haery framework
Since CQNS writes data to HDFS directly, Figure 13 shows the average loading speed for HBase, Cassandra, MongoDB, CQNS with Cassandra and CQNS with MongoDB on datasets of different sizes. In the loading experiment, CQNS's loading performance averaged 0.2, 10.2, 1.7 and 20.23 times higher than HBase, Cassandra, MongoDB and Haery, respectively. Cassandra with the CQNS framework achieved better results than Haery [15], but the results of Haery using the MongoDB database are relatively better than the CQNS results without sharding. On the other hand, when CQNS applied the sharding technique, the results obtained are better than those obtained from the MongoDB and Cassandra databases when using a large data size.

J. Results Analysis and Discussion
In this section, the core algorithms of CQNS and the results are evaluated and compared with some popular NoSQL and relational databases using the generated datasets and various query workloads. The experiments provided a comparison between CQNS and ODBAPI measuring the time cost as a function of the number of joins. The results obtained when using Hadoop and Spark reflect higher performance compared with recent algorithms.
The CQNS evaluation implemented two types of joins, linear join and star join, and the results obtained from linear joins proved to be better than the star join results. In addition, a comparison with the DIMS framework measuring the average time and ingestion time was implemented using two different databases, MongoDB and Cassandra. The CQNS results when using MongoDB with the sharding technique are better than those without sharding, especially when the data size is very large. For the Cassandra database, better results were obtained when using the partitioning technique than without it. This study also compared two types of partitioning in the Cassandra database; according to the CQNS experiments, edge partitioning obtained better results than vertex partitioning, especially for large data sizes. When the comparison was extended to the Haery framework, Cassandra with the CQNS framework achieved better results than Haery, but the results of Haery using the MongoDB database are relatively better than the CQNS results without sharding. On the other hand, when CQNS applied the sharding technique, the results obtained are better than those obtained from the MongoDB and Cassandra databases when using a large data size.

V. CONCLUSION AND FUTURE WORK
It is difficult to perform complex queries across many data stores; unlike applications that use a single data store, additional effort is required. This paper introduced a framework to handle complex database queries. The framework consists of three layers: the first is responsible for determining the database engine that matches the user's query statements; in the second layer, the system sends user queries to a processing layer containing Hadoop HDFS to store the data and the k-means clustering algorithm with MapReduce; and the last layer works with either the SQL engine or the selected NoSQL engine to do the required job. It should be noted that, first, a vector holding the names of the SQL and NoSQL engines is created to help determine the database engine that matches the user query. This paper proposed a time model for calculating the time cost and therefore used the sharding technique with MongoDB queries to segment the data and reduce query time. This study also used two types of partitioning in the Cassandra database: edge partitioning and vertex partitioning.

[Figure: Vertex partitioning results for Possibility A, Possibility B and Possibility C]