The issue of Big Data testing is important enough to be on the EU's agenda until 2020, with the goal of creating a unified testing infrastructure for governance purposes. The EU's collaborative effort targets collective learning and saves the time that would otherwise be spent developing the same solutions in parallel. This could be inspirational for companies working with big data.
However, "big data" is a misleading name, since its most significant challenges relate not only to volume but also to the other two Vs: variety and velocity. An enormous amount of data that is constantly refreshed and updated is not only a logistical nightmare but also a source of accuracy challenges. The real question is: how can a company make sure that the petabytes of data it owns and uses for the business are accurate?
The 3 stages of Big Data testing
Because of the structural differences found in big data, the initial testing is not concerned with verifying that the components work the way they should, but with making sure that the data is clean, correct, and can be fed into the algorithms.
This change comes from the fact that algorithms feeding on Big Data are typically based on deep learning and enhance themselves without external intervention. If the data is flawed, the results will be flawed as well. In the case of relational databases, this step was a simple validation and elimination of null records, but for big data it is a process as complex as software testing itself. Big data testing includes three main components, which we will discuss in detail.
1. Data validation (pre-Hadoop)
Big data comes in three structural flavors: tabulated, as in traditional databases; semi-structured (tags, categories); and unstructured (comments, videos). Each bit of information is dumped into a 'data lake,' a distributed repository that has only a very loose structure, called a schema.
Before any transformation is applied to any of the information, the necessary steps should be:
● Checking for accuracy. Ensuring that all the information has been transferred to the system in a way that can be read and processed, and eliminating any problems related to incorrect replication.
● Validating data types and ranges so that each variable corresponds to its definition, and there are no errors caused by different character sets. As an example, some financial data use "." as a delimiter, others use ",", which can create confusion and errors.
● Cross-validation. Make sure the data is consistent with other recordings and requirements, such as the maximum length, or that the information is relevant for the necessary timeframe.
● Structured validation. Combine variables and test them together by creating objects or sets. As an example, instead of testing name, address, age and earnings separately, it’s necessary to create the “client” object and test that.
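The validation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the record fields, the delimiter-normalization rules, and the notion of a "client" object are all illustrative assumptions.

```python
import re

# Hypothetical raw records as they might arrive in the data lake;
# field names and validation rules are illustrative assumptions.
RAW_RECORDS = [
    {"name": "Ada Lovelace", "age": "36", "earnings": "1.250,00"},  # comma decimal
    {"name": "Alan Turing",  "age": "41", "earnings": "1,250.00"},  # point decimal
    {"name": "",             "age": "-5", "earnings": "n/a"},       # invalid row
]

def normalize_earnings(value):
    """Normalize '1.250,00' and '1,250.00' delimiter styles to a float, or None."""
    value = value.strip()
    if re.fullmatch(r"\d{1,3}(\.\d{3})*,\d{2}", value):   # European style
        return float(value.replace(".", "").replace(",", "."))
    if re.fullmatch(r"\d{1,3}(,\d{3})*\.\d{2}", value):   # US style
        return float(value.replace(",", ""))
    return None

def validate_client(record):
    """Structured validation: test the 'client' object as a whole."""
    errors = []
    if not record["name"]:
        errors.append("missing name")
    try:
        age = int(record["age"])
        if not 0 <= age <= 130:                            # range check
            errors.append("age out of range")
    except ValueError:
        errors.append("age not an integer")
    if normalize_earnings(record["earnings"]) is None:
        errors.append("unparseable earnings")
    return errors

clean = [r for r in RAW_RECORDS if not validate_client(r)]
print(len(clean))  # two of the three sample records pass
```

Note that the earnings check combines the accuracy, type/range, and delimiter concerns from the list above, while `validate_client` performs the structured validation on the whole object rather than on isolated fields.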
2. Map-reduce validation and checking the logic
Map-reduce takes big data and tries to introduce some structure into it by reducing complexity. To enable parallel processing, the data needs to be split between different nodes, held together by a central node. In this case, the minimal testing means:
● Checking for consistency in each node, and making sure nothing is lost in the split process.
● Validating that the expected map-reduce operation is performed, and key-value pairs are generated.
● Making sure the reduction is in line with the project’s business logic. Checking this for each node and for the nodes taken together.
● Checking that processing through map reduce is correct by referring to initial data.
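A toy word-count job makes these checks concrete. The sketch below simulates the map, shuffle, and reduce phases in plain Python, assuming nothing about a specific framework; in a real Hadoop job the split across nodes and the shuffle are handled by the framework itself.

```python
from collections import defaultdict

def map_phase(line):
    # emit (key, value) pairs -- one per word
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # group values by key, as the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big testing", "testing the data"]

# Simulate the split: each "node" maps its own slice of the input.
node_outputs = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(shuffle(node_outputs))

# Check against the initial data: no pairs lost in the split.
assert sum(result.values()) == sum(len(line.split()) for line in lines)
print(result["big"], result["data"], result["testing"])  # 2 2 2
```

The final assertion is exactly the kind of consistency check described above: the reduced totals are reconciled against the initial data to confirm nothing was lost between the split and the aggregation.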
Some clients can offer real data for test purposes; others might be reluctant and ask the solution provider to use artificial data. Unfortunately, when dummy data is used, results can vary, and the model may be insufficiently calibrated for real-life purposes.
At the end of the map-reducing process, it’s necessary to move the results to the data warehouse to be further accessed through dashboards or queries. Here, testing is related to:
● Checking that no data was corrupted during the transformation process or by copying it in the warehouse.
● Making sure aggregation was performed correctly.
● Validating that the right results are loaded in the right place.
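A simple reconciliation routine can cover these warehouse checks. The sketch below is a hedged illustration: the aggregate keys and the `load_into_warehouse` stand-in are hypothetical, since the real load step would be an actual warehouse operation.

```python
# Hypothetical post-load reconciliation: compare aggregates between the
# map-reduce output and what actually landed in the warehouse.
source_totals = {"region_eu": 1250.0, "region_us": 980.5}  # from the job

def load_into_warehouse(totals):
    # stand-in for the real load step (e.g. an INSERT ... SELECT)
    return dict(totals)

warehouse_totals = load_into_warehouse(source_totals)

def reconcile(source, target, tolerance=1e-9):
    """Return the keys whose aggregates are missing or differ after the load."""
    mismatches = []
    for key, value in source.items():
        if key not in target:
            mismatches.append((key, "missing in warehouse"))
        elif abs(target[key] - value) > tolerance:
            mismatches.append((key, "aggregate mismatch"))
    return mismatches

print(reconcile(source_totals, warehouse_totals))  # [] -> load is consistent
```

An empty mismatch list confirms all three bullet points at once: nothing was corrupted in transit, the aggregation totals survived the copy, and every result landed under the right key.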
3. Architecture and performance testing
Getting the data clean is just the first step in processing. The three Vs can still have a significant impact on the performance of the algorithms if these two dimensions are not adequately tested. Architecture and performance testing check that the existing resources are enough to withstand the demands and that the result will be attained in a satisfying time horizon. Sometimes this means almost instantaneously, as when we search for a certain song via SoundHound.
A great architecture design lets data flow freely and avoids redundancy, unnecessary copying, and needless moving of data between nodes. It should also eliminate sorting when it is not dictated by business logic and prevent the creation of bottlenecks.
The Hadoop architecture is distributed, and proper testing ensures that any faulty item is identified, information retrieved and re-distributed to a working part of the network.
Performance testing is the only part of Big Data testing that still resembles traditional testing. The focus is on memory usage, running time, and data flows, which need to be in line with the agreed SLAs. The role of performance tests is to understand the system's limits and prepare for potential failures caused by overload. Testing is performed by dividing the application into clusters, developing scripts to simulate the predicted load, running the tests, and collecting the results.
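The load-test loop just described can be sketched as follows. Everything here is an assumption for illustration: the SLA budget, the workload, and the repetition count stand in for real queries run against an actual cluster.

```python
import time

# Sketch of a load-test loop: run a representative workload repeatedly,
# collect latencies, and compare them with an assumed SLA budget.
SLA_SECONDS = 0.05  # hypothetical per-query budget

def representative_query(n):
    # stand-in workload; a real test would run queries against the cluster
    return sum(i * i for i in range(n))

def run_load_test(repetitions=50, size=10_000):
    latencies = []
    for _ in range(repetitions):
        start = time.perf_counter()
        representative_query(size)
        latencies.append(time.perf_counter() - start)
    return latencies

latencies = run_load_test()
p95 = sorted(latencies)[int(len(latencies) * 0.95)]
print(f"p95 latency: {p95:.6f}s, within SLA: {p95 < SLA_SECONDS}")
```

Collecting a latency distribution rather than a single measurement matters here: an SLA is usually stated against a percentile (p95 or p99), because a fast average can hide the overload failures the test is meant to expose.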
The nature of the datasets can create timing problems since a single test can take hours. In this case, Big Data automation is the only way to develop Big Data applications in due time.
Traditional software testing is based on a transparent organization and hierarchy of a system's components, with well-defined interactions between them. Conversely, Big Data testing is more concerned with the accuracy of the data that propagates through the system and with the functionality and performance of the framework. Due to the large volume of operations necessary for Big Data, automation is no longer an option but a requirement.