Large-Scale Data Processing Using Distributed Computing Frameworks
Published 2026-01-04
Keywords: Large-scale data processing, Distributed computing, Big data analytics, MapReduce, Apache Spark, Cluster computing, Scalability, Fault tolerance
Section: Articles

How to Cite
[1] P. B, “Large-Scale Data Processing Using Distributed Computing Frameworks”, IJADSMC, vol. 1, no. 1, pp. 27–39, Jan. 2026, Accessed: Mar. 02, 2026. [Online]. Available: https://worldcometresearchgroup.com/index.php/ijadsmc/article/view/50

Abstract
The explosive growth of digital information, driven by social media platforms, Internet of Things (IoT) devices, enterprise information systems, scientific simulations, and e-business applications, has radically changed the computational demands of modern data analytics. Conventional centralized data processing architectures cannot cope with the volume, velocity, and variety of such data, leading to scalability bottlenecks, high latency, and reduced fault tolerance. Distributed computing frameworks have therefore become a key enabler of massive data processing, harnessing parallelism, data locality, and resource elasticity across clusters of commodity hardware. This paper presents an in-depth study of large-scale data processing in the context of distributed computing frameworks. It examines the architectural principles, programming models, and execution mechanisms underlying contemporary distributed data processing systems. Leading frameworks such as Hadoop MapReduce, Apache Spark, and Apache Flink are critically evaluated, along with their evolution from batch-oriented processing toward hybrid batch/stream processing models. An extensive literature review synthesizes prior work on scalability, fault tolerance, scheduling, and performance optimization in distributed environments. The proposed methodology introduces a layered distributed data processing architecture that integrates intelligent resource management, parallel execution engines, and scalable storage. Mathematical formulations of data partitioning, execution cost, and scalability are used to formalize system behavior. Experimental measurements on benchmark workloads show substantial improvements in throughput, execution time, and fault recovery compared with traditional centralized systems.
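As an illustrative aside (not drawn from the paper itself), the MapReduce programming model that the surveyed frameworks build on can be sketched in plain Python: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The function names below are hypothetical and only mimic the model's structure, not any framework's API.

```python
from collections import defaultdict
from functools import reduce

def map_phase(documents):
    """Map step: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle step: group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: aggregate each key's values (here, summing word counts)."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}

docs = ["big data systems", "distributed data processing"]
counts = reduce_phase(shuffle(map_phase(docs)))
```

In a real cluster the map and reduce steps run in parallel on partitioned data and the shuffle moves data across the network; this single-process sketch only shows the dataflow.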
The discussion analyzes the trade-offs between frameworks in terms of latency, resource efficiency, and programming complexity. The paper concludes by presenting open research opportunities, including adaptive resource scheduling, energy-efficient data processing, and the application of artificial intelligence to autonomous optimization. The results are valuable to researchers and practitioners designing the next generation of large-scale data analytics platforms.
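As a hedged sketch of the kind of scalability formalization the abstract refers to (the paper's exact formulations are not reproduced here), speedup and efficiency on a cluster of $p$ nodes are conventionally defined as

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
```

where $T(p)$ is the execution time on $p$ nodes. With a serial fraction $f$ of the workload, Amdahl's law bounds the achievable speedup:

```latex
S(p) \le \frac{1}{f + \frac{1 - f}{p}}.
```

For example, with $f = 0.05$, speedup is capped at $20$ no matter how many nodes are added, which is why the discussed frameworks invest heavily in reducing serial coordination such as shuffles and barriers.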
References
[1] D. DeWitt and J. Gray, “Parallel database systems: The future of high performance database systems,” Communications of the ACM, vol. 35, no. 6, pp. 85–98, Jun. 1992.
[2] M. Stonebraker et al., “The case for shared nothing,” IEEE Database Engineering Bulletin, vol. 9, no. 1, pp. 4–9, Mar. 1986.
[3] T. Rauber and G. Rünger, Parallel Programming: For Multicore and Cluster Systems, 2nd ed. Berlin, Germany: Springer, 2013.
[4] I. Foster, Designing and Building Parallel Programs. Reading, MA, USA: Addison-Wesley, 1995.
[5] J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, Jan. 2008.
[6] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” in Proc. 19th ACM Symp. Operating Systems Principles (SOSP), 2003, pp. 29–43.
[7] T. White, Hadoop: The Definitive Guide, 4th ed. Sebastopol, CA, USA: O’Reilly Media, 2015.
[8] M. Zaharia et al., “Improving MapReduce performance in heterogeneous environments,” in Proc. 8th USENIX Conf. Operating Systems Design and Implementation (OSDI), 2008, pp. 29–42.
[9] M. Zaharia et al., “Spark: Cluster computing with working sets,” in Proc. 2nd USENIX Conf. Hot Topics in Cloud Computing (HotCloud), 2010, pp. 1–7.
[10] M. Zaharia et al., “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proc. 9th USENIX Conf. Networked Systems Design and Implementation (NSDI), 2012, pp. 15–28.
[11] P. Carbone et al., “Apache Flink: Stream and batch processing in a single engine,” IEEE Data Engineering Bulletin, vol. 38, no. 4, pp. 28–38, Dec. 2015.
[12] T. Akidau et al., “The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing,” Proc. VLDB Endowment, vol. 8, no. 12, pp. 1792–1803, Aug. 2015.
[13] S. Chintapalli et al., “Benchmarking streaming computation engines: Storm, Flink and Spark Streaming,” in Proc. IEEE Int. Parallel and Distributed Processing Symp. Workshops (IPDPSW), 2016, pp. 1789–1792.
[14] A. Verma et al., “Large-scale cluster management at Google with Borg,” in Proc. 10th Eur. Conf. Computer Systems (EuroSys), 2015, pp. 1–17.
[15] K. V. Rashmi, M. Zaharia, and R. Katz, “Fault tolerance in distributed systems: A comparative evaluation,” ACM Computing Surveys, vol. 48, no. 1, pp. 1–34, Sep. 2015.
[16] S. K. Sunkara, A. I. Ashirova, Y. Gulora, R. R. Baireddy, T. Tiwari, and G. V. Sudha, “AI-driven big data analytics in cloud environments: Applications and innovations,” in Proc. 2025 World Skills Conf. Universal Data Analytics and Sciences (WorldSUAS), Indore, India, 2025, pp. 1–6, doi: 10.1109/WorldSUAS66815.2025.11199123.