Enhancing Reliability in Distributed Cloud Systems with a Systematic Review of Fault Tolerance Techniques for Resilient Infrastructure Design
Keywords:
Distributed Cloud Systems, Fault Tolerance, Resilient Infrastructure, Reliability, Systematic Review
Abstract
Distributed cloud systems are critical enablers of modern computing, yet hardware faults, software bugs, and network issues often undermine their reliability. This paper systematically reviews fault tolerance techniques for improving the reliability of distributed cloud systems. By synthesizing findings from primary research studies, we identify effective strategies for resilient infrastructure design, including replication, checkpointing, and machine learning-based fault prediction. Through comparative analysis, we highlight the trade-offs among these strategies in performance, cost, and implementation complexity. The insights from this study can guide practitioners in designing fault-tolerant architectures that maintain service continuity in distributed cloud environments.
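As an illustration of one of the techniques named above, the following is a minimal sketch of checkpoint-and-restart for a single long-running job. It is not drawn from the reviewed studies: the function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the JSON state layout, and the `fail_at` parameter used to simulate a node failure are all assumptions made for this example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Persist state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    # os.replace swaps the file in one step, so a crash never leaves a half-written checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_index": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def run_job(items, path, checkpoint_every=2, fail_at=None):
    """Sum items, resuming from the last checkpoint.

    fail_at is a hypothetical hook that simulates a failure before
    the item at that index is processed.
    """
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += items[index]
        state["next_index"] = index + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

In this sketch, a lower `checkpoint_every` reduces the work repeated after a failure but increases write overhead, which is one small instance of the performance and cost trade-off the abstract refers to.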
License
Copyright (c) 2022 Umang Garg (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.