Enhancing Reliability in Distributed Cloud Systems with a Systematic Review of Fault Tolerance Techniques for Resilient Infrastructure Design

Authors

  • Umang Garg, USA

Keywords:

Distributed Cloud Systems, Fault Tolerance, Resilient Infrastructure, Reliability, Systematic Review

Abstract

Distributed cloud systems are critical enablers of modern computing, yet hardware faults, software bugs, and network failures frequently undermine their reliability. This paper systematically reviews fault tolerance techniques for improving that reliability. By synthesizing findings from published original research, we identify effective strategies for resilient infrastructure design, including replication, checkpointing, and machine learning-based fault prediction. A comparative analysis highlights the trade-offs among these techniques in performance, cost, and implementation complexity. The insights from this study can guide practitioners in designing fault-tolerant architectures that maintain service continuity in distributed cloud environments.

Published

2022-07-21

How to Cite

Umang Garg. (2022). Enhancing Reliability in Distributed Cloud Systems with a Systematic Review of Fault Tolerance Techniques for Resilient Infrastructure Design. INTERNATIONAL JOURNAL OF ENGINEERING AND TECHNOLOGY RESEARCH & DEVELOPMENT, 3(2), 1–4. https://ijetrd.com/index.php/ijetrd/article/view/IJETRD.03.02.001