Generative AI for Intelligent Data Lineage Prediction in Dynamic Data Environments
Published 2026-01-11
Keywords: Data Lineage, Generative AI, Large Language Models (LLMs), Dynamic Data Environments, Data Engineering, Metadata Inference, Schema Evolution, Data Observability, Machine Learning for Data Systems, Graph Generation
Section
Articles
How to Cite
[1] N. Moradi and H. Hasan, “Generative AI for Intelligent Data Lineage Prediction in Dynamic Data Environments”, IJDEIC, vol. 1, no. 1, pp. 41–49, Jan. 2026, Accessed: Mar. 02, 2026. [Online]. Available: https://worldcometresearchgroup.com/index.php/ijdeic/article/view/48
Abstract
In today’s fast-paced data-driven enterprises, maintaining accurate and up-to-date data lineage is critical for governance, compliance, and trust. However, traditional lineage solutions often struggle to keep pace with the complexity and dynamism of modern data environments, characterized by frequent schema changes, evolving pipelines, and heterogeneous sources. This paper proposes a novel approach to intelligent data lineage prediction using Generative AI. By leveraging models such as large language models (LLMs) and graph-based generative techniques, we enable the automatic inference and prediction of lineage relationships, even in the presence of incomplete or evolving metadata. We present a modular framework that ingests system logs, transformation logic, and schema evolution data to learn latent patterns and generate accurate lineage graphs. Our experimental results demonstrate significant improvements over traditional and supervised methods in both accuracy and adaptability. This work paves the way for AI-native data observability systems that can evolve alongside the data they monitor.
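To make the notion of a lineage graph concrete, the following is a minimal sketch of how table-level lineage edges might be recovered from transformation logic. It is a deterministic, rule-based stand-in and not the generative framework described above: the regular expressions, the function names `extract_edges` and `upstream`, and the sample SQL statements are all illustrative assumptions, and the pattern matching only handles simple `INSERT INTO ... SELECT ... FROM ... JOIN ...` statements.

```python
import re
from collections import defaultdict

# Assumed, simplified patterns: real SQL needs a proper parser.
TARGET_RE = re.compile(r"insert\s+into\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_edges(statements):
    """Return a set of (source_table, target_table) lineage edges."""
    edges = set()
    for sql in statements:
        target = TARGET_RE.search(sql)
        if target is None:
            continue  # statement does not write to a table; skip it
        for source in SOURCE_RE.findall(sql):
            edges.add((source, target.group(1)))
    return edges


def upstream(edges, table):
    """Return every table that `table` depends on, directly or transitively."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)
    seen, stack = set(), [table]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical transformation logic, used only for illustration.
statements = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "INSERT INTO mart.revenue SELECT o.id, c.region "
    "FROM staging.orders o JOIN raw.customers c ON o.cid = c.id",
]

edges = extract_edges(statements)
print(sorted(upstream(edges, "mart.revenue")))
```

A rule-based extractor of this kind finds no edges when the transformation logic is missing or cannot be parsed, which is the gap that the inference from incomplete or evolving metadata described in the abstract is meant to address.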