Clustering Lambda Terms Using Embeddings
Abstract
Relevance. Optimizing compilers and interpreters for functional programming languages, viewed mainly through the lens of Lambda Calculus, is paramount in addressing the growing complexity and performance requirements of software engineering. This study focuses on this critical area, aiming to leverage machine learning techniques to enhance the identification and application of code reduction strategies.
Goal. The primary goal is to improve the performance and efficiency of compilers and interpreters by deepening the understanding of program code reduction strategies within Lambda Calculus. The research uses machine learning to convert lambda terms into feature vectors, facilitating the exploration of optimal reduction strategies.
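To make the notion of a reduction strategy concrete, the following is a minimal, hypothetical Python sketch (not part of the study's code) contrasting normal-order and applicative-order beta reduction. The term representation and the example term are illustrative choices, with variable names picked so that no variable capture can occur.

```python
# Lambda terms as nested tuples:
#   ('var', x), ('lam', x, body), ('app', fun, arg)
# Naive substitution: assumes variable names avoid capture.

def subst(term, name, value):
    """Substitute value for free occurrences of name in term."""
    kind = term[0]
    if kind == 'var':
        return value if term[1] == name else term
    if kind == 'lam':
        if term[1] == name:          # binder shadows the name
            return term
        return ('lam', term[1], subst(term[2], name, value))
    return ('app', subst(term[1], name, value), subst(term[2], name, value))

def step(term, strategy):
    """Perform one beta-reduction step, or return None if in normal form."""
    kind = term[0]
    if kind == 'app':
        fun, arg = term[1], term[2]
        if strategy == 'applicative':        # reduce the argument first
            s = step(arg, strategy)
            if s is not None:
                return ('app', fun, s)
        if fun[0] == 'lam':                  # leftmost-outermost redex
            return subst(fun[2], fun[1], arg)
        s = step(fun, strategy)
        if s is not None:
            return ('app', s, arg)
        s = step(arg, strategy)
        if s is not None:
            return ('app', fun, s)
        return None
    if kind == 'lam':
        s = step(term[2], strategy)
        return None if s is None else ('lam', term[1], s)
    return None                              # a variable is in normal form

def reduce_count(term, strategy):
    """Reduce to normal form, counting the beta steps taken."""
    steps = 0
    while (nxt := step(term, strategy)) is not None:
        term, steps = nxt, steps + 1
    return term, steps

# (\x. z) ((\y. y) w): normal order discards the argument unreduced,
# while applicative order reduces it first, costing an extra step.
t = ('app', ('lam', 'x', ('var', 'z')),
            ('app', ('lam', 'y', ('var', 'y')), ('var', 'w')))
print(reduce_count(t, 'normal'))       # → (('var', 'z'), 1)
print(reduce_count(t, 'applicative'))  # → (('var', 'z'), 2)
```

Even on this tiny term the strategies differ in cost, which is the kind of distinction the study aims to predict from a term's feature vector.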
Research methods. The study employs a comprehensive approach, generating a wide range of lambda terms for analysis. It uses OpenAI's text embedding model to transform these terms into embedding vectors, applying clustering analysis (DBSCAN with the Euclidean distance metric) and visualizations (PCA and t-SNE) to identify patterns and assess feature separability. The research also addresses the complexity of choosing between term-specific and universal reduction strategies.
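The analysis pipeline described above can be sketched as follows, assuming scikit-learn is available. Random stand-in vectors replace the actual OpenAI embeddings (querying the embedding API requires a key and is out of scope here); all parameter values are illustrative, not the study's settings.

```python
# Sketch of the clustering/visualization pipeline, assuming scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for embeddings of generated lambda terms (n_terms x dim):
# two well-separated synthetic groups of 50 vectors each.
X = np.vstack([rng.normal(loc=c, scale=0.05, size=(50, 32)) for c in (0.0, 1.0)])

# Density-based clustering with the Euclidean metric, as in the study.
labels = DBSCAN(eps=1.0, min_samples=5, metric='euclidean').fit_predict(X)
print('clusters found:', len(set(labels) - {-1}))  # noise points carry label -1

# 2-D projections for visually inspecting feature separability.
xy_pca = PCA(n_components=2).fit_transform(X)
xy_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(xy_pca.shape, xy_tsne.shape)
```

In practice the `X` matrix would hold the embedding vectors returned for each lambda term, and `eps`/`min_samples` would be tuned against cluster-validity indices.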
Results. The findings reveal clear distinctions among lambda term representations within the embedding vectors, supporting the hypothesis that cluster analysis can uncover identifiable patterns. However, challenges arose because OpenAI's embedding models are trained primarily on human-readable text and code, which complicates the precise representation of Lambda Calculus terms.
Conclusions. This exploration underscores the difficulty of pinpointing the optimal reduction strategy for Lambda Calculus terms, highlighting the limitations of current mathematical models and the need for tailored machine learning applications. Despite the limited adaptability of the OpenAI embedding model, the research offers significant insight into the potential of machine learning to refine the optimization processes of compilers and interpreters in functional programming environments.