Research Projects

Development of an On-Premise AI Product Recommendation Model Using an Open-Source LLM

As part of my MSc AI project in collaboration with Narosu Co., Ltd., I am developing an on-premise AI-powered product recommendation system tailored for independent online store operators. While large marketplaces benefit from advanced AI models, smaller vendors often lack such capabilities.

This project bridges that gap by delivering a localized, privacy-preserving, conversational AI chatbot that recommends products based on semantic understanding of customer input.

Key Features and Technical Implementation:

Retrieval-Augmented Generation (RAG): Implements a RAG pipeline using LangChain that combines vector-based retrieval with generative LLM responses to produce relevant, personalized product recommendations.
Natural Language Processing (NLP) and Vector Search: Runs semantic similarity search across more than 4 million product records to retrieve the 5 products most relevant to the user's input.
Conversational AI Chatbot Design:
- Supports conversation-based interaction using open-source LLMs (LLaMA).
- Supports multi-turn dialogue and remembers user preferences during a session.
- Handles English, Korean, and Japanese input.
- Includes at least two product images in the chatbot response, delivered as image URLs.
Adaptive Recommendations:
- If the initial suggestions are rejected, the chatbot proposes alternatives based on keywords not yet mentioned.
- Asks follow-up questions about preferred color, brand, or product options to refine results.
Multi-source Integration: Combines data from OwnerClan (3 products) and Naver (2 products, via API or web crawling).
On-Premise Deployment: Ensures full control over recommendation logic and data privacy without relying on cloud infrastructure.
Performance Optimization: Chatbot responses are optimized to return within 5–10 seconds.
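
The retrieval and multi-source steps listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: it uses plain Python cosine similarity over made-up 3-dimensional vectors in place of Qdrant and real embeddings, and the product IDs, the `recommend` function, and the per-source quota dictionary are hypothetical names chosen for the example.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Vector, catalog: Dict[str, Vector], k: int) -> List[str]:
    """Return the IDs of the k catalog entries most similar to the query."""
    scored = [(cosine_similarity(query, vec), pid) for pid, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:k]]


def recommend(query: Vector,
              sources: Dict[str, Dict[str, Vector]],
              quotas: Dict[str, int]) -> List[str]:
    """Take a fixed number of top matches from each source, in quota order."""
    results: List[str] = []
    for source_name, quota in quotas.items():
        results.extend(top_k(query, sources[source_name], quota))
    return results


# Toy embeddings (invented for illustration; real vectors have many more dimensions).
SOURCES = {
    "ownerclan": {
        "oc-red-sneaker": [0.9, 0.1, 0.0],
        "oc-blue-sneaker": [0.8, 0.2, 0.1],
        "oc-running-shoe": [0.7, 0.3, 0.0],
        "oc-kettle": [0.0, 0.1, 0.9],
    },
    "naver": {
        "nv-trail-shoe": [0.85, 0.15, 0.0],
        "nv-sandal": [0.5, 0.5, 0.1],
        "nv-toaster": [0.0, 0.0, 1.0],
    },
}
QUOTAS = {"ownerclan": 3, "naver": 2}  # 3 + 2 = 5 recommendations
QUERY = [1.0, 0.0, 0.0]  # stands in for the embedded user message

print(recommend(QUERY, SOURCES, QUOTAS))
```

In a real deployment the brute-force scan would be replaced by a vector database query, since scanning millions of records per request would not fit the response-time target.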

Technologies Used:

Embedding Model: text-embedding-ada-002 (OpenAI)
Vector Database: Qdrant
LLM Reasoning: LLaMA 3.1 8B Instant
Framework: LangChain for RAG pipeline
User Interface: Gradio
Backend: Gradio + FastAPI

Link

Syntax-Aware LLM-Powered Code Completion Extension for Microsoft Small Basic and C

This project introduces two new Visual Studio Code (VS Code) extensions that improve code completion for Microsoft Small Basic and C, addressing limitations of both traditional and AI-based tools such as Copilot. Standard VS Code extensions rely on user-typed prefixes or language-specific grammar assumptions. The proposed extensions instead offer suggestions based on syntactic structure and require no initial input, which helps users who are unfamiliar with a language's grammar. The system is built with the YAPB parser builder tool and LR grammars, and uses WithinTop3Guide to present the top three context-aware suggestions. It also provides an interactive preview of candidates (showing both identifier names and decomposable expressions) to improve comprehension and usability for novice and advanced users alike. The extensions aim to deliver grammar-compliant, generative AI-driven code completions and are fully open source on GitHub.
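
The idea of offering the top three structural candidates for the current parse position can be sketched as follows. This is a simplified illustration, not the extensions' actual code: the parser state numbers, the candidate strings, the frequency counts, and the rule of ranking by observed frequency are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical table: LR parser state -> how often each syntactic
# continuation was observed after that state in sample programs.
CANDIDATES_BY_STATE: Dict[int, Counter] = {
    0: Counter({
        "ID = Expr": 120,
        "ID . ID ( Args )": 95,
        "If Expr Then": 40,
        "While Expr": 22,
        "For ID = Expr To Expr": 15,
    }),
    7: Counter({
        "Then": 60,
        "And Expr": 12,
    }),
}


def top_candidates(state: int, k: int = 3) -> List[str]:
    """Return up to k continuations for a parser state, most frequent first."""
    counts = CANDIDATES_BY_STATE.get(state, Counter())
    return [structure for structure, _ in counts.most_common(k)]


print(top_candidates(0))  # three most frequent continuations at state 0
print(top_candidates(7))  # fewer than three are returned if fewer exist
```

Because the lookup is keyed on the parser state rather than on typed characters, a suggestion list is available even when the user has not typed a prefix.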

Link 1 (C)

Link 2 (SB)

Cardiovascular Disease Detection Using Random Forest vs. Decision Trees

This project develops a machine learning model in Python that predicts the presence or absence of cardiovascular disease (CVD) from a set of input features. It compares the performance of two popular tree-based algorithms, Random Forest and Decision Trees, which are widely used in machine learning and offer different advantages and trade-offs. The dataset contains clinical and demographic features of individuals, such as age, gender, blood pressure, cholesterol levels, and smoking habits. Each instance is labeled as either having a cardiovascular disease or being disease-free.
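
The single-tree versus ensemble comparison can be illustrated with a small standard-library sketch. To stay short it uses depth-1 trees (decision stumps) and a bagged, feature-subsampled ensemble of stumps as a simplified stand-in for full Decision Tree and Random Forest models; a real project would more likely use a library such as scikit-learn. The rows (age, systolic blood pressure, cholesterol) are invented toy values, not clinical data, and all function names are hypothetical.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Row = Sequence[float]
Stump = Tuple[int, float, int]  # (feature index, threshold, label when value > threshold)


def fit_stump(X: List[Row], y: List[int], features: Sequence[int]) -> Stump:
    """Pick the feature/threshold/polarity with the fewest training errors."""
    best, best_err = None, len(y) + 1
    for f in features:
        for t in sorted({row[f] for row in X}):
            for hi in (0, 1):
                err = sum((hi if row[f] > t else 1 - hi) != label
                          for row, label in zip(X, y))
                if err < best_err:
                    best, best_err = (f, t, hi), err
    return best


def predict_stump(stump: Stump, row: Row) -> int:
    f, t, hi = stump
    return hi if row[f] > t else 1 - hi


def fit_forest(X: List[Row], y: List[int], n_trees: int,
               rng: random.Random) -> List[Stump]:
    """Train each stump on a bootstrap sample and a random feature subset."""
    n_features = len(X[0])
    subset_size = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]          # sample rows with replacement
        feats = rng.sample(range(n_features), subset_size)  # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest: List[Stump], row: Row) -> int:
    """Majority vote over all stumps."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


# Invented toy rows: [age, systolic BP, cholesterol]; label 1 = CVD present.
TRAIN_X = [[45, 120, 190], [50, 125, 210], [38, 118, 180], [61, 150, 260],
           [58, 145, 240], [66, 160, 280], [42, 130, 200], [70, 155, 250]]
TRAIN_Y = [0, 0, 0, 1, 1, 1, 0, 1]
TEST_X = [[55, 140, 230], [40, 122, 195], [52, 128, 205], [63, 148, 255]]
TEST_Y = [1, 0, 0, 1]

tree = fit_stump(TRAIN_X, TRAIN_Y, range(3))
forest = fit_forest(TRAIN_X, TRAIN_Y, n_trees=25, rng=random.Random(0))

print("single tree:", accuracy(TEST_Y, [predict_stump(tree, r) for r in TEST_X]))
print("ensemble:   ", accuracy(TEST_Y, [predict_forest(forest, r) for r in TEST_X]))
```

With data this small the scores say nothing about the real models; the point is only the structure of the comparison, where both models are trained on the same rows and scored on the same held-out rows.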

Link

Evaluating Models on the Cleveland Heart Disease Dataset Using Various Metrics

The Cleveland Heart Disease dataset contains a wide range of patient attributes, including clinical, demographic, and physiological features such as age, gender, cholesterol levels, blood pressure, and electrocardiogram measurements. Each instance in the dataset is labeled as either having heart disease or being disease-free. In this project, various machine learning algorithms will be applied to the dataset, including but not limited to decision trees, random forest, support vector machines (SVM), logistic regression, and neural networks. These algorithms offer different strengths and weaknesses, and by evaluating their performance, we can identify the most effective approach for heart disease prediction.
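
Comparing algorithms requires a common set of evaluation metrics. The sketch below computes accuracy, precision, recall, and F1 from binary predictions using only the standard library. The choice of these four metrics is an assumption for illustration, since the description above does not name the metrics used, and the label lists are made up.

```python
from typing import Dict, List


def confusion_counts(y_true: List[int], y_pred: List[int]) -> Dict[str, int]:
    """Count true/false positives and negatives for binary labels (1 = disease)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def safe_div(numerator: float, denominator: float) -> float:
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    c = confusion_counts(y_true, y_pred)
    precision = safe_div(c["tp"], c["tp"] + c["fp"])
    recall = safe_div(c["tp"], c["tp"] + c["fn"])
    return {
        "accuracy": safe_div(c["tp"] + c["tn"], len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Made-up labels and predictions, standing in for one model's test-set output.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(evaluate(y_true, y_pred))
```

In a medical screening setting, recall is often watched closely because a false negative (a missed case of heart disease) is usually costlier than a false positive.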

Link

Cyberbullying Detection on Social Platforms

The project uses NLP techniques to analyze the text of social media posts and classify them as either cyberbullying or non-cyberbullying. NLP encompasses a range of methods and algorithms that enable computers to understand and process human language. By applying techniques such as text preprocessing, sentiment analysis, part-of-speech tagging, and machine learning, the project aims to extract meaningful features from text data and develop a robust model for cyberbullying detection. Model development involves several steps. First, a comprehensive dataset of social media posts, labeled as either cyberbullying or non-cyberbullying, is collected and prepared for analysis. The dataset may include various forms of text data, such as tweets, comments, or forum posts, from different social media platforms. Next, the collected data is preprocessed to remove noise, handle punctuation, and transform the text into a format suitable for analysis. NLP techniques, such as tokenization, stemming, and
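
The preprocessing step described above (noise removal, punctuation handling, tokenization, stemming) can be sketched with the standard library. This is a minimal illustration, not the project's pipeline: the regular expressions, the `preprocess` function, and the suffix-stripping stemmer are assumptions, and the stemmer is a crude stand-in for a proper algorithm such as Porter stemming.

```python
import re
from typing import List

URL_RE = re.compile(r"https?://\S+")   # links carry little signal for this task
MENTION_RE = re.compile(r"@\w+")       # user handles are treated as noise
TOKEN_RE = re.compile(r"[a-z]+")       # keep alphabetic tokens only
SUFFIXES = ("ing", "ed", "s")          # checked in this order


def crude_stem(token: str) -> str:
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Lowercase, drop URLs and mentions, tokenize, and stem."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return [crude_stem(token) for token in TOKEN_RE.findall(text)]


print(preprocess("@sam Stop posting LIES!!! see https://example.com/x"))
```

The resulting token lists would then be converted into numeric features (for example, word counts) before being passed to a classifier.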

Link

Iris Flower Classification Using Deep Learning

The Iris flower dataset consists of measurements of sepal length, sepal width, petal length, and petal width for three different species of Iris flowers: Setosa, Versicolor, and Virginica. The goal of the project is to develop a deep learning model that can analyze these feature measurements and accurately predict the corresponding Iris flower species. In this project, a deep learning model will be designed and trained using the Iris flower dataset. The model will be built using popular deep learning frameworks such as TensorFlow or PyTorch, which provide a wide range of tools and functionalities for constructing and training neural networks.
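
The shape of such a model (four input measurements, a hidden layer, and three output classes) can be shown with a forward pass in plain Python. This is only a sketch under stated assumptions: it does not use TensorFlow or PyTorch, the hidden layer size of 8 is an arbitrary choice, and the weights are random and untrained, so the printed probabilities are not meaningful predictions.

```python
import math
import random
from typing import List, Tuple

SPECIES = ["Setosa", "Versicolor", "Virginica"]
Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    """Compute weights @ x + biases."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: List[float]) -> List[float]:
    """Convert scores to probabilities; subtracting the max avoids overflow."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: List[float], hidden: Layer, output: Layer) -> List[float]:
    """4 features -> hidden layer with ReLU -> 3 class probabilities."""
    return softmax(dense(relu(dense(x, hidden)), output))


rng = random.Random(0)
hidden_layer = init_layer(4, 8, rng)   # 8 hidden units: an arbitrary choice
output_layer = init_layer(8, 3, rng)

# [sepal length, sepal width, petal length, petal width] in cm
sample = [5.1, 3.5, 1.4, 0.2]
probabilities = forward(sample, hidden_layer, output_layer)
for name, p in zip(SPECIES, probabilities):
    print(f"{name}: {p:.3f}")
```

Training would adjust the weights to minimize cross-entropy loss between these probabilities and the true species labels, which is the part a framework such as TensorFlow or PyTorch automates.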

Coming Soon