Build Your Own Mini LLM
In this project, we build a miniature stand-in for a Large Language Model (LLM) in Python.
Rather than generating text, it is a retrieval-based conversational bot that answers user
questions from a predefined knowledge base stored in a JSON file. The program uses TF-IDF (Term
Frequency-Inverse Document Frequency) and cosine similarity to compare the user's question
with the questions in the JSON file. If a match is found (above a certain similarity
threshold), the corresponding answer is returned. If no match is found, the bot can learn
from the user by asking for the correct answer and updating its knowledge base.
Key Features:
1. Question-Answer Matching:
○ The bot uses TF-IDF vectorization and cosine similarity to find the best match for the
user's question.
○ If the similarity score exceeds a threshold (e.g., 50%), the bot provides the
corresponding answer.
2. Interactive Learning:
○ If the bot doesn't know the answer, it asks the user for help: "Sorry, I don't know the
answer to that. Can you help me improve?"
○ If the user agrees, the bot prompts for the correct answer and updates the JSON file with
the new question-answer pair.
3. Persistent Knowledge Base:
○ The bot's knowledge is stored in a JSON file, which is updated dynamically as the bot
learns new information.
○ This ensures that the bot retains its knowledge across sessions.
4. Scalable and Customizable:
○ The JSON file can be easily expanded with new questions and answers, allowing the bot to
grow its knowledge base over time.
○ The similarity threshold and learning mechanism can be customized to suit specific use
cases.
How It Works:
1. The bot loads the JSON file containing predefined questions and answers.
2. The user inputs a question.
3. The bot converts the stored questions and the user's question into TF-IDF vectors and
calculates the cosine similarity between the user's question and each stored question.
4. If the best similarity score meets the threshold, the bot returns the corresponding answer.
5. Otherwise, the bot offers to learn: if the user agrees, it asks for the correct answer and
adds the new question-answer pair to the JSON file.
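To make step 3 concrete, here is a minimal pure-Python sketch of TF-IDF weighting and cosine similarity. It is an illustration only, not the code the project uses: the helper names (tokenize, tfidf_vectors, cosine) and the sample questions are invented for this example, and the smoothed IDF formula is assumed to match scikit-learn's documented default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep word tokens of two or more characters
    # (an assumption chosen to resemble scikit-learn's default tokenizer).
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    # Build one {term: weight} dict per document.
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = term count * smoothed IDF, where IDF = ln((1 + n) / (1 + df)) + 1.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(vec_a, vec_b):
    # Dot product over shared terms divided by the product of the vector norms.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


known_questions = ["What is Python?", "How do I install scikit-learn?"]
user_question = "what is python"

# Vectorize the stored questions together with the user's question.
vectors = tfidf_vectors(known_questions + [user_question])
user_vector = vectors[-1]
scores = [cosine(user_vector, vec) for vec in vectors[:-1]]
print(scores)
```

The first score is 1.0 (up to floating-point rounding) because both questions reduce to the same tokens, and the second is 0.0 because no terms are shared. With a threshold of 0.5, only the first stored question would count as a match.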
Applications:
○ Educational Tool: Teach students about AI, natural language processing, and programming.
○ Customer Support: Build a simple chatbot for answering FAQs.
○ Personal Assistant: Create a custom assistant for specific tasks or knowledge domains.
Imports
○ json: Used for reading and writing JSON files.
○ TfidfVectorizer: From scikit-learn, used to convert a collection of raw documents to a
matrix of TF-IDF features.
○ cosine_similarity: Also from scikit-learn, used to calculate cosine similarity between
vectors.
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
Function Definitions
1. load_qa_data(file_path)
Loads questions and answers from a JSON file.
def load_qa_data(file_path):
    # Return the stored question-answer pairs, or an empty dict on first run.
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
    except FileNotFoundError:
        data = {}
    return data
2. save_qa_data(file_path, qa_data)
Saves updated questions and answers back to the JSON file.
def save_qa_data(file_path, qa_data):
    # Overwrite the file with the full, updated knowledge base.
    with open(file_path, 'w') as file:
        json.dump(qa_data, file, indent=4)
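As a quick check of how these two functions provide persistence, the sketch below saves one pair and reads it back. The functions are repeated so the block runs on its own; the file name qa_data.json and the sample entry are assumptions made for this example, and a temporary directory is used so no real data is touched.

```python
import json
import os
import tempfile


def load_qa_data(file_path):
    try:
        with open(file_path, 'r') as file:
            data = json.load(file)
    except FileNotFoundError:
        data = {}
    return data


def save_qa_data(file_path, qa_data):
    with open(file_path, 'w') as file:
        json.dump(qa_data, file, indent=4)


# Use a throwaway directory so the example does not touch real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qa_data.json")  # hypothetical file name

    qa_data = load_qa_data(path)       # file does not exist yet
    was_empty = (qa_data == {})        # a missing file yields an empty dict

    qa_data["What is TF-IDF?"] = "A term-weighting scheme for text."
    save_qa_data(path, qa_data)

    reloaded = load_qa_data(path)      # simulates starting a new session

print(was_empty)
print(reloaded)
```

Note that only a missing file is handled: a file that exists but contains invalid JSON would still raise json.JSONDecodeError.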
3. preprocess_data(qa_data)
Preprocesses loaded data to separate questions and answers into lists.
def preprocess_data(qa_data):
    # Keys are questions and values are answers; both lists share the same order.
    questions = list(qa_data.keys())
    answers = list(qa_data.values())
    return questions, answers
4. find_best_match(user_question, questions, answers, threshold=0.5)
Finds the best matching answer to a user's question using TF-IDF and cosine similarity.
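def find_best_match(user_question, questions, answers, threshold=0.5):
    # An empty knowledge base has nothing to compare against.
    if not questions:
        return None
    # Fit TF-IDF on the stored questions plus the user's question.
    all_questions = questions + [user_question]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_questions)
    # The last row is the user's question; the rest are the stored questions.
    user_vector = tfidf_matrix[-1]
    question_vectors = tfidf_matrix[:-1]
    similarities = cosine_similarity(user_vector, question_vectors).flatten()
    best_match_index = similarities.argmax()
    best_match_score = similarities[best_match_index]
    # Only answer when the best score reaches the threshold.
    if best_match_score >= threshold:
        return answers[best_match_index]
    else:
        return None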
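The following sketch shows how the function might be called and how the threshold changes its behaviour. The function is repeated so the block runs on its own (with a guard for an empty knowledge base added for this sketch), and the questions and answers are invented sample data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_best_match(user_question, questions, answers, threshold=0.5):
    # Guard added for this sketch: nothing to compare against.
    if not questions:
        return None
    all_questions = questions + [user_question]
    tfidf_matrix = TfidfVectorizer().fit_transform(all_questions)
    similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1]).flatten()
    best_match_index = similarities.argmax()
    if similarities[best_match_index] >= threshold:
        return answers[best_match_index]
    return None


# Invented sample data for illustration.
questions = ["What is Python?", "What is the capital of France?"]
answers = ["Python is a programming language.", "Paris."]

# Same words as a stored question (case and punctuation are ignored).
hit = find_best_match("what is python", questions, answers)

# No words in common with any stored question.
miss = find_best_match("Tell me about quantum gravity", questions, answers)

# With threshold=0.0 every score passes, so even an unrelated question gets an answer.
loose = find_best_match("Tell me about quantum gravity", questions, answers, threshold=0.0)

print(hit)
print(miss)
print(loose)
```

A threshold that is too low makes the bot answer unrelated questions, while one that is too high makes it reject reasonable paraphrases, so the value is worth tuning on real questions.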
JSON sample:
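Judging from preprocess_data, which treats keys as questions and values as answers, the file is a single JSON object that maps each question to its answer. The sketch below prints such a file's contents. The entries are invented placeholders, not the project's actual data.

```python
import json

# Invented placeholder entries: each key is a question, each value is its answer.
sample_qa = {
    "What is Python?": "Python is a high-level programming language.",
    "What is TF-IDF?": "A weighting scheme that scores how important a term is to a document.",
    "What does cosine similarity measure?": "The cosine of the angle between two vectors.",
}

# The same formatting save_qa_data uses (indent=4).
sample_json = json.dumps(sample_qa, indent=4)
print(sample_json)
```

Saving the printed text to a .json file gives a knowledge base that load_qa_data can read directly.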
Main implementation example:
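One possible way to tie the functions together is sketched below. It is an assumption about how the main loop could look, not the original implementation: the run_chat name, the 'quit' command, the prompt wording other than the apology message, and the injectable ask and say parameters are all hypothetical. The helper functions are repeated in compact form so the block runs on its own.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def load_qa_data(file_path):
    try:
        with open(file_path, 'r') as file:
            return json.load(file)
    except FileNotFoundError:
        return {}


def save_qa_data(file_path, qa_data):
    with open(file_path, 'w') as file:
        json.dump(qa_data, file, indent=4)


def find_best_match(user_question, questions, answers, threshold=0.5):
    if not questions:  # empty knowledge base: nothing to match
        return None
    tfidf_matrix = TfidfVectorizer().fit_transform(questions + [user_question])
    similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1]).flatten()
    best_match_index = similarities.argmax()
    if similarities[best_match_index] >= threshold:
        return answers[best_match_index]
    return None


def run_chat(file_path, threshold=0.5, ask=input, say=print):
    # `ask` and `say` default to input/print; they can be swapped out for testing.
    qa_data = load_qa_data(file_path)
    say("Ask me anything (type 'quit' to exit).")
    while True:
        user_question = ask("You: ").strip()
        if user_question.lower() == "quit":
            break
        if not user_question:
            continue
        questions, answers = list(qa_data.keys()), list(qa_data.values())
        answer = find_best_match(user_question, questions, answers, threshold)
        if answer is not None:
            say(f"Bot: {answer}")
            continue
        # No match: offer to learn from the user.
        say("Bot: Sorry, I don't know the answer to that. Can you help me improve?")
        if ask("Teach me? (yes/no): ").strip().lower() == "yes":
            new_answer = ask("What is the correct answer? ").strip()
            if new_answer:
                qa_data[user_question] = new_answer
                save_qa_data(file_path, qa_data)  # persist immediately
                say("Bot: Thanks, I've learned something new.")
```

Calling run_chat("qa_data.json") from a terminal starts the conversation. Saving after every new pair means nothing is lost if the session ends unexpectedly.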