Machine Learning & Neural Networks Blog

Build Your Own Mini LLM

In this project, we build a miniature, retrieval-based stand-in for a Large Language Model (LLM) in Python. The goal is a conversational bot that answers user questions from a predefined knowledge base stored in a JSON file. The program uses TF-IDF (Term Frequency-Inverse Document Frequency) vectors and cosine similarity to compare the user's question with the questions in the JSON file. If the best match scores at or above a similarity threshold, the bot returns the corresponding answer. Otherwise, the bot can learn from the user by asking for the correct answer and adding it to its knowledge base.

Key Features:
1. Question-Answer Matching:
○ The bot uses TF-IDF vectorization and cosine similarity to find the best match for the user's question.
○ If the similarity score meets or exceeds a threshold (e.g., 0.5, or 50%), the bot provides the corresponding answer.
2. Interactive Learning:
○ If the bot doesn't know the answer, it asks the user for help: "Sorry, I don't know the answer to that. Can you help me improve?"
○ If the user agrees, the bot prompts for the correct answer and updates the JSON file with the new question-answer pair.
3. Persistent Knowledge Base:
○ The bot's knowledge is stored in a JSON file, which is updated dynamically as the bot learns new information.
○ This ensures that the bot retains its knowledge across sessions.
4. Scalable and Customizable:
○ The JSON file can be easily expanded with new questions and answers, allowing the bot to grow its knowledge base over time.
○ The similarity threshold and learning mechanism can be customized to suit specific use cases.

How It Works:
1. The bot loads the JSON file containing predefined questions and answers.
2. The user inputs a question.
3. The bot converts the stored questions and the user's question into TF-IDF vectors and calculates the cosine similarity between the user's question and each stored question.
4. If the best match meets the similarity threshold, the bot returns the corresponding answer.
5. If no match is found, the bot asks the user for the correct answer and updates the JSON file.
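
To make step 3 concrete, here is a minimal from-scratch sketch of TF-IDF scoring and cosine similarity using only the standard library. It is an illustration, not the project's code: the helper names (tokenize, tfidf_vectors, cosine) and the sample questions are made up for this example, and the tokenization and smoothed IDF formula are assumed to roughly mirror scikit-learn's defaults.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep word tokens of two or more characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(documents):
    # Count the terms in each document.
    tokenized = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for counts in tokenized for term in counts)
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for counts in tokenized:
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Scale each vector to unit length (empty vectors stay empty).
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def cosine(a, b):
    # Both vectors are unit length (or empty), so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


# Hypothetical stored questions and user question.
stored = ["What is Python?", "How do I install scikit-learn?"]
user_question = "What is Python used for?"

*stored_vecs, user_vec = tfidf_vectors(stored + [user_question])
scores = [cosine(user_vec, vec) for vec in stored_vecs]
best = max(range(len(scores)), key=scores.__getitem__)
print(stored[best], round(scores[best], 3))
```

The first stored question shares three of the user's five terms, so it scores well above the second, which shares none and scores exactly zero.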


Applications:
○ Educational Tool: Teach students about AI, natural language processing, and programming.
○ Customer Support: Build a simple chatbot for answering FAQs.
○ Personal Assistant: Create a custom assistant for specific tasks or knowledge domains.




Imports
○ json: Used for reading and writing JSON files.
○ TfidfVectorizer: From scikit-learn, used to convert a collection of raw documents to a matrix of TF-IDF features.
○ cosine_similarity: Also from scikit-learn, used to calculate cosine similarity between vectors.


import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

Function Definitions
1. load_qa_data(file_path)
Loads questions and answers from a JSON file.


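def load_qa_data(file_path):
    # Return the saved question-answer pairs, or an empty dict on first run.
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            data = json.load(file)
    except FileNotFoundError:
        # No knowledge base yet: start empty so the bot can learn from scratch.
        data = {}
    return data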

2. save_qa_data(file_path, qa_data)
Saves updated questions and answers back to the JSON file.


def save_qa_data(file_path, qa_data):
    # Overwrite the file with the full knowledge base in a readable layout.
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(qa_data, file, indent=4)
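
As a quick check of the persistence idea, the following sketch saves a knowledge base and loads it back from a temporary directory. The two functions are restated so the snippet runs on its own. The file name qa_data.json and the sample entry are assumptions made for this example.

```python
import json
import os
import tempfile


def load_qa_data(file_path):
    # Return the saved pairs, or an empty dict if the file does not exist yet.
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return json.load(file)
    except FileNotFoundError:
        return {}


def save_qa_data(file_path, qa_data):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(qa_data, file, indent=4)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "qa_data.json")  # hypothetical file name

    first_load = load_qa_data(path)  # the file is missing, so this is {}
    learned = dict(first_load)
    learned["What is TF-IDF?"] = "A term-weighting scheme for text."
    save_qa_data(path, learned)

    reloaded = load_qa_data(path)  # a later session would see the new pair

print(first_load, reloaded)
```

A later call to load_qa_data on the same path returns everything saved so far, which is how the bot keeps its knowledge between sessions.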
                            

3. preprocess_data(qa_data)
Splits the loaded dictionary, which maps each question to its answer, into two parallel lists.


def preprocess_data(qa_data):
    # Keys are questions and values are answers; both lists share the same order.
    questions = list(qa_data.keys())
    answers = list(qa_data.values())
    return questions, answers

4. find_best_match(user_question, questions, answers, threshold=0.5)
Finds the best matching answer to a user's question using TF-IDF and cosine similarity.


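def find_best_match(user_question, questions, answers, threshold=0.5):
    # An empty knowledge base has nothing to compare against.
    if not questions:
        return None

    # Fit on the stored questions plus the user's question so they share one vocabulary.
    all_questions = questions + [user_question]
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_questions)

    # The last row is the user's question; the other rows are the stored questions.
    user_vector = tfidf_matrix[-1]
    question_vectors = tfidf_matrix[:-1]
    similarities = cosine_similarity(user_vector, question_vectors).flatten()

    # Return the best-scoring answer only if it clears the threshold.
    best_match_index = similarities.argmax()
    best_match_score = similarities[best_match_index]
    if best_match_score >= threshold:
        return answers[best_match_index]
    return None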
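
The effect of the threshold is easiest to see with a small experiment. This sketch repeats the matching approach so that it runs on its own, and it adds a guard for an empty knowledge base, which is an addition in this sketch. The sample questions, answers and the 0.8 threshold are hypothetical values chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def find_best_match(user_question, questions, answers, threshold=0.5):
    # Guard added in this sketch: nothing to compare against.
    if not questions:
        return None
    tfidf_matrix = TfidfVectorizer().fit_transform(questions + [user_question])
    # Compare the last row (the user's question) with every stored question.
    similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1]).flatten()
    best_match_index = similarities.argmax()
    if similarities[best_match_index] >= threshold:
        return answers[best_match_index]
    return None


# Hypothetical knowledge base.
questions = ["What is Python?", "How do I install scikit-learn?", "What is TF-IDF?"]
answers = [
    "A programming language.",
    "Run: pip install scikit-learn",
    "A term-weighting scheme for text.",
]

query = "What is Python used for?"
lenient = find_best_match(query, questions, answers, threshold=0.5)
strict = find_best_match(query, questions, answers, threshold=0.8)
unrelated = find_best_match("Where can I buy coffee beans?", questions, answers)
print(lenient, strict, unrelated)
```

The partial match is accepted at 0.5 but rejected at 0.8, and a question with no words in common with the knowledge base is rejected at any positive threshold. A higher threshold means fewer wrong answers but more requests for help from the user.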
                            



JSON sample:


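
The original sample is not reproduced in this text, so the following is a hypothetical starter file, not the author's. It assumes the layout that preprocess_data implies: one flat JSON object in which each key is a question and each value is its answer. The entries themselves are invented for illustration.

```python
import json

# Hypothetical starter knowledge base: each key is a question, each value its answer.
starter_qa = {
    "What is Python?": "Python is a general-purpose programming language.",
    "What is TF-IDF?": "A weighting scheme that scores how important a word is to a document in a collection.",
    "What is cosine similarity?": "The cosine of the angle between two vectors, used here to compare questions.",
}

# This is the text that would be stored in the JSON file.
json_text = json.dumps(starter_qa, indent=4)
print(json_text)
```

Writing json_text to a file (for example one named qa_data.json, an assumed name) gives the bot a knowledge base to load on its first run.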

Main implementation example:


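
The original main program is not reproduced in this text, so the following is a sketch of how the pieces could fit together, not the author's implementation. The names answer_or_learn and run_chat, the reply messages, and the default file name qa_data.json are assumptions. The user prompt is passed in as the ask parameter so that the learning step can be exercised without a keyboard.

```python
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def load_qa_data(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return json.load(file)
    except FileNotFoundError:
        return {}


def save_qa_data(file_path, qa_data):
    with open(file_path, 'w', encoding='utf-8') as file:
        json.dump(qa_data, file, indent=4)


def find_best_match(user_question, questions, answers, threshold=0.5):
    if not questions:
        return None
    tfidf_matrix = TfidfVectorizer().fit_transform(questions + [user_question])
    similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1]).flatten()
    best_match_index = similarities.argmax()
    if similarities[best_match_index] >= threshold:
        return answers[best_match_index]
    return None


def answer_or_learn(user_question, qa_data, file_path, ask=input):
    # Handle one turn: answer from the knowledge base or learn a new pair.
    questions, answers = list(qa_data.keys()), list(qa_data.values())
    answer = find_best_match(user_question, questions, answers)
    if answer is not None:
        return answer
    reply = ask("Sorry, I don't know the answer to that. "
                "Can you help me improve? (yes/no): ")
    if reply.strip().lower() not in ("yes", "y"):
        return "Okay, maybe next time."
    new_answer = ask("What is the correct answer? ").strip()
    if not new_answer:
        return "No answer given, so nothing was saved."
    # Store the new pair in memory and persist it right away.
    qa_data[user_question] = new_answer
    save_qa_data(file_path, qa_data)
    return "Thanks, I have learned something new."


def run_chat(file_path="qa_data.json"):
    # Interactive loop; type 'exit' to stop.
    qa_data = load_qa_data(file_path)
    print("Type 'exit' to quit.")
    while True:
        user_question = input("You: ").strip()
        if user_question.lower() == "exit":
            break
        if not user_question:
            continue
        print("Bot:", answer_or_learn(user_question, qa_data, file_path))
```

Calling run_chat() starts a session in the terminal. Each learned pair is saved immediately, so it is available the next time the program starts.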



Get the 'Build Your Own Mini LLM' code and a starting JSON file.

If you found this project interesting, you can buy me a coffee using the link below.
