This article will guide you through building a Streamlit chat application that uses a local LLM, specifically the Llama 3.1 8b model from Meta, integrated via the Ollama library.
Prerequisites
Before we dive into the code, make sure you have the following installed:
Python
Streamlit
Ollama
Setting Up Ollama and Downloading Llama 3.1 8b
First, you’ll need to install Ollama and download the Llama 3.1 8b model. Open your command line interface and execute the following commands:
# Install the Ollama Python client (the Ollama application itself must also be installed on your machine)
pip install ollama

# Download and run the Llama 3.1 8b model
ollama run llama3.1:8b
Creating the Modelfile
To create a custom model that integrates seamlessly with your Streamlit app, follow these steps:
In your project directory, create a file named Modelfile without any extension.
Open Modelfile in a text editor and add the following content:
FROM llama3.1:8b
This file instructs Ollama to use the Llama 3.1 8b model.
The code
Importing Libraries and Setting Up Logging
import streamlit as st
from llama_index.core.llms import ChatMessage
import logging
import time
from llama_index.llms.ollama import Ollama
logging.basicConfig(level=logging.INFO)
streamlit as st: This imports Streamlit, a library for creating interactive web applications.
ChatMessage and Ollama: These are imported from the llama_index library to handle chat messages and interact with the Llama model.
logging: This is used to log information, warnings, and errors, which helps in debugging and tracking the application’s behavior.
time: This library is used to measure the time taken to generate responses.
Initializing Chat History
if 'messages' not in st.session_state:
    st.session_state.messages = []
st.session_state: This is a Streamlit feature that allows you to store variables across different runs of the app. Here, it’s used to store the chat history.
The if statement checks if ‘messages’ is already in session_state. If not, it initializes it as an empty list.
Function to Stream Chat Response
def stream_chat(model, messages):
    try:
        llm = Ollama(model=model, request_timeout=120.0)
        resp = llm.stream_chat(messages)
        response = ""
        response_placeholder = st.empty()
        for r in resp:
            response += r.delta
            response_placeholder.write(response)
        logging.info(f"Model: {model}, Messages: {messages}, Response: {response}")
        return response
    except Exception as e:
        logging.error(f"Error during streaming: {str(e)}")
        raise e
stream_chat: This function handles the interaction with the Llama model.
Ollama(model=model, request_timeout=120.0): Initializes the Llama model with a specified timeout.
llm.stream_chat(messages): Streams chat responses from the model.
response_placeholder = st.empty(): Creates a placeholder in the Streamlit app to dynamically update the response.
The for loop appends each part of the response to the final response string and updates the placeholder.
logging.info logs the model, messages, and response.
The except block catches and logs any errors that occur during the streaming process.
Main Function
def main():
    st.title("Chat with LLMs Models")
    logging.info("App started")
    model = st.sidebar.selectbox("Choose a model", ["mymodel", "llama3.1 8b", "phi3", "mistral"])
    logging.info(f"Model selected: {model}")
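    # NOTE: the published excerpt omits the body between the model selector and
    # the except block below. These lines are a sketch reconstructed from the
    # walkthrough that follows (chat input, message display, spinner, stream_chat
    # call and timing); treat them as an assumption, not the author's exact code.
    if prompt := st.chat_input("Your question"):
        st.session_state.messages.append({"role": "user", "content": prompt})

    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.write(message["content"])

    if st.session_state.messages and st.session_state.messages[-1]["role"] != "assistant":
        with st.chat_message("assistant"):
            start_time = time.time()
            with st.spinner("Writing..."):
                try:
                    messages = [ChatMessage(role=msg["role"], content=msg["content"])
                                for msg in st.session_state.messages]
                    response_message = stream_chat(model, messages)
                    duration = time.time() - start_time
                    response_message_with_duration = (
                        f"{response_message}\n\nDuration: {duration:.2f} seconds"
                    )
                    st.session_state.messages.append(
                        {"role": "assistant", "content": response_message_with_duration}
                    )
                    st.write(f"Duration: {duration:.2f} seconds")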
                except Exception as e:
                    st.session_state.messages.append({"role": "assistant", "content": str(e)})
                    st.error("An error occurred while generating the response.")
                    logging.error(f"Error: {str(e)}")
if __name__ == "__main__":
    main()
main: This is the main function that sets up and runs the Streamlit app.
st.title("Chat with LLMs Models"): Sets the title of the app.
model = st.sidebar.selectbox("Choose a model", ["mymodel", "llama3.1 8b", "phi3", "mistral"]): Creates a dropdown menu in the sidebar for model selection.
if prompt := st.chat_input("Your question"): Takes user input and appends it to the chat history.
The for loop displays each message in the chat history.
The if statement checks if the last message is not from the assistant. If true, it generates a response from the model.
with st.spinner("Writing..."): Shows a spinner while the response is being generated.
messages = [ChatMessage(role=msg["role"], content=msg["content"]) for msg in st.session_state.messages]: Prepares the messages for the Llama model.
response_message = stream_chat(model, messages): Calls the stream_chat function to get the model’s response.
duration = time.time() - start_time: Calculates the time taken to generate the response.
response_message_with_duration = f"{response_message}\n\nDuration: {duration:.2f} seconds": Appends the duration to the response message.
st.session_state.messages.append({"role": "assistant", "content": response_message_with_duration}): Adds the assistant’s response to the chat history.
st.write(f"Duration: {duration:.2f} seconds"): Displays the duration of the response generation.
The except block handles errors during the response generation and displays an error message.
To run your Streamlit app, execute the following command in your project directory:
streamlit run app.py
Make sure your Ollama instance is running in the background; otherwise the app will not be able to generate any responses.
“Llama 3.1 8b generating a detailed response to the question ‘What are Large Language Models?’ in the Streamlit app.”
“Continuation of the conversation, showing the final part of the Llama 3.1 8b’s response about Large Language Models.”
Training Models with Ollama
The same steps can be utilized to train models on different datasets using Ollama. Here’s how you can manage and train models with Ollama.
“Interactive chat interface with Llama 3.1 8b in the Streamlit app, showcasing real-time response generation.”
Ollama Commands
To use Ollama for model management and training, you’ll need to be familiar with its commands; the full command list (from the ollama usage output) is shown at the end of this section.
Example: Creating and Using a Model
1. Create a Modelfile: In your project directory, create a Modelfile with instructions for your custom model.
2. Content of Modelfile:
# Example content for creating a custom model
name: custom_model
base_model: llama3.1
data_path: /path/to/your/dataset
epochs: 10
3. Create the Model: Use the create command to create a model from the Modelfile.
ollama create custom_model -f Modelfile
4. Run the Model: Once the model is created, you can run it using:
ollama run custom_model
5. Integrate with Streamlit: You can integrate this custom model with your Streamlit application in the same way you integrated the pre-trained models.
By following these steps, you can create a Streamlit application that interacts with local LLMs using the Ollama library.
C:\your\path\location>ollama
Usage:
  ollama [flags]
  ollama [command]

Available Commands:
  serve       Start ollama
  create      Create a model from a Modelfile
  show        Show information for a model
  run         Run a model
  pull        Pull a model from a registry
  push        Push a model to a registry
  list        List models
  ps          List running models
  cp          Copy a model
  rm          Remove a model
  help        Help about any command

Flags:
  -h, --help      help for ollama
  -v, --version   Show version information

Use "ollama [command] --help" for more information about a command.
Additionally, you can use the same steps and Ollama commands to train and manage models on different datasets. This flexibility allows you to leverage custom-trained models in your Streamlit applications, providing a more tailored and interactive user experience.
Implementation with Flask
This methodology can also be utilized to implement chat applications using Flask. Here is an outline for integrating Ollama with a Flask app:
Flask Application Setup
Install Flask:
pip install Flask
2. Create a Flask App:
from flask import Flask, request, jsonify
from llama_index.core.llms import ChatMessage
from llama_index.llms.ollama import Ollama
import logging

logging.basicConfig(level=logging.INFO)

app = Flask(__name__)

@app.route('/chat', methods=['POST'])
def chat():
    data = request.json
    messages = data.get('messages', [])
    model = data.get('model', 'llama3.1:8b')

    try:
        llm = Ollama(model=model, request_timeout=120.0)
        chat_messages = [ChatMessage(role=m["role"], content=m["content"]) for m in messages]
        resp = llm.stream_chat(chat_messages)
        response = ""
        for r in resp:
            response += r.delta
        logging.info(f"Model: {model}, Messages: {messages}, Response: {response}")
        return jsonify({'response': response})
    except Exception as e:
        logging.error(f"Error during streaming: {str(e)}")
        return jsonify({'error': str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
Running the Flask Application
Save the code in a file (e.g., app.py) and run the following command:
python app.py
This will start the Flask application, and you can make POST requests to the /chat endpoint with JSON data containing the messages and model to get responses from the Llama model.
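As a quick check, here is a hedged example of calling the endpoint with the requests library; the URL assumes Flask's default development server, and the payload simply mirrors what the route above expects:

import requests

payload = {
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What are Large Language Models?"}],
}
resp = requests.post("http://127.0.0.1:5000/chat", json=payload, timeout=180)
print(resp.json())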
Integrating Flask with Ollama
By following similar steps as shown for Streamlit, you can integrate Ollama with a Flask application. The stream_chat function can be reused, and the Flask routes can handle the interaction with the model, making it easy to create scalable chat applications.
Conclusion
By following this guide, you’ve successfully set up a Streamlit chat application using a local LLM. This setup allows you to interact with powerful language models directly from your local machine, providing a visually appealing and interactive experience. Whether you’re asking general questions or delving into specific inquiries, your app is now equipped to handle it all.
“Thank you for exploring the power of Large Language Models with us. Goodbye!”
Before learning what Active Directory is, let's answer a simpler question: what is a directory?
What is Directory?
A directory is basically a hierarchical arrangement of different kinds of entities. Entities can be anything: documents, books, access controls, address books, or dictionaries.
What is Active directory?
Active Directory is like a digital directory that lets computers and other devices be managed in a network. Active Directory (AD) is a Microsoft technology and a primary feature of the Windows Server operating system. AD enables centralized management, authentication, authorization, and access control mechanisms.
Why are directories important?
Making directories is like arranging your space properly, along with proper rights and access mechanisms.
Directories are important because they help in:
Organizing information and documents. A directory keeps data stored in a logical manner and makes it easy to access.
Directories make our work quicker and more efficient.
Security measures like access controls, authentication, and authorization become easier to embed in well-organized directories.
Scalability increases. Scalability here means the ease of adding new data or files to the directory.
Directories allow easy integration between systems and applications. Protocols like LDAP (Lightweight Directory Access Protocol), which we will look at shortly, and APIs (Application Programming Interfaces) enable interoperability between software platforms, allowing them to exchange information.
Architecture of Active Directory
As we saw above, Active Directory is a hierarchical structure built for efficient usage and organization of entities.
Active Directory consists of several components. Some of the major ones are:
1. Domains: This is the fundamental unit of the logical organization. It represents a group of network objects (computers, users, devices) that share a common directory database, security policies, and trust relationships. Each domain has its own unique name and can be managed independently.
2. Domain Controllers (DCs): These are the servers that store a copy of the AD database and authenticate users and computers within the domain. They replicate changes to the AD database among themselves to ensure consistency and fault tolerance.
3. Active Directory Database: This is the database that holds all directory information, such as users, groups, computers, and OUs, along with the schema that defines their structure.
4. Trees & child domains: Think of this as a family tree, like the ones we drew in primary school. It is the hierarchical structure starting from a root domain and branching into child domains.
5. Forest: It is a collection of one or more trees/domains which share a common schema, configuration, and global catalog.
6. Organizational Units (OUs): These are like class representatives. They manage and organize objects within a specific domain: you can apply group policies to them, delegate admin tasks, and simplify directory management.
7. Global Catalog (GC): This holds a partial replica of all the objects from each domain within the forest. Think of it like a student searching for a relevant course at a specific college: they consult the global catalogue and then pick a specific course.
Services of Active Directory
AD DS — Active Directory Domain Services — is the core directory service provided by the Microsoft Windows Server operating system. It has many features, but it is essentially responsible for organizing and controlling access to network resources in a Windows domain environment.
Basic services provided by AD DS are:
Authentication: Verifying the identity of a user or device, for example entering a password to log in to an account.
Authorization: The level of access a verified identity is granted. For example, you are allowed to use the lab computers at school, but not the teachers' computers in the staff room.
Directory Services: The central store that organizes information about users, groups, computers, and other network resources in a centralized database.
Certificate Services: Handles issuing, renewing, and revoking the certificates used for authentication, encryption, and digital signatures.
DNS Integration: AD DS relies on DNS for name resolution; DNS resolves domain names to IP addresses and vice versa.
Some Key protocols associated with AD DS are:
1. LDAP (Lightweight Directory Access Protocol): This protocol is easy to understand and implement. It allows clients to search, add, modify, and delete directory objects like users, groups, and computers. It works over TCP/IP and is the primary means of communication between AD DS clients and domain controllers. (A minimal Python example of querying AD over LDAP appears after this list.)
2. Kerberos Authentication: As the name suggests, it is used for secure authentication between clients and domain controllers (DCs). It works over TCP and UDP. The main components of Kerberos are the Authentication Server, the Database, and the TGS (Ticket Granting Server).
Process of Kerberos:
a. The user sends a login request (ticket-granting request). This request is sent to the KDC (Key Distribution Center).
b. The Authentication Server verifies the user against the database. If the user is verified, they receive a ticket-granting ticket (TGT) and a session key; if not, the login request fails. The TGT and session key are encrypted using the user's password.
c. Then comes the role of TGS that is Ticket Granting server. TGS verifies the TGT and issues a service ticket for requested service. The service ticket is encrypted using the service’s secret key, not the user’s password.
d. The target service then decrypts the service ticket using its own secret key and verifies the user’s identity. If the verification is successful, the user is granted access to requested service.
3. NTLM (New Technology LAN Manager): It is used in single sign-on (SSO) processes. When a user tries to access the network, the server sends a challenge (a 16-byte random value). The user encrypts the challenge using a hash of their password and sends it back to the server. The server forwards this to the DC, which retrieves the user's password hash from the database and encrypts the challenge itself. The DC then compares its encrypted challenge with the client's response. If both match, authentication is successful and access is granted.
4. Kerberos & LDAP over SSL/TLS: Running LDAP (LDAPS) and Kerberos authentication over SSL/TLS adds encryption, which increases the confidentiality and integrity of directory operations.
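To make the LDAP discussion concrete, here is a hedged Python sketch that queries AD over LDAP using the third-party ldap3 library. The domain controller address, credentials, and search base are placeholders, not values from this article.

from ldap3 import Server, Connection, NTLM, ALL, SUBTREE

# Placeholder connection details - replace with your own domain controller and account
server = Server("ldap://dc01.example.local", get_info=ALL)
conn = Connection(server,
                  user="EXAMPLE\\jdoe",
                  password="SuperSecret!",
                  authentication=NTLM,
                  auto_bind=True)

# Search the directory for user objects and read a couple of attributes
conn.search(search_base="DC=example,DC=local",
            search_filter="(objectClass=user)",
            search_scope=SUBTREE,
            attributes=["sAMAccountName", "memberOf"])

for entry in conn.entries:
    print(entry)

conn.unbind()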
Advantages Of AD:
Centralized Management — Managing everything from a single point is known as centralized management. AD manages users, computers, groups, and other network resources centrally.
Single Sign-On — Users can log in to different applications and resources using a single set of domain credentials.
Integration with Microsoft services — AD integrates with Microsoft services like Exchange Server, SharePoint, and Office 365, which increases productivity.
Group policy management — AD gives the right to enforce policies and configurations in the specific part of network or devices. This enhances security management.
Identity Management — AD allows admins to manage user identities, credentials & access permissions.
Disadvantages Of AD:
Complexity — Implementing and managing AD & AD Domain Services (ADDS) is complex and requires proper in-depth understanding.
Single point of failure — If the primary domain controller fails, the whole directory and its dependent services can go down.
Maintenance Overhead — ADDS requires more efforts to maintain starting from software updates to patching vulnerabilities to uphold performance.
Compatibility — AD is built specifically for Windows; integrating it with other operating systems can be challenging.
Cost — AD requires licensing fees and may require additional hardware resources which increases the cost.
It’s been exactly a decade since I started attending GeekCon (yes, a geeks’ conference 🙂) — a weekend-long hackathon-makeathon in which all projects must be useless and just-for-fun, and this year there was an exciting twist: all projects were required to incorporate some form of AI.
My group’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk to, and then verbally expresses anything they’d like to the character. This spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.
Now that the game is up and running, bringing laughs and fun, I’ve crafted this how-to guide to help you create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.
Once the server is running, the user will hear the app “talking”, prompting them to choose the figure they want to talk to and start conversing with their selected character. Each time they want to talk out loud — they should press and hold a key on the keyboard while talking. When they finish talking (and release the key), their recording will be transcribed by Whisper (a speech-to-text model by OpenAI), and the transcription will be sent to ChatGPT for a response. The response will be read out loud using a text-to-speech library, and the user will hear it.
Implementation
Disclaimer
Note: The project was developed on a Windows operating system and incorporates the pyttsx3 library, which lacks compatibility with M1/M2 chips. As pyttsx3 is not supported on Mac, users are advised to explore alternative text-to-speech libraries that are compatible with macOS environments.
OpenAI Integration
I utilized two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API for generating responses based on the user’s input to their selected figure. While doing so costs money, the pricing model is very cheap, and personally, my bill is still under $1 for all my usage. To get started, I made an initial deposit of $5, and to date, I have not exhausted this deposit, and this initial deposit won’t expire until a year from now. I’m not receiving any payment or benefits from OpenAI for writing this.
Once you get your OpenAI API key — set it as an environment variable to use upon making the API calls. Make sure not to push your key to the codebase or any public location, and not to share it unsafely.
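For example, a minimal sketch of loading the key from an environment variable (OPENAI_API_KEY is the conventional name; the exact wiring in the project may differ):

import os
import openai

# Assumes OPENAI_API_KEY was exported beforehand (shell profile, .env file, etc.)
openai.api_key = os.environ.get("OPENAI_API_KEY")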
Speech to Text — Create Transcription
The implementation of the speech-to-text feature was achieved using Whisper, an OpenAI model.
Below is the code snippet for the function responsible for transcription:
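The article includes only the tail of this function, so the head below is a hedged reconstruction: it assumes the pre-1.0 OpenAI Python client (openai.Audio.transcribe) and the repo's print_text_while_waiting_for_transcription helper, and should be read as a sketch rather than the exact original.

import asyncio
import openai

async def get_transcript(audio_file_path: str, timeout: int = 30):
    transcript = None
    # Keep the user informed while we wait for the transcription (helper from the repo)
    printing_task = asyncio.create_task(print_text_while_waiting_for_transcription())
    try:
        with open(audio_file_path, "rb") as audio_file:
            # Run the blocking Whisper call in a thread so we can enforce a timeout
            result = await asyncio.wait_for(
                asyncio.to_thread(openai.Audio.transcribe, "whisper-1", audio_file),
                timeout=timeout,
            )
        transcript = result["text"]
    except asyncio.TimeoutError:
        pass
    finally:
        printing_task.cancel()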
    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript
This function is marked as asynchronous (async) since the API call may take some time to return a response, and we await it to ensure that the program doesn’t progress until the response is received.
As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is gradually printed as the user awaits the next step.
String Matching Using FuzzyWuzzy for Text Comparison
After transcribing the speech into text, we either utilized it as is, or attempted to compare it with an existing string.
The comparison use cases were: selecting a figure from a predefined list of options, deciding whether to continue playing or not, and when opting to continue – deciding whether to choose a new figure or stick with the current one.
In such cases, we wanted to compare the user’s spoken input transcription with the options in our lists, and therefore we decided to use the FuzzyWuzzy library for string matching.
This enabled choosing the closest option from the list, as long as the matching score exceeded a predefined threshold.
from fuzzywuzzy import fuzz

# (inside the matching function) start with no match and a score of zero
best_match = ""
best_match_score = 0

for option in options:
    score = fuzz.token_set_ratio(transcript.lower(), option.lower())
    if score > best_match_score:
        best_match_score = score
        best_match = option

if best_match_score >= 70:
    return best_match
else:
    return ""
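As a quick illustration (the transcript and options here are made up), token_set_ratio scores the candidate figures like this:

from fuzzywuzzy import fuzz

options = ["Shrek", "Donald Trump", "Oprah Winfrey"]
transcript = "i would like to talk to shrek please"

scores = {option: fuzz.token_set_ratio(transcript.lower(), option.lower()) for option in options}
print(scores)  # "Shrek" scores highest and clears the 70-point threshold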
If you want to learn more about the FuzzyWuzzy library and its functions — you can check out an article I wrote about it here.
Get ChatGPT Response
Once we have the transcription, we can send it over to ChatGPT to get a response.
For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.
So our function looked as follows:
def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e
and the system instructions looked as follows:
def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You are: {figure}"
Text to Speech
For the text-to-speech part, we opted for a Python library called pyttsx3. This choice was not only straightforward to implement but also offered several additional advantages. It’s free of charge, provides two voice options — male and female — and allows you to select the speaking rate in words per minute (speech speed).
When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the selected gender.
This is what our text-to-speech function looked like:
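The listing itself isn't reproduced in the article; the sketch below is a minimal version built on pyttsx3's documented API. The function name matches how it is called elsewhere in the code, but the exact parameters (voice index, speaking rate) are assumptions.

import pyttsx3

def text_to_speech(text: str, voice_index: int = 0, rate: int = 160) -> None:
    engine = pyttsx3.init()
    voices = engine.getProperty("voices")                 # available system voices
    engine.setProperty("voice", voices[voice_index].id)   # pick the male/female voice
    engine.setProperty("rate", rate)                      # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()                                   # block until speaking finishes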
Now that we’ve more or less got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Eventually, most of these higher-level functions tie into the internal functions we’ve covered above.
Here’s a snippet of the main game flow:
import asyncio
from src.handle_transcript import text_to_speech
from src.main_flow_helpers import choose_figure, start, play_round, \
    is_another_round
def farewell() -> None:
    farewell_message = "It was great having you here, " \
                       "hope to see you again soon!"
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)
async def main():
    figure, another_round = None, True

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = \
                user_choices.get("figure"), user_choices.get("another_round")
            if not figure:
                break

        if another_round is False:
            farewell()
            break
if __name__ == "__main__":
    asyncio.run(main())
The Roads Not Taken
We had several ideas in mind that we didn’t get to implement during the hackathon. This was either because we did not find an API we were satisfied with during that weekend, or due to the time constraints preventing us from developing certain features. These are the paths we didn’t take for this project:
Matching the Response Voice with the Chosen Figure’s “Actual” Voice
Imagine if the user chose to talk to Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we couldn’t find a library or API during the hackathon that offered this feature at a reasonable cost. We’re still open to suggestions if you have any =)
Let the Users Talk to “Themselves”
Another intriguing idea was to prompt users to provide a vocal sample of themselves speaking. We would then train a model using this sample and have all the responses generated by ChatGPT read aloud in the user’s own voice. In this scenario, the user could choose the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble that of the user. However, we couldn’t find an API that supported this within the constraints of the hackathon.
Adding a Frontend to Our Application
Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize the backend development. As a result, the application currently runs on the command line interface (CLI) and doesn't have a frontend.
Additional Improvements We Have In Mind
Latency is what bothers me most at the moment.
There are several components in the flow with a relatively high latency that, in my opinion, slightly harm the user experience. For example: the time between finishing the audio input and receiving a transcription, and the delay between the user pressing the key and the system actually starting to record the audio. So if the user starts talking right after pressing the key, at least the first second of audio is lost to this lag.
Also, warm credit goes to Lior Yardeni, my hackathon partner with whom I created this game.
Summing Up
In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We’ve used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a modest cost, it’s budget-friendly.
We hope you’ve found this guide enlightening and that it’s motivating you to embark on your projects.
Beginner’s step-by-step guide To Build A RAG Application From Scratch
Typical RAG Application | Skanda Vivek
In this blog, you are going to learn how to create your first RAG application for question answering over a document. Companies like Adobe are adopting these Q&A and chatting over documents as beta capabilities. Done right, these are powerful capabilities, empowering readers to gain novel insights from documents and save valuable time. Through building this app, you will encounter the multiple aspects necessary for successfully performing tasks like Q&A over documents. These include extracting information, retrieving the relevant context, and utilizing this context to generate accurate results.
By the end of this blog, you will have built a question answering application over a 10-Q financial document, taking the Amazon Q1 2023 financial statement as the representative document, following the steps shown in the figure above. First, we are going to discuss how to extract information from this document. Second, we will look at breaking the document into smaller chunks, to fit into LLM context windows. Third, we will discuss two strategies to save documents for future retrieval. One is storing the text as is for keyword based retrieval. The other is converting text into vector embeddings, for more efficient retrieval. Fourth, we will discuss saving this to a relevant database. Fifth, we will discuss obtaining relevant chunks based on user inputs. Finally, we will discuss how to incorporate relevant document chunks as part of LLM context, for generating the output. Steps 1 through 4 are referred to as the indexing pipeline, wherein documents are indexed in a database offline, prior to user interactions. Steps 5 and 6 happen in real-time as the user is querying the application.
The first step for answering questions over documents is to extract information as text for LLM. In my experience, the step of extraction is often the most overlooked factor, critical to the success of RAG applications. This is because ultimately, the quality of answers from the LLM depends on the data context that is provided. If this data has accuracy or consistency issues, this will lead to poor results overall. This section goes into the ways to extract data for RAG applications, focusing on extracting data from PDF documents in particular. You can think of the entire stage from extracting, to ultimately storing of data in the right database as similar to the traditional extract, transform, load (ETL) process where information is retrieved from an original data source, undergoes a series of modifications (including data cleansing and formatting), and is subsequently stored in a target data repository.
The basic way to extract text is to extract all the information from the PDF as a large string. This string can then be broken down into smaller chunks, to fit into LLM context windows.
PyMuPDF is one such library that makes it easy to extract text from PDF documents as a string. There are other text parsers like PyPDF and PDFMiner with similar functionality. The advantage of PyMuPDF is that it supports parsing of multiple formats including txt, xps, images, etc., which is not possible in some of the other packages mentioned. Below, you can see how to extract text from the Amazon Q1 2023 PDF document using PyMuPDF, as a string:
import requests
import fitz
import io

url = "https://s2.q4cdn.com/299287126/files/doc_financials/2023/q1/Q1-2023-Amazon-Earnings-Release.pdf"
request = requests.get(url)
filestream = io.BytesIO(request.content)

with fitz.open(stream=filestream, filetype="pdf") as doc:
    # concatenating text to a string and printing out the first 10 characters
    text = ""
    for page in doc:
        text += page.get_text()
    print(text[:10])
Chunking Data
A natural first question is — why do all this, why not just send all the text to the LLM and let it answer questions? Let’s take the Amazon Q1 2023 document for example. The entire text is ~50k characters. If you try passing all the text as context to GPT-3.5, you get an error due to the context being too long.
LLMs typically have a token limit (each token is roughly 3/4th a word). Let’s see how to solve this with chunking. Chunking involves dividing a lengthy text into smaller sections that an LLM can process more efficiently.
The figure below outlines how to build a basic RAG that utilizes an LLM over custom documents for question answering. The first part is splitting multiple documents into manageable chunks. The associated parameter is the maximum chunk length. These chunks should be of the typical (minimum) size of text that contain the answers to the typical questions asked. This is because sometimes the question you ask might have answers at multiple locations within the document. For example, you might ask the question “What was X company’s performance from 2015 to 2020?” And you might have a large document (or multiple documents) containing specific information about company performance over the years in different parts of the document. You would ideally want to capture all disparate parts of the document(s) containing this information, link them together, and pass to an LLM for answering based on these filtered and concatenated document chunks.
RAG Components | Skanda Vivek
The maximum context length is basically the maximum length for concatenating various chunks together, leaving some space for the question itself and the output answer. Remember that LLMs like GPT3.5 have a strict length limit that includes all the content: question, context, and answer. Finding the right chunking strategy is crucial for building high quality RAG applications.
There are different methods to chunk based on use-case. Here are five levels of chunking based on the complexity and effectiveness.
Fixed Size Chunking: This is the most basic method, where the text is split into chunks of a specified number of characters, without considering the content or structure. It’s simple to implement but may result in chunks that lack coherence or context.
Recursive Chunking: This method splits the text into smaller chunks using a set of separators (like newlines or spaces) in a hierarchical and iterative manner. If the initial splitting doesn’t produce chunks of the desired size, it recursively calls itself on the resulting chunks with a different separator.
Document Based Chunking: In this approach, the text is split based on its inherent structure, such as markdown formatting, code syntax, or table layouts. This method preserves the flow and context of the content but may not be effective for documents lacking clear structure.
Semantic Chunking: This strategy aims to extract semantic meaning from embeddings and assess the semantic relationship between chunks. It adaptively picks breakpoints between sentences using embedding similarity, keeping together chunks that are semantically related.
Agentic Chunking: This approach explores the possibility of using a language model to determine how much and what text should be included in a chunk based on the context. It generates initial chunks using propositional retrieval and then employs an LLM-based agent to determine whether a proposition should be included in an existing chunk or if a new chunk should be created.
The similarity threshold is the way to compare the question with document chunks, to find the top chunks, most likely to contain the answer. Cosine similarity is the typical metric used, but you might want to weigh different metrics, such as including a keyword metric to weight contexts with certain keywords more. For example, you might want to weight contexts that contain the words “abstract” or “summary” when you ask the question to an LLM to summarize a document.
Let's use simple fixed chunking in our first RAG app, splitting chunks by sentences where necessary. For this, we need to split the text into chunks when they reach a provided maximum token length. The OpenAI tokenizer below can be used to tokenize text and calculate the number of tokens.
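For reference, here is a minimal sketch of loading that tokenizer with the tiktoken package; the cl100k_base encoding matches what the tokenize function further below uses:

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")
tokens = tokenizer.encode("Net sales increased 9% to $127.4 billion in the first quarter.")
print(len(tokens))  # number of tokens in this sentence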
This text can then be split into multiple contexts as below, for LLM comprehension, based on a token limit. For this, the text is split into sentences from the period delimiter, and sentences are appended to a chunk. If the chunk length is beyond the token limit, that chunk is truncated, and the next chunk is started. In the figure below, you can see an example of chunking by sentences, where three chunks are displayed as three distinct paragraphs.
Sample Fixed Chunking | Skanda Vivek
Here is the split_into_many function that does the same:
def split_into_many(text: str, tokenizer: tiktoken.Encoding, max_tokens: int = 1024) -> list:
    """ Function to split a string into many strings of a specified number of tokens """

    sentences = text.split('. ')  #A
    n_tokens = [len(tokenizer.encode(" " + sentence)) for sentence in sentences]  #B

    chunks = []
    tokens_so_far = 0
    chunk = []

    for sentence, token in zip(sentences, n_tokens):  #C

        if tokens_so_far + token > max_tokens:  #D
            chunks.append(". ".join(chunk) + ".")
            chunk = []
            tokens_so_far = 0

        if token > max_tokens:  #E
            continue

        chunk.append(sentence)  #F
        tokens_so_far += token + 1

    if chunk:
        chunks.append(". ".join(chunk) + ".")

    return chunks

#A Split the text into sentences
#B Get the number of tokens for each sentence
#C Loop through the sentences and tokens joined together in a tuple
#D If the number of tokens so far plus the number of tokens in the current sentence is greater than the max number of tokens, then add the chunk to the list of chunks and reset
#E If the number of tokens in the current sentence is greater than the max number of tokens, go to the next sentence
#F Otherwise, add the sentence to the chunk and add the number of tokens to the total
Finally, you can tokenize the entire text by calling the tokenize function, which ties together the logic from above:

import tiktoken
import pandas as pd

def tokenize(text, max_tokens) -> pd.DataFrame:
    """ Function to split the text into chunks of a maximum number of tokens """

    tokenizer = tiktoken.get_encoding("cl100k_base")  #A

    df = pd.DataFrame([text], columns=['text'])
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))  #B

    shortened = []
    for row in df.iterrows():
        if row[1]['text'] is None:  #C
            continue
        if row[1]['n_tokens'] > max_tokens:  #D
            shortened += split_into_many(row[1]['text'], tokenizer, max_tokens)
        else:  #E
            shortened.append(row[1]['text'])

    df = pd.DataFrame(shortened, columns=['text'])
    df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))

    return df

#A Load the cl100k_base tokenizer which is designed to work with the ada-002 model
#B Tokenize the text and save the number of tokens to a new column
#C If the text is None, go to the next row
#D If the number of tokens is greater than the max number of tokens, split the text into chunks
#E Otherwise, add the text to the list of shortened texts
In the figure below, you can see how the entire dataframe looks, after running tokenize(text,500). Each chunk is a separate row, and there are 13 chunks in total. The chunk text is in the ‘text’ column, the number of tokens for that text in the ‘n_tokens’ column.
Chunked Data | Skanda Vivek
Retrieval Methods
The next step, after document extraction and chunking, is to store these documents in an appropriate format so that relevant documents or passages can be easily retrieved in response to future queries. In the following sections, you are going to see two characteristic methods to retrieve relevant LLM context: keyword based retrieval and vector embeddings based retrieval.
Keyword Based Retrieval
The easiest way to sort relevant documents is to do a keyword match and find documents with the highest match. For this, we need to first define a way to match documents based on keywords. In information retrieval, two important concepts form the foundation of many ranking algorithms: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF) measures how often a term appears in a document. It’s based on the assumption that the more times a term occurs in a document, the more relevant that document is to the term.
TF(t,d) = Number of times term t appears in document d / Total number of terms in document d
Inverse Document Frequency (IDF) measures the importance of a term across the entire corpus of documents. It assigns higher importance to terms that are rare in the corpus and lower importance to terms that are common.
IDF(t) = log(Total number of documents / Number of documents containing term t)
The TF-IDF score is then calculated by multiplying TF and IDF:
TF-IDF(t,d) = TF(t,d) * IDF(t)
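As a concrete illustration, here is a hedged sketch that computes TF-IDF weights for a toy corpus using scikit-learn (the corpus is made up, and scikit-learn uses a smoothed, log-scaled variant of the IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Net sales increased in the first quarter",
    "Operating income increased in the first quarter",
    "It is quite windy in Boston",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

# Each row is a document, each column a term, each value its TF-IDF weight
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))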
While TF-IDF is useful, it has limitations. This is where the Okapi BM25 algorithm comes in, offering a more sophisticated approach to document ranking.
The Okapi BM25 is a common algorithm for matching documents based on keywords. Given a query Q containing keywords q1, q2, ..., qn, the BM25 score of a document D is:

score(D, Q) = sum over i of IDF(qi) * f(qi, D) * (k1 + 1) / ( f(qi, D) + k1 * (1 - b + b * |D| / avgdl) )

The function f(qi, D) is the number of times qi occurs in D, and k1 and b are constants. IDF denotes the inverse document frequency of the word qi. IDF measures the importance of a term in the entire corpus: it assigns higher importance to terms that are rare in the corpus and lower importance to terms that are common. This is used to normalize the contributions of common words like ‘the’ or ‘and’ in search results. |D| is the length of document D in words, and avgdl is the average document length in the text collection from which documents are drawn.
The BM25 formula can be understood as an extension of TF-IDF:
It uses IDF to weigh the importance of terms across the corpus, similar to TF-IDF.
The term frequency component (f(qi,D)) is normalized using a saturation function, which prevents the score from increasing linearly with term frequency. This addresses a limitation of basic TF-IDF.
It incorporates document length normalization (|D| / avgdl), adjusting for the fact that longer documents are more likely to have higher term frequencies simply due to their length.
By considering these additional factors, BM25 often provides more accurate and nuanced document ranking compared to simpler TF-IDF approaches, making it a popular choice in information retrieval systems.
The BM25 score is 0 when there is no keyword overlap between the query and the document, and it increases as more of the query's keywords appear in the document (the raw scores are not normalized to a fixed range). For example, if the user input is “windy day” and the document is “It is quite windy” — the BM25 algorithm would yield a non-zero result. Here is a snippet of a Python implementation of BM25:
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there how are you!",
    "It is quite windy in Boston",
    "How is the weather tomorrow?"
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "windy day"
tokenized_query = query.split(" ")
doc_scores = bm25.get_scores(tokenized_query)

# Output: array([0.        , 0.48362189, 0.        ])  #A
#A Only the second document and the query have an overlap, the others do not
The user input here is “windy day”. As you can see, there is an overlap between the second document in the corpus (“It is quite windy in Boston”), and the input, which is reflected by the second score being the highest (0.48).
However, you also see that the third document (“How is the weather tomorrow?”) is related to the input (as both discuss the weather). We would like the third document to have some non-zero score. This is where the concept of semantic similarity and vector embeddings comes in. A classic example of this is when the user searches for “Wild West” and expects information about cowboys. Semantic search means that the algorithm is intelligent enough to know that cowboys and the Wild West are similar concepts (while using different words). This becomes important for RAG because it is quite possible the user types a query that is not phrased exactly as in the document, so we need a good measure of semantic similarity to find the relevant documents according to the user's intent.
Vector Embeddings
Vector search helps in choosing the relevant context when you have vast amounts of data, spanning hundreds or more documents. Vector search is a technique in information retrieval and machine learning that uses vector representations of data points to efficiently find similar items in a large dataset. It involves encoding data into high-dimensional vectors and using distance metrics to measure similarity between these vectors.
In the figure below, you can see a simplified two-dimensional vector space:
X-axis: Size (small = 0, big = 1)
Y-axis: Type (tree = 0, animal = 1)
This example illustrates both direction and magnitude:
A small tree might be represented as (0, 0)
A big tree as (1, 0)
A small animal as (0, 1)
A big animal as (1, 1)
The direction of the vector indicates the combination of features, while the magnitude (length) of the vector represents the strength or prominence of those features.
This is just a conceptual example and can be scaled to hundreds or more dimensions, each representing different attributes of the data. In real-world applications, these vectors often have much higher dimensionality, allowing for more nuanced representations of complex data.
The same can also be done with text as below, and yields better semantic similarity as compared to keyword search. An appropriate embedding algorithm would be able to judge which contexts are most relevant to user input, and which contexts are not as relevant, crucial for the retrieval step in RAG applications. Once this relevant context is found, this can be added to the user input, and passed to an LLM for generating the appropriate output, sent back to the user.
Vector Search 101 | Skanda Vivek
Notice how in the figure below, the vectorization is able to capture the semantic representation (i.e., it knows that a sentence talking about a bird swooping in on a baby chipmunk should be in the (small, animal) quadrant, whereas the sentence talking about yesterday's storm when a large tree fell on the road should be in the (big, tree) quadrant). In reality, there are more than two dimensions. For example, the OpenAI embedding model has 1,536 dimensions.
Vector Search 101 With Words | Skanda Vivek
Obtaining embeddings is quite easy from OpenAI’s embedding model. For this blog, we will use OpenAI embedding and LLM models. The OpenAI embedding model costs $0.10 /1M tokens, where each token is roughly 3/4th a word. A token is a word/subword. When text is passed through a tokenizer, it encodes the input based on a specific scheme and emits specialized vectors that can be understood by the LLM. The cost is quite minimal — roughly 10 cents per 3000 pages, but can add up as the number of documents and users scale up.
import openai
from getpass import getpass

api_key = getpass('Enter the OpenAI API Key in the cell ')
client = openai.OpenAI(api_key=api_key)
openai.api_key = api_key

def get_embedding(text, model="text-embedding-ada-002"):
    return client.embeddings.create(input=[text], model=model).data[0].embedding  #A

e1 = get_embedding('the boy went to a party')
e2 = get_embedding('the boy went to a party')
e3 = get_embedding("""We found evidence of bias in our models via running the SEAT (May et al, 2019) and the Winogender (Rudinger et al, 2018) benchmarks. Together, these benchmarks consist of 7 tests that measure whether models contain implicit biases when applied to gendered names, regional names, and some stereotypes. For example, we found that our models more strongly associate (a) European American names with positive sentiment, when compared to African American names, and (b) negative stereotypes with black women.""")

#A Let's now get the embeddings of a few sample texts below.
The first two texts (corresponding to embeddings e1 and e2) are the same, so we would expect their embeddings to be the same, while the third text is completely different. To find the similarity between embedding vectors, we use cosine similarity. Cosine similarity measures the similarity between two vectors, measured by the cosine of the angle between the two vectors. Cosine similarity of 0 means that these texts are completely different, whereas cosine similarity of 1 implies identical or near identical text. We use the query below to find the cosine similarity:
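The exact snippet is not shown here, but a minimal sketch using SciPy's cosine distance (the same helper the retrieval code uses later) would be:

from scipy import spatial

print(1 - spatial.distance.cosine(e1, e2))  # identical texts -> cosine similarity of 1.0
print(1 - spatial.distance.cosine(e1, e3))  # unrelated texts -> noticeably lower similarity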
As you can see, the cosine similarity (1 - the cosine distance) is 1 for the first two texts, which are identical, and much lower between either of them and the third, unrelated text.
Vector Embeddings For Finding Relevant Context
Let's now see how well vector embeddings do for choosing the right context for answering a question. Let's say we want to ask the question below, about Amazon's Q1 2023 results:
prompt = """What was the sales increase for Amazon in the first quarter?"""

We can get the answer from the GPT3.5 (ChatGPT) API as below:

def get_completion(prompt, model="gpt-3.5-turbo"):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )  #A
    return response.choices[0].message.content

Answer: The sales increase for Amazon in the first quarter was 9%, with net sales increasing to $127.4 billion compared to $116.4 billion in the first quarter of 2022. Excluding the impact of foreign exchange rates, the net sales increased by 11% compared to the first quarter of 2022.

#A calling OpenAI Completions endpoint
As you can see, while the above answer is not wrong, it is not the one we are looking for (we are looking for the sales increase for Q1 2023, not Q1 2022). So it is important to feed the right context to the LLM — in this case, this would be context related to sales performance in Q1 2023. Let’s say we have a choice of the three contexts below to append to the LLM:
#A Below are three contexts, from which the relevant ones need to be chosen, related to the user input (sales performance in Q1 2023)

context1 = """Net sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022. Excluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the quarter, net sales increased 11% compared with first quarter 2022. North America segment sales increased 11% year-over-year to $76.9 billion. International segment sales increased 1% year-over-year to $29.1 billion, or increased 9% excluding changes in foreign exchange rates. AWS segment sales increased 16% year-over-year to $21.4 billion."""

context2 = """Operating income increased to $4.8 billion in the first quarter, compared with $3.7 billion in first quarter 2022. First quarter 2023 operating income includes approximately $0.5 billion of charges related to estimated severance costs. North America segment operating income was $0.9 billion, compared with operating loss of $1.6 billion in first quarter 2022. International segment operating loss was $1.2 billion, compared with operating loss of $1.3 billion in first quarter 2022. AWS segment operating income was $5.1 billion, compared with operating income of $6.5 billion in first quarter 2022."""

context3 = """Net income was $3.2 billion in the first quarter, or $0.31 per diluted share, compared with net loss of $3.8 billion, or $0.38 per diluted share, in first quarter 2022. All share and per share information for comparable prior year periods throughout this release have been retroactively adjusted to reflect the 20-for-1 stock split effected on May 27, 2022. First quarter 2023 net income includes a pre-tax valuation loss of $0.5 billion included in non-operating expense from the common stock investment in Rivian Automotive, Inc., compared to a pre-tax valuation loss of $7.6 billion from the investment in first quarter 2022."""
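As a quick, hedged check (reusing the get_embedding helper and SciPy, with the prompt defined earlier), you can rank the three candidates by cosine similarity:

from scipy import spatial

q_embedding = get_embedding(prompt)
for name, ctx in [("context1", context1), ("context2", context2), ("context3", context3)]:
    similarity = 1 - spatial.distance.cosine(q_embedding, get_embedding(ctx))
    print(name, round(similarity, 3))  # context1 should come out on top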
Measuring the cosine similarity between the query embedding and the three context embeddings shows that context1 has the highest similarity with the query. Thus, appending this context to the user input and sending it to the LLM is more likely to give an answer relevant to the user input. We can feed this relevant context into the prompt as follows:
prompt = f"""What was the sales increase for Amazon in the first quarter based on the context below?
Context: ```{context1}```
"""
print(get_completion(prompt))
The answer given by the LLM is now the one we wanted as below, since it is the sales increase for Q1 2023:
The sales increase for Amazon in the first quarter was 9% based on the reported net sales of $127.4 billion compared to $116.4 billion in the first quarter of the previous year.
Augmented Generation
The steps discussed above are for preparing documents for when the user interacts with the RAG application by posing a query.
In this section we are going to look at how to use the chunked, embedded information as relevant context when the user queries the application. This step is to retrieve the context in real-time based on user input, and use the retrieved context to generate the LLM output. Let’s take the example that the user input is the question “What was the sales increase for Amazon in the first quarter?” based on the 10-Q Amazon document for Q1 2023. To answer this question, we have to first find the right contexts from the document chunks created above.
Let's define a create_context function for this. As you can see, the create_context function below requires three inputs — the user input query to embed, the dataframe containing the documents to find the subset of relevant context(s) to the user input, and the maximum context length, as shown in the figure below.
Retrieval And Generation
The logic here is to get the embeddings for the question as in the figure above, compute pairwise distances between the input query embedding, and context embeddings (step 2), and append these contexts, ranked by similarity (step 3). If the running context length is greater than the maximum context length, the context is truncated. Finally, both the user query and relevant context are sent to the LLM, for generating the output.
def create_context(question: str, df: pd.DataFrame, max_len: int = 1800) -> str:
    """ Create a context for a question by finding the most similar context from the dataframe """

    q_embeddings = get_embedding(question)  #A
    df['distances'] = df['embeddings'].apply(lambda x: spatial.distance.cosine(q_embeddings, x))  #B

    returns = []
    cur_len = 0

    for i, row in df.sort_values('distances', ascending=True).iterrows():  #C
        cur_len += row['n_tokens']  #D
        if cur_len > max_len:  #E
            break
        returns.append(row["text"])  #F

    return "\n\n###\n\n".join(returns)  #G

#A Get the embeddings for the question
#B Get the distances from the embeddings
#C Sort by distance and add the text to the context until the context is too long
#D Add the length of the text to the current length
#E If the context is too long, break
#F Else add it to the text that is being returned
#G Return the context

Here is the query and corresponding partial context created from running this line below:

create_context("What was the sales increase for Amazon in the first quarter", df)
AMAZON.COM ANNOUNCES FIRST QUARTER RESULTS\nSEATTLE — (BUSINESS WIRE) April 27, 2023 — Amazon.com, Inc. (NASDAQ: AMZN) today announced financial results \nfor its first quarter ended March 31, 2023. \n•\nNet sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022.\nExcluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the\nquarter, net sales increased 11% compared with first quarter 2022.\n•\nNorth America segment sales increased 11% year-over-year to $76.9 billion….
As you can see, the context is quite relevant. However, this is not formatted well. This is where the LLM shines as below, where the LLM can answer the questions from the created context:
def answer_question(df: pd.DataFrame, question: str):
    """ Answer a question based on the most similar context from the dataframe texts """

    context = create_context(question, df)

    prompt = f"""Answer the question based on the context provided.

    Question: ```{question}.```

    Context: ```{context}```
    """

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
Finally, here is the corresponding answer generated from the query, and dataframe:
answer_question(df, question="What was the sales increase for Amazon in the first quarter")
The sales increase for Amazon in the first quarter was 9%, reaching $127.4 billion compared to $116.4 billion in the first quarter of 2022.
Congratulations, you have now built your first RAG app. While this works well for questions where the answer is explicit within the text context, the answer is not always accurate when retrieved from tables. Let’s ask the question “What was the Comprehensive income (loss) for Amazon for the Three Months Ended March 31, 2022?” — where the answer is present in a table as $ 4,833 million as shown in the figure below:
Answer Within A Table
The answer from the application is:
The Comprehensive income (loss) for Amazon for the Three Months Ended March 31, 2022 was a net loss of $3.8 billion.
As you can see, it gave the net income (loss), instead of the comprehensive income (loss). This illustrates the limitations of the basic RAG architecture we built.
In the next blog, we will learn about advanced document extraction, chunking, and retrieval mechanisms that build on the concepts learnt here. We will learn about evaluating the quality of responses from our RAG application using various metrics. We will learn how to use different techniques, guided by evaluation results, and make iterative improvements to performance.
In the last two weeks, we've seen what a SOC is and the difference between a Security Operations Centre and a Security Optimization Centre. All these discussions pointed to one key aspect: monitoring and early detection. But how do we monitor effectively? The answer lies in “logs.” Logs are crucial for monitoring and detecting user or device behaviour. Cybercrime is on the rise, putting businesses of all sizes and types at risk. To combat this threat, having a reliable security platform is essential.
So, what tools can forward these logs? There are many options available, such as Splunk, Datadog, SentinelOne, and Wazuh. Our focus today will be on Wazuh.
Wazuh is a dynamic security platform designed to protect businesses of all sizes from cyber threats. With its user-friendly interface and rich feature set, it offers customized solutions to meet various business needs. It is an all-in-one, open-source tool that delivers extensive threat detection, visibility, and response capabilities.
Capabilities of Wazuh
Types of deployments provided:
All-In-One: This deployment integrates all security components into a single package: the Wazuh manager, Elastic Stack, and the Wazuh agent all run on one machine. Ideal for small to medium-sized environments where managing multiple servers separately isn’t necessary.
Distributed: Components are spread across different servers. The Wazuh manager, Elastic Stack, and agents operate on separate machines. Each part does a specific job. This is good for larger systems because it can handle more data and work faster. It’s like having a team where everyone has a role.
Centralized: The Wazuh manager and Elastic Stack are centralized, while the agents are deployed across various endpoints. Best for environments with numerous endpoints, ensuring centralized control and monitoring while distributing agents. All the information from different parts of your system comes to one place. It’s useful for managing and seeing everything in one spot.
Cluster: Involves multiple Wazuh managers working together in a cluster, providing redundancy and load balancing. This is similar to distributed but with a focus on high availability. It means Wazuh can keep working even if one part fails. It’s like having backup systems in place. Critical for large-scale enterprises requiring high availability and disaster recovery capabilities.
Wazuh Components
1. Data Collection and Monitoring
Wazuh Agent: The Wazuh Agent is installed on endpoint devices to gather security-related data from sources like system logs, network traffic, and system processes. It works on various devices, including servers, desktops, laptops, and mobile devices. The agent sends this data to the Wazuh Manager for analysis, either in real-time or at scheduled intervals. It is lightweight and has minimal impact on system performance, making it suitable for endpoint devices.
Filebeat: Filebeat is a lightweight tool used to collect and forward log data to different destinations. In the Wazuh context, it collects log data from endpoints and sends it to the Wazuh server for analysis. The Wazuh server has pre-built Filebeat modules for common data sources like Apache web server logs, MySQL database logs, and system logs. These modules are easily configurable to start collecting data. Additionally, Wazuh provides a custom Filebeat module for collecting Windows event logs, which can then be forwarded to the Wazuh server for analysis.
2. Central Management and Analysis
Wazuh Manager: The Wazuh Manager is the central control point for the Wazuh platform. It handles the management of agents, rulesets, and notifications. It collects data from agents, analyzes it, and stores it in a database. The manager can send alerts to administrators via email, SMS, or other methods. It features a web-based interface for administrators to manage the platform, providing dashboards, reports, and system settings. The manager is highly scalable and can support thousands of agents, making it suitable for large organizations.
Wazuh Server: The Wazuh Server analyzes data from the agents and triggers alerts when threats or anomalies are detected. It also remotely manages the agents’ configurations and monitors their status. The server uses threat intelligence to improve detection and enriches alert data with frameworks like MITRE ATT&CK and compliance requirements such as PCI DSS, GDPR, HIPAA, CIS, and NIST 800–53. This provides helpful context for security analytics. The server can integrate with external software like ServiceNow, Jira, PagerDuty, and Slack to streamline security operations. It consists of components for enrolling new agents, validating agent identities, and encrypting communications between agents and the server.
Analysis Engine: The Analysis Engine is the server component responsible for analyzing data. It uses decoders to identify the type of information being processed, such as Windows events, SSH logs, and web server logs. These decoders extract relevant data elements from log messages, like source IP addresses, event IDs, and usernames. The engine then applies rules to identify patterns in the decoded events that could trigger alerts or call for automated countermeasures, such as banning an IP address, stopping a process, or removing malware.
Wazuh Ruleset: The Wazuh Ruleset is a collection of rules designed to detect and alert on security events. It is customizable and can be tailored to meet the specific needs of an organization. The ruleset contains over 2000 rules covering a wide range of security events, including malware, suspicious behavior, and system vulnerabilities. It is regularly updated to stay effective against the latest threats. Organizations can also create custom rules to address their specific security requirements. The ruleset is crucial for detecting security threats within the Wazuh platform.
3. Data Storage and Indexing
Wazuh Indexer: The Wazuh Indexer is a powerful search and analytics engine designed to handle large amounts of data. It indexes and stores alerts from the Wazuh server, enabling quick data searches and analytics. The indexer can operate as a single-node or multi-node cluster, making it scalable and reliable. It stores data as JSON documents, with each document containing key-value pairs. These documents are organized into collections called indexes and are distributed across containers known as shards. By spreading documents across multiple shards and nodes, the Wazuh Indexer ensures redundancy and protects against hardware failures, while also improving query performance.
Archival Data Storage: The Wazuh server stores both alerts and other events in files. These files can be in JSON (.json) or plain text (.log) format. The files are compressed and signed daily using MD5, SHA1, and SHA256 checksums to ensure data integrity. This method provides an additional layer of data storage and security beyond what is sent to the Wazuh Indexer.
4. User Interface and Visualization
Wazuh App: The Wazuh App is a user-friendly interface that lets you access data collected by Wazuh. With this web-based app, you can view alerts, manage agents, and customize rulesets. It offers dashboards displaying key security metrics like the number of alerts, agents, and the severity of alerts. The app also provides customizable reports to meet specific needs. It is highly customizable, allowing organizations to tailor it to their specific requirements.
Wazuh Dashboard: The Wazuh Dashboard is a web interface for analyzing and visualizing security events and alerts. It helps in managing and monitoring the Wazuh platform. The dashboard supports role-based access control (RBAC) and single sign-on (SSO). It includes pre-configured dashboards for regulatory compliance (such as PCI DSS, GDPR, HIPAA, and NIST 800–53) and provides an interface to navigate the MITRE ATT&CK framework and related alerts.
5. Integration and Extension
Wazuh API: The Wazuh API is a RESTful interface for accessing data stored in the Wazuh database. It allows developers to create custom applications using the data collected by Wazuh. Through the API, you can retrieve information about agents, alerts, and events, as well as manage rulesets and notifications. The API is well-documented and supports multiple programming languages, including Python, Ruby, and Java, making it easy for developers to integrate Wazuh into existing security systems.
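To give an idea of what working with the API looks like, here is a minimal Python sketch. It assumes a Wazuh 4.x manager reachable at https://localhost:55000 with placeholder credentials wazuh/wazuh; the endpoint paths and response fields follow the publicly documented API and may differ between versions.

import requests
import urllib3

urllib3.disable_warnings()  # the default API certificate is self-signed

WAZUH_API = "https://localhost:55000"   # placeholder manager address
API_USER, API_PASS = "wazuh", "wazuh"   # placeholder credentials

# Authenticate and obtain a JWT token
auth = requests.post(f"{WAZUH_API}/security/user/authenticate",
                     auth=(API_USER, API_PASS), verify=False)
token = auth.json()["data"]["token"]

# List registered agents using the token
headers = {"Authorization": f"Bearer {token}"}
agents = requests.get(f"{WAZUH_API}/agents", headers=headers, verify=False)
for agent in agents.json()["data"]["affected_items"]:
    print(agent["id"], agent.get("name"), agent.get("status"))

The same pattern (authenticate once, then send Bearer-authenticated requests) applies to the other endpoints, such as those for rules and alerts.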
6. Scalability and High Availability
Wazuh Cluster Daemon: The Wazuh Cluster Daemon enables horizontal scaling of Wazuh servers by deploying them as a cluster. This setup, combined with a network load balancer, provides high availability and load balancing. The cluster daemon allows Wazuh servers to communicate and stay synchronized, ensuring seamless operation and reliability.
7. Security and Access Control
Agent Enrollment Service: The Agent Enrollment Service is responsible for enrolling new agents in the Wazuh system. It assigns and distributes unique authentication keys to each agent. This service operates as a network service and supports authentication through TLS/SSL certificates or a fixed password.
Agent Connection Service: The Agent Connection Service handles data received from agents. It uses the keys provided by the enrollment service to verify each agent’s identity and secure communications between the Wazuh agent and server. Additionally, it allows centralized management of agent configurations, enabling remote updates of agent settings.
Advantages of Wazuh
Next week, we will learn how to create rules and trigger logs. For this, you will need Ubuntu (as a Wazuh agent) and RHEL v8 or 9 (as a Wazuh server). If you have these prerequisites, you can follow along with the next steps.
Everyone has to start with the first steps in understanding the complex world of cybersecurity. In today’s digital landscape, where threats are ever-evolving, it’s essential to grasp how foundational defense mechanisms work to protect vulnerable applications. By diving into hands-on labs and real-world scenarios, we can learn how to deploy, configure, and optimize tools like Web Application Firewalls (WAFs), Intrusion Prevention Systems (IPSs), and Security Information and Event Management (SIEM) systems. These practical experiences lay the groundwork for mastering cybersecurity and emphasize the importance of integrating strong security practices from the development phase through deployment.
Architecture
Defense is one of the most important approaches in Cyber Security. To offer a comprehensive defense mechanism, three key points should be considered:
Monitoring — the collection and analysis of different metrics.
Detection — the identification of potential threats or malicious activity.
Response — the measures taken to prevent potential damage caused by a threat.
These elements are applicable to almost any information system. Let’s consider this scenario: we have one or several web applications that need to be defended. To address this challenge, we can use a web application firewall to monitor, detect, and prevent malicious actions. However, sometimes we need to enhance effectiveness by setting up more specific components such as Intrusion Prevention Systems (IPS) and Security Information and Event Management (SIEM) systems, in conjunction with the firewall.
An example of such a model might include the following components:
PfSense — Firewall
Suricata — IPS
Squid Proxy — To hide the Trusted Zone under the firewall.
Splunk — SIEM
bWAPP — Example of a vulnerable web application
The environment can be set up using virtualization technologies, Oracle VirtualBox in our case.
Web Application Firewall
First, the required machines should be set up. The PfSense image can be downloaded from the official website: https://www.pfsense.org/download/. Then, it should be installed as a BSD system in VirtualBox:
We will need two network adapters:
Bridged Adapter — responsible for public internet connectivity. In PfSense, this will be assigned as WAN.
Internal Network — responsible for connectivity of the trusted zone. In PfSense, this will be assigned as LAN.
After setup, it should look like this:
We obtained the IP address 192.168.0.200 for the WAN interface and 192.168.1.1 for the LAN interface. The LAN interface will act as the gateway for the rest of the components. PfSense provides routing functionalities as well as a DHCP server out of the box.
For our future needs, we need to enable additional services:
Squid Proxy — to provide proxy services.
Suricata — to serve as an IDS (Intrusion Detection System).
Both packages can be installed using the integrated add-ons store. Additionally, we need to ensure that remote syslog forwarding is enabled so that all logs from the interfaces are sent to the SIEM.
Suricata
When configuring Suricata, we can enable some basic rules, such as those designed to protect against SQL injections:
…and, of course, enable it in prevention mode:
Splunk
The next step is to set up the SIEM. First, a virtual machine (VM) with an operating system needs to be created. To simplify this process, you can use a ready-made VM image from OSBoxes. To conserve computing resources, the Lubuntu 22.04 image was selected. The network adapter should be set to Internal Network.
After properly creating the app, we can enable it in Splunk:
Web Application
The final step is to configure the target web application, which is bWAPP. A ready-made VM image can be found on VulnHub: https://www.vulnhub.com/entry/bwapp-bee-box-v16,53/. It should be deployed using only the Internal Adapter and will be assigned the IP address 192.168.1.3.
Squid Proxy
Now, it’s time to test all requests to and from the Trusted Zone clients, ensuring they are properly proxied within the firewall.
It is possible to enforce the use of custom certificates for proxying. Without the certificate, clients will not be able to access the public internet. The certificate can be generated through the PfSense interface.
Testing
In order to test the system, we can send a request containing a SQL injection to the web application from an external host in the Public Internet zone.
In PfSense it is possible to monitor how requests are blocked:
In the SIEM corresponding events are available as well:
However, it is possible to apply our created app as a filter and print the results in a more compact form:
Conclusion
The given example demonstrated how properly configured defense mechanisms can prevent dangerous attacks, even with a completely vulnerable application. However, this does not mean that WAFs, IPSs, and SIEMs can replace the need for proper security testing and hardening of the application during the development and deployment process.
This type of lab helps users understand how applications are deployed, as well as the underlying network topologies and protocols. Therefore, it is crucial to allocate time for such labs.
LDAP (Lightweight Directory Access Protocol) is an open, vendor-neutral, industry-standard application protocol for accessing and maintaining distributed directory information services over an Internet Protocol (IP) network. Directories organized via LDAP typically follow a hierarchical model and store information such as user profiles, authentication details, and other attributes necessary for various applications, services, and organizations.
How LDAP Works
LDAP operates by connecting to a directory service which then responds to queries made by client applications. Common use cases include querying user information in a corporate environment, retrieving organizational data, and managing network resources. LDAP directories are structured in a tree-like hierarchy called the Directory Information Tree (DIT).
Key Components:
1. Entries: Fundamental unit containing information in the directory.
2. Attributes: Characteristics of entries, defined by attribute types.
3. Distinguished Name (DN): Unique identifier for each entry in the directory.
4. Schema: Defines the structure of entries and attributes.
Directory Database Structure
A directory database structure refers to the organization and layout of data within a directory service, which is a specialized database used for storing and managing information about users, groups, resources, and other objects within a computer network. One common example of a directory service is the Lightweight Directory Access Protocol (LDAP), which is widely used for authentication, authorization, and information retrieval in networks.
The structure of a directory database typically follows a hierarchical model, similar to the structure of a tree. At the top level of the hierarchy is the root directory, which contains subdirectories, also known as organizational units (OUs). Each OU can further contain sub-OUs or leaf objects, such as users, groups, or resources.
Directories use various attributes to describe the objects they contain. These attributes can include information such as names, addresses, phone numbers, email addresses, group memberships, and access permissions. The schema of a directory defines the attributes and object classes that can be used to describe objects within the directory.
An LDIF (LDAP Data Interchange Format) template is a text-based format used to represent directory data in a human-readable and machine-readable form. LDIF files are commonly used to import or export data to and from directory services. LDIF templates provide a structured format for specifying the attributes and values of directory objects.
Here’s an example of an LDIF template for creating a user object:
dn: cn=John Doe,ou=Users,dc=example,dc=com
objectClass: inetOrgPerson
cn: John Doe
sn: Doe
givenName: John
mail: john.doe@example.com
userPassword: password123
In this LDIF template:
• dn (Distinguished Name) specifies the unique identifier for the user object within the directory hierarchy.
• objectClass specifies the type of object being created. In this case, it’s an inetOrgPerson, which is a standard LDAP object class for representing individuals.
• cn, sn, givenName, mail, and userPassword are attributes of the inetOrgPerson object class, representing common properties such as common name, surname, given name, email address, and password.
LDIF templates can be customized to include additional attributes or object classes as needed to represent different types of directory objects. They provide a flexible and standardized way to manage directory data across different directory services and applications.
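For instance, the entry described by the LDIF template above could also be created programmatically. Here is a minimal sketch using the Python ldap3 library; the server address, admin DN, and password are placeholders.

from ldap3 import Server, Connection, ALL

# Placeholder server and admin credentials
server = Server('ldap://example.com', get_info=ALL)
conn = Connection(server, user='cn=admin,dc=example,dc=com',
                  password='adminpassword', auto_bind=True)

# Add the user entry from the LDIF template above
conn.add(
    'cn=John Doe,ou=Users,dc=example,dc=com',
    object_class='inetOrgPerson',
    attributes={
        'cn': 'John Doe',
        'sn': 'Doe',
        'givenName': 'John',
        'mail': 'john.doe@example.com',
        'userPassword': 'password123',
    },
)
print(conn.result)   # reports whether the add succeeded

conn.unbind()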
Basic LDAP Operations
LDAP supports a variety of operations, including:
• Bind: Authenticates and specifies the LDAP protocol version.
• Search and Compare: Queries the directory and compares attributes.
• Modify: Alters entries.
• Add and Delete: Adds new entries or removes existing ones.
• Modify DN: Changes an entry’s DN (a short ldap3 sketch of some of these operations follows below).
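To make a few of these operations concrete, here is a minimal sketch using the Python ldap3 library; the server address, credentials, and entry DN are placeholders.

from ldap3 import Server, Connection, MODIFY_REPLACE

# Bind: authenticate against a placeholder directory as an admin user
conn = Connection(Server('ldap://example.com'),
                  user='cn=admin,dc=example,dc=com', password='adminpassword',
                  auto_bind=True)

dn = 'cn=John Doe,ou=Users,dc=example,dc=com'

# Modify: replace the mail attribute of an existing entry
conn.modify(dn, {'mail': [(MODIFY_REPLACE, ['john.doe@corp.example.com'])]})

# Delete: remove the entry entirely
conn.delete(dn)

conn.unbind()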
Understanding LDAP Injection: LDAP injection occurs when an attacker manipulates user inputs to control the construction of LDAP queries executed by the application. These queries are responsible for interacting with LDAP directories to authenticate users, retrieve information, or perform administrative tasks. By injecting malicious LDAP filter strings, attackers can subvert the intended functionality of the application and gain unauthorized access to sensitive data or escalate privileges.
Recognizing User Input in LDAP Queries: As users interact with web applications, they often provide inputs through various forms and fields. In the context of LDAP injection, these inputs can be leveraged by attackers to inject LDAP special characters and manipulate the underlying LDAP queries. To identify whether user input is embedded in an LDAP query, users can pay attention to the application’s response to their inputs. Here are some expected responses that may indicate user input is being processed as part of an LDAP query:
1. Immediate Error Messages:
If the application displays error messages immediately after submitting user input, it could indicate that the input is directly incorporated into LDAP queries without proper validation.
LDAP Query Error: The application may generate an error message related to a failed LDAP query execution. For example: “LDAP Query Failed: Invalid search filter.”
Internal Server Error: A generic message indicating that something went wrong on the server side. This might not specifically mention LDAP, but it could indicate a problem related to LDAP query processing.
2. Unusual Application Behavior:
• Users may notice unexpected behavior in the application, such as pages loading slowly or becoming unresponsive, which could suggest that their input is triggering LDAP query execution.
3. Incomplete or Inconsistent Results:
• In some cases, users may receive incomplete or inconsistent search results when querying for specific information. This inconsistency may indicate that the application is vulnerable to LDAP injection and is processing user input in LDAP queries.
4. Abnormal Response Times:
• Significant delays in response times after submitting certain inputs may imply that the application is processing LDAP queries asynchronously or executing multiple queries in the background.
5. Changes in Page Content:
• Users might observe unexpected changes in the content of web pages, such as additional information being displayed or elements disappearing, which could signal successful injection of LDAP filter strings.
Here is a Python script that takes a user’s full name as input to search for their identity in an LDAP directory:
from ldap3 import Server, Connection, ALL
def search_user(full_name):
    # Establish a connection to the LDAP server
    server = Server('ldap://example.com', get_info=ALL)

    # Create a connection object and bind to the LDAP server using admin credentials
    conn = Connection(server, user='cn=admin,dc=example,dc=com', password='adminpassword')
    conn.bind()

    # Construct an LDAP search filter based on the provided full name
    search_filter = f'(cn={full_name})'

    # Perform an LDAP search with the constructed filter
    conn.search('ou=users,dc=example,dc=com', search_filter, attributes=['cn', 'mail', 'title'])

    # Check if the search returned any entries
    if conn.entries:
        for entry in conn.entries:
            print(f"User: {entry.cn}, Email: {entry.mail}, Title: {entry.title}")
    else:
        print("No users found")

    # Unbind from the LDAP server
    conn.unbind()
# Simulate user input search_user('John Doe')
In this code, the application allows searching for users by their full name (cn attribute).
Potential Vulnerabilities
This code is vulnerable to LDAP injection if the full_name parameter is not properly sanitized. An attacker can exploit this vulnerability by injecting LDAP-specific payloads into the full_name input.
Payload Examples
1. Enumerating All Users
• User Input (Full Name): *)(cn=*)
• Resulting Search Filter: (cn=*)(cn=*)
This payload will retrieve all users because the injected filter will match any entry with a cn attribute.
Types of LDAP injection:
1. Authentication Bypass
Description: Attackers modify the LDAP query to bypass authentication checks.
Example Payload:
• User Input: *)(uid=*)
• Original Query: (uid={user_input})
• Resulting Query: (uid=*)(uid=*)
Steps:
1. The application constructs an LDAP query using user input to authenticate users.
2. If the user input is not sanitized, an attacker can input *)(uid=*).
3. The resulting LDAP query becomes (uid=*)(uid=*), which always returns true for any uid.
4. The application might treat this as a successful authentication for any user, bypassing the authentication mechanism.
2. Authorization Bypass
Description: Attackers manipulate queries to gain unauthorized access to resources.
Example Payload:
• User Input: *)(memberOf=cn=admin,ou=groups,dc=example,dc=com)
If the username and password aren’t sanitized, an attacker can use special characters to manipulate the query.
Let’s try a random username and password and observe the response: it indicates that only an admin account is available.
Injection example: using the admin account along with * as the password, the query becomes:
search_filter = "(&(cn=admin)(sn=*))"
This query will match an entry in the LDAP directory, effectively bypassing authentication and logging the attacker in as the admin user.
Mitigation Strategies
To prevent LDAP injection attacks, consider the following mitigation strategies:
• Input Validation: Validate and sanitize all user inputs to ensure they conform to expected formats and do not contain harmful characters.
• Parameterization: Use parameterized queries to separate user input from the query logic.
• Escaping: Properly escape special characters in user inputs to prevent query manipulation (see the sketch after this list).
• Least Privilege Principle: Limit the privileges of the LDAP account used by the application to reduce the impact of potential injections.
• Regular Audits: Conduct regular security audits and penetration testing to identify and address LDAP injection vulnerabilities.
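As an illustration of the escaping strategy, here is a minimal sketch using the escape_filter_chars helper from the ldap3 library to neutralize LDAP metacharacters before user input reaches the search filter:

from ldap3.utils.conv import escape_filter_chars

def safe_search_filter(full_name: str) -> str:
    # Escape LDAP metacharacters such as * ( ) \ before embedding user input
    return f'(cn={escape_filter_chars(full_name)})'

# A payload like '*)(cn=*)' is escaped instead of altering the query structure
print(safe_search_filter('*)(cn=*)'))   # -> (cn=\2a\29\28cn=\2a\29)

Combined with input validation and a least-privilege bind account, this prevents the enumeration and bypass payloads shown earlier from changing the structure of the query.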
By understanding and mitigating these types of LDAP injection, organizations can better protect their LDAP directories and applications from malicious exploitation.
Conclusion
LDAP injection remains a significant security vulnerability that can have severe implications for applications that rely on LDAP for user authentication and directory services. Understanding the mechanics of LDAP queries and the potential for exploitation is crucial for developers and security professionals alike.
By implementing best practices such as input validation, using prepared statements, and employing robust error handling, organizations can effectively mitigate the risks associated with LDAP injection. Regular security assessments and code reviews are essential to ensure that applications are resilient against such attacks.
Selenium lets me interact with web pages just like a regular user would. I can click buttons, fill out forms, and even handle content that loads after the page has initially loaded. It’s especially useful when I need to scrape data from complex websites that other tools can’t handle.
What is Selenium?
Selenium is an open-source automation tool primarily used for testing web applications. It mimics the actions of a real user interacting with a website, making it an excellent choice for scraping dynamic pages that rely heavily on JavaScript.
Unlike static HTML pages, where data can be easily retrieved using traditional scraping methods like BeautifulSoup or Scrapy, dynamic pages require a more robust solution to render and interact with the content — this is Selenium’s strength.
Why Use Selenium for Web Scraping?
Handling JavaScript: Many modern websites load content dynamically using JavaScript. Traditional scraping tools often fail here because they only retrieve the initial HTML. Selenium, however, can execute JavaScript, allowing you to scrape data that appears only after the page has fully loaded.
User Interaction Simulation: Selenium can simulate user interactions like clicking buttons, filling forms, and scrolling pages. This is crucial for scraping data that requires such interactions, like loading additional content through infinite scroll.
Headless Browsing: Selenium supports headless browsing, which means you can run the browser without a graphical user interface (GUI). This is especially useful for running automated scraping scripts in production environments.
Best Alternatives to Selenium
Web scraping with APIs — Using APIs for web scraping can save a lot of time and resources, read more here.
Web scraping with Node.js — One of the easiest ways to scrape websites, read more here.
Web scraping with AI — What’s better than utilizing the power of AI to improve your web scraping operations? Read more here.
Using web scraping tools — Use dedicated web scraping tools that will help you save time and money. Read more here.
Setting Up Selenium
Before diving into examples, you need to set up Selenium in your Python environment. Here’s a quick guide:
Install Selenium:
pip install selenium
Download a WebDriver: Selenium requires a WebDriver to interact with browsers. WebDrivers are specific to each browser (e.g., ChromeDriver for Google Chrome, GeckoDriver for Firefox).
Setting Up the WebDriver: After downloading, ensure that the WebDriver is accessible through your system’s PATH. Alternatively, you can specify the WebDriver’s path directly in your script.
Basic Web Scraping Example
Now, let’s dive into a basic example where we’ll scrape some data from a website using Selenium.
Step 1: Import the Required Libraries
from selenium import webdriver
from selenium.webdriver.common.by import By
Step 2: Set Up the WebDriver
# Make sure to replace '/path/to/chromedriver' with the actual path to your ChromeDriver.
# Selenium 4 expects the driver path via a Service object rather than executable_path.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
Step 3: Open the Web Page
driver.get("https://example.com")
Step 4: Interact with the Web Page
# Let’s assume we want to scrape all article titles from a blog page
titles = driver.find_elements(By.CLASS_NAME, 'article-title')
for title in titles:
    print(title.text)
Step 5: Close the Browser
driver.quit()
This simple script demonstrates how to open a web page, locate elements by their class name, and extract text from them.
Handling Dynamic Content
One of Selenium’s biggest advantages is handling dynamic content. Websites often load content after a delay or based on user interactions like scrolling or clicking a button. Here’s how to deal with such scenarios:
Example: Scraping Data After Scrolling
Some websites load additional content when you scroll down the page. Selenium can simulate scrolling, enabling you to scrape all the data, not just what’s initially visible.
from selenium.webdriver.common.keys import Keys
import time

# Scroll down the page
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)

# Wait for content to load
time.sleep(2)  # Adjust the sleep time based on the website's loading speed

# Scrape the newly loaded content
new_content = driver.find_elements(By.CLASS_NAME, 'new-content-class')
for item in new_content:
    print(item.text)
Handling Form Submissions and Button Clicks
Selenium allows you to interact with various elements on the page, such as forms and buttons. Here’s an example where we simulate a form submission:
# Locate the input fields and submit button
username = driver.find_element(By.NAME, 'username')
password = driver.find_element(By.NAME, 'password')
submit_button = driver.find_element(By.ID, 'submit')

# Enter data into the form fields
username.send_keys("myUsername")
password.send_keys("myPassword")

# Click the submit button
submit_button.click()

# Wait for the next page to load
time.sleep(3)

# Scrape data from the next page
result = driver.find_element(By.ID, 'result')
print(result.text)
Dealing with Pop-ups and Alerts
Web pages often contain pop-ups or alerts that can interfere with your scraping. Selenium can handle these as well:
# Handling an alert pop-up
alert = driver.switch_to.alert
alert.accept()     # To accept the alert
# alert.dismiss()  # To dismiss the alert
Headless Browsing for Faster Scraping
Running a browser in headless mode can speed up the scraping process, especially when running scripts on a server. Here’s how to set it up:
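Below is a minimal sketch using Chrome; note that with newer Chrome and Selenium versions the flag may need to be --headless=new instead of --headless.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")                # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")   # give pages a realistic viewport

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()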
Best Practices for Web Scraping with Selenium
While Selenium is a powerful tool, it’s important to follow best practices to avoid issues:
Respect Website’s Robots.txt: Before scraping, check the website’s robots.txt file to ensure you’re not violating their policies.
Use Random Delays: To avoid detection as a bot, use random delays between actions:
import random

time.sleep(random.uniform(2, 5))
Avoid Overloading the Server: Don’t make too many requests in a short time. This can overload the server and get your IP banned.
Rotate IPs and User-Agents: For large-scale scraping, consider rotating IP addresses and user-agent strings to reduce the risk of being blocked.
Handle Exceptions Gracefully: Always handle exceptions like timeouts and element-not-found errors so that your script doesn’t crash (a short sketch covering this and user-agent rotation follows this list).
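As a rough illustration of the last two points, here is a sketch that picks a random user-agent string (the strings and the class name are just placeholders) and wraps the scraping step in exception handling:

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException

# Example pool of user-agent strings; replace with your own list
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0 Safari/537.36",
]

options = Options()
options.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com")
    # 'article-title' is a placeholder class name for whatever you are scraping
    for title in driver.find_elements(By.CLASS_NAME, "article-title"):
        print(title.text)
except (TimeoutException, NoSuchElementException) as exc:
    # Skip this page instead of crashing the whole run
    print(f"Scrape step skipped: {exc.__class__.__name__}")
finally:
    driver.quit()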
Conclusion
Web scraping with Selenium gives me the power to pull data from complex and dynamic websites. It’s a bit trickier to learn than some other tools, but the payoff is huge. With Selenium, I can mimic real user actions, which makes it a game-changer for anyone diving into data science or web development. By sticking to best practices and really getting the most out of Selenium, I can create strong, reliable scrapers that fit exactly what I need.
You’ve already started your Python project by following the best practices I mentioned in the previous article, right?
Now, it’s time to focus on maintaining your project. This involves keeping it organized, functional, and easy to share and scale.
1. requirements.txt
The requirements.txt file lists all the dependencies your project needs to run. It allows others to recreate your environment and makes it easy to automate installs in pipelines.
Steps:
To generate the file, use the command:
pip freeze > requirements.txt
To install dependencies from the file, use:
pip install -r requirements.txt
2. Update Your README.md and MkDocs
A well-documented project saves everyone from confusion.
README.md should contain:
Project Description
Installation Instructions and How to Run
Usage Examples
Contribution Guidelines (Open Source)
MkDocs is a tool for creating professional documentation sites:
Install MkDocs:
pip install mkdocs
# or
poetry add mkdocs
Create a documentation project:
mkdocs new project-docs
Serve the site locally with:
cd project-docs
mkdocs serve
Deploy to GitHub Pages with:
mkdocs gh-deploy
Now you will have a well-documented project!
3. Pre-commit
Pre-commit hooks automate checks before allowing changes to be committed.
This can save you a lot of time and enforce code quality by preventing common errors.
Steps:
Install Pre-commit with:
pip install pre-commit
Add a .pre-commit-config.yaml file with hooks for tasks like fixing trailing whitespace or linting.
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: check-yaml            # Ensures your YAML files are valid.
      - id: end-of-file-fixer     # Ensures there is exactly one newline at the end of files.
      - id: trailing-whitespace   # Removes trailing whitespace from files.
      - id: debug-statements      # Warns about leftover print() or pdb statements.
  - repo: https://github.com/psf/black
    rev: 23.9.1
    hooks:
      - id: black                 # Formats your Python code according to the Black code style.
        language_version: python3
Install the hooks using:
pre-commit install
Test the Pre-Commit hooks:
pre-commit run --all-files
Expected return:
(venv) PS C:\WVS\project-docs> pre-commit run --all-files
check yaml...............................................................Passed
fix end of files.........................................................Passed
trim trailing whitespace.................................................Passed
debug statements (python)................................................Passed
black....................................................................Passed
Now, every commit will run these checks automatically.
4. Docker
If you’re sharing your project with a requirements.txt file, you should also consider shipping it with Docker.
Making your projects Docker-first is the most effective way to ensure that others can easily replicate and run them on their machines.
It also guarantees consistent environments across development and testing stages.
Steps:
Create a Dockerfile defining the project’s environment and dependencies.
# Use the official Python image as the base image
FROM python:3.12-slim

# Set the working directory inside the container
WORKDIR /app

# Copy the requirements file and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code into the container
COPY src/ ./src/

# Expose the port the app runs on
EXPOSE 5000

# Set the default command to run the application
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "src.app:app"]
Build the Docker Image:
docker build -t api .
Run the Docker Container:
docker run -p 5000:5000 --env-file .env api
Tip: Add a .dockerignore file to prevent unnecessary files from being copied into the Docker image.
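For example, a minimal .dockerignore might exclude entries like these (adjust to your own project layout):

.git
.venv/
venv/
__pycache__/
*.pyc
.env
tests/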
5. Test-Driven Development (TDD)
Test-Driven Development (TDD) involves writing tests before implementing the functionality.
Instead of testing everything manually, you write tests that simulate those actions for you.
Steps:
Install Testing Dependencies:
pip install pytest
# or
poetry add pytest
Create a tests/ Directory
Here is an example from my code:
import sys
import os

import pytest

# Add the 'src' directory to the Python path
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '../src')))
from app import app
@pytest.fixture
def client():
    app.config['TESTING'] = True
    with app.test_client() as client:
        yield client

def test_api_is_up(client):
    # Send a GET request to the root endpoint
    response = client.get('/')

    # Assert that the response status code is 200 (OK)
    assert response.status_code == 200
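With the test in place, run the whole suite from the project root; pytest automatically discovers files named test_*.py inside the tests/ directory:

pytest -v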