Deep Dive into LangGraph’s Data Enrichment Template: Crafting a Highly Pythonic Future Deployment Solution

KevinLuo

Sep 20, 2024 · 7 min read

Please follow my GitHub repository (https://github.com/kevin801221/LLMs_Amazing_courses_Langchain_LlamaIndex) or encourage me by giving it a star ⭐️. I update it with the latest theory and applications at least weekly.

In today’s AI-driven world, leveraging Large Language Models (LLMs) to handle open-ended research tasks and structuring the results into databases or spreadsheets is becoming increasingly prevalent. One powerful tool that facilitates this process is LangGraph.

In this article, we’ll explore how to use LangGraph’s Data Enrichment Template to build a highly Pythonic data enrichment agent. This agent automates information gathering from the web and structures it according to a user-defined JSON schema, making it ready for future deployment.

LangGraph documentation:

https://www.langchain.com/langgraph

What Is the LangGraph Data Enrichment Template?

LangGraph is a robust framework for building LLM-powered agents. Its Data Enrichment Template is a general-purpose template designed to help developers create agents that automatically collect information from the web and structure the results into a user-defined JSON format.

Key Features:

  • Accepts a research topic and extraction schema as input.
  • Searches the web for relevant information.
  • Extracts key details from websites.
  • Organizes findings into the desired structured format.
  • Validates the gathered information for completeness and accuracy.

Getting Started

1. Set Up the Environment

First, ensure you have LangGraph Studio set up. LangGraph Studio is distributed as a desktop application rather than a pip package; download it from the LangChain site. For local development you will also want the LangGraph CLI, which you can install with:

pip install langgraph-cli

Next, create a .env file in your project root:

cp .env.example .env

In the .env file, define the required API keys. The primary search tool used is Tavily, and you'll need to obtain an API key from Tavily's website (https://tavily.com).

2. Configure the Model

The default model configuration is:

model: anthropic/claude-3-5-sonnet-20240620

Using Anthropic Models

ANTHROPIC_API_KEY=your-api-key

Using OpenAI Models

  • Sign Up for an OpenAI API Key: Visit OpenAI’s website.
  • Add the API Key to Your .env File:
OPENAI_API_KEY=your-api-key

3. Define the Research Topic and Extraction Schema

Example Research Topic:

“Top 5 Chip Providers for LLM Training”

Example Extraction Schema (extraction_schema):

{
  "type": "object",
  "properties": {
    "companies": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": {
            "type": "string",
            "description": "Company name"
          },
          "technologies": {
            "type": "string",
            "description": "Brief summary of key technologies used by the company"
          },
          "market_share": {
            "type": "string",
            "description": "Overview of market share for this company"
          },
          "future_outlook": {
            "type": "string",
            "description": "Brief summary of future prospects and developments in the field for this company"
          },
          "key_powers": {
            "type": "string",
            "description": "Which of the 7 Powers best describe this company's competitive advantage"
          }
        },
        "required": ["name", "technologies", "market_share", "future_outlook"]
      },
      "description": "List of companies"
    }
  },
  "required": ["companies"]
}
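To make the schema's role concrete, here is a minimal, stdlib-only sketch of the required-field check at the heart of validation. A full validator such as the jsonschema package also handles types and nesting; the helper and variable names below are illustrative, not part of the template.

```python
# Minimal, illustrative required-field check against a JSON-schema fragment.
def check_required(record: dict, schema: dict) -> list:
    """Return the required properties missing from `record`."""
    return [key for key in schema.get("required", []) if key not in record]

# The per-company "required" list from the schema above
company_schema = {
    "required": ["name", "technologies", "market_share", "future_outlook"],
}

sample = {
    "name": "NVIDIA",
    "technologies": "GPU accelerated computing, CUDA platform",
    "market_share": "Over 80% market share in AI chips",
}

print(check_required(sample, company_schema))  # → ['future_outlook']
```

In the real agent, a record failing this check would be sent back for another research pass rather than raising immediately.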

4. Input the Topic and Extraction Schema in LangGraph Studio

Open LangGraph Studio and input your research topic and extraction schema into the respective fields.

How to Customize

1. Customize Research Targets

Provide a custom JSON extraction schema when calling the graph to gather different types of information.

2. Select a Different Model

While the default is Anthropic’s claude-3-5-sonnet-20240620, you can choose a compatible chat model by configuring provider/model-name. For example:

model: openai/gpt-4o-mini

3. Customize the Prompt

You can update the default prompt in prompts.py via configuration to tailor the agent's behavior.

Example:

# prompts.py
DEFAULT_PROMPT = """
You are a highly skilled data enrichment agent specializing in web information gathering and structuring.
"""

4. Extend the Template

  • Add New Tools and API Connections: Include new Python functions in tools.py.
  • Add Additional Steps in graph.py: Insert new nodes and edges to enhance functionality.

Building Your First Data Enrichment Agent

Let’s walk through creating a data enrichment agent that gathers information about the “Top 5 Chip Providers for LLM Training” and structures the data according to the predefined extraction schema.

Step 1: Set Up the Project Environment

Create a new project directory:

mkdir langgraph-data-enrichment
cd langgraph-data-enrichment

Create and activate a Python virtual environment:

python3 -m venv venv
source venv/bin/activate  # For Windows, use venv\Scripts\activate

Install the required dependencies:

pip install langgraph langchain langchain-anthropic langchain-openai tavily-python jsonschema

Step 2: Configure the .env File

Add your API keys to the .env file:

OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
TAVILY_API_KEY=your-tavily-api-key
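If you run the agent outside LangGraph Studio, python-dotenv is the usual way to load these keys; a minimal stdlib loader (illustrative only, with no quoting or comment handling) looks like this:

```python
# Minimal .env-style loader using only the stdlib (a stand-in for python-dotenv).
import os


def load_env(text: str) -> None:
    """Parse KEY=value lines and put them into the process environment."""
    for line in text.splitlines():
        if "=" in line and not line.startswith("#"):
            key, _, value = line.partition("=")
            os.environ[key.strip()] = value.strip()


load_env("TAVILY_API_KEY=your-tavily-api-key")
print(os.environ["TAVILY_API_KEY"])  # → your-tavily-api-key
```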

Step 3: Define the Research Topic and Extraction Schema

Create a config.py file:

# config.py
RESEARCH_TOPIC = "Top 5 Chip Providers for LLM Training"
EXTRACTION_SCHEMA = {
    # Include the JSON schema provided earlier
}

Step 4: Build the Agent

Create an agent.py file. The template itself wires these steps into a graph in graph.py; the stand-alone sketch below approximates the same search → extract → validate flow using the Tavily Python client, the LangChain Anthropic chat model, and jsonschema. Treat it as a simplification, not the template's exact code:

# agent.py
import json
import os

from jsonschema import ValidationError, validate
from langchain_anthropic import ChatAnthropic
from tavily import TavilyClient

from config import RESEARCH_TOPIC, EXTRACTION_SCHEMA

# Initialize the language model and the search client
llm = ChatAnthropic(
    model="claude-3-5-sonnet-20240620",
    anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"),
)
tavily = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

def data_enrichment_agent(topic, extraction_schema):
    # Step 1: Search the web
    results = tavily.search(topic, max_results=5)["results"]

    # Step 2: Ask the model to extract structured information
    context = "\n\n".join(r["content"] for r in results)
    prompt = (
        f"Extract information about '{topic}' from the text below.\n"
        f"Respond only with JSON matching this schema:\n"
        f"{json.dumps(extraction_schema)}\n\nText:\n{context}"
    )
    final_data = json.loads(llm.invoke(prompt).content)

    # Step 3: Validate the result against the schema
    try:
        validate(instance=final_data, schema=extraction_schema)
    except ValidationError as exc:
        raise ValueError("Data does not conform to the extraction schema") from exc
    return final_data

if __name__ == "__main__":
    enriched_data = data_enrichment_agent(RESEARCH_TOPIC, EXTRACTION_SCHEMA)
    print(enriched_data)

Step 5: Implement Supporting Functions

Implement necessary methods or use LangGraph’s built-in functions for web searching, content reading, information extraction, and data validation.
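One supporting function that extraction pipelines almost always need is pulling a JSON object out of a raw model reply, since models often wrap JSON in prose. A small stdlib-only sketch (the helper name is illustrative):

```python
# Extract the first JSON object embedded in a model reply.
import json
import re


def extract_json(reply: str) -> dict:
    """Find a {...} span in the reply and parse it as JSON."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


reply = 'Here is the data:\n{"name": "NVIDIA"}'
print(extract_json(reply))  # → {'name': 'NVIDIA'}
```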

Step 6: Run the Agent and View Results

Execute the agent:

python agent.py

Sample Output:

{
  "companies": [
    {
      "name": "NVIDIA",
      "technologies": "GPU accelerated computing, CUDA platform",
      "market_share": "Over 80% market share in AI chips",
      "future_outlook": "Developing next-gen GPUs, expanding in AI and data centers",
      "key_powers": "Scale Economies, Network Economies, Cornered Resource"
    },
    {
      "name": "Intel",
      "technologies": "Xeon processors, FPGAs",
      "market_share": "Significant share in data center chips",
      "future_outlook": "Investing in AI accelerators, challenging GPU market",
      "key_powers": "Scale Economies, Branding"
    }
    // Additional companies...
  ]
}

Advanced Customization

Adding Custom Tools

To add sentiment analysis to the company descriptions:

# tools.py
from transformers import pipeline

# Load the model once at import time instead of on every call
sentiment_pipeline = pipeline("sentiment-analysis")

def sentiment_analysis(text):
    return sentiment_pipeline(text)

Integrate it into your agent:

from tools import sentiment_analysis

# In your data_enrichment_agent function
for data in extracted_data:
    data['sentiment'] = sentiment_analysis(data['technologies'])

Modifying the Prompt

Customize the LLM prompt for better extraction accuracy:

# prompts.py
CUSTOM_PROMPT = """
As an expert data extraction agent, extract the information according to the schema.
Schema:
{schema}
Text:
{content}
Provide the extracted information in JSON format.
"""

Deployment and Collaboration

Deploying with LangGraph Cloud

  • Sign Up: Visit LangGraph Cloud and create an account.
  • Deploy: Follow the deployment guides to make your agent accessible via API endpoints.

Integrating with LangSmith

For advanced tracing and team collaboration, integrate your agent with LangSmith.

Conclusion

By leveraging LangGraph’s Data Enrichment Template, we’ve crafted a highly Pythonic data enrichment agent capable of automating web information gathering and structuring. This agent is not only powerful but also flexible for future deployments and customizations.

Key Takeaways:

  • Flexibility: Easily customize research targets, models, and prompts.
  • Extensibility: Add new tools and steps to enhance functionality.
  • Deployability: Ready for deployment with LangGraph Cloud.

Next Steps

  • Explore LangGraph Documentation: https://langchain-ai.github.io/langgraph/
  • Experiment with Different Models: Try out various LLMs to optimize performance.
  • Collaborate with Teams: Use LangSmith for better collaboration.

Feel free to leave your thoughts and questions in the comments below. If you found this article helpful, please clap and share it with others!

*Follow me on Medium for more articles on AI and Pythonic solutions.*

Frequently Asked Questions (FAQs)

Q1: Can I use other LLMs apart from Anthropic’s Claude or OpenAI’s GPT-4?

A: Absolutely! LangGraph is designed to be model-agnostic. You can integrate any LLM that supports your required features. Just ensure you adjust your configurations accordingly.

Q2: Is it possible to use this agent for domains other than AI chip providers?

A: Yes, the agent is highly customizable. By changing the research topic and the extraction schema, you can tailor the agent to collect and structure data on virtually any subject.

Q3: How do I handle rate limits or API usage costs with the LLMs?

A: Both Anthropic and OpenAI have their own rate limits and billing structures. Be sure to monitor your usage and implement error handling to manage API limits gracefully.

Q4: Can I integrate this agent into an existing application?

A: Definitely! You can package the agent’s functionality into a module or API that can be integrated into larger applications or systems.

Q5: How can I ensure the data extracted is accurate and up-to-date?

A: Incorporate validation steps, use reliable data sources, and consider implementing time-stamped data retrieval to ensure freshness.

Additional Tips

Optimizing Performance

  • Caching Responses: Implement caching mechanisms to store and reuse responses from the LLM or web requests, reducing latency and API calls.
  • Asynchronous Processing: Use asynchronous programming to handle multiple web requests concurrently, improving efficiency.

Enhancing Data Quality

  • Data Cleaning: Apply data cleaning techniques to handle inconsistencies or missing values in the extracted data.
  • Error Handling: Implement robust error handling to catch exceptions and ensure the agent continues running smoothly.
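For transient failures such as rate limits or flaky network calls, a retry-with-backoff wrapper is one common error-handling pattern; this is an illustrative sketch, not the template's own error handling:

```python
# Retry a callable with linear backoff, re-raising after the last attempt.
import time


def with_retries(fn, attempts=3, delay=0.01):
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            time.sleep(delay * attempt)  # back off a little more each time


# Simulated flaky call: fails twice, then succeeds.
flaky_calls = {"n": 0}


def flaky():
    flaky_calls["n"] += 1
    if flaky_calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


print(with_retries(flaky))  # → ok
```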

Security Considerations

  • API Key Management: Store API keys securely using environment variables or a secrets manager.
  • Rate Limiting: Respect the terms of service for APIs used and implement rate limiting if necessary.

Final Thoughts

Building a data enrichment agent using LangGraph empowers you to automate complex data gathering tasks with ease. By adhering to Pythonic principles, you create a solution that is not only powerful but also elegant and maintainable.

The flexibility of LangGraph allows you to adapt the agent to various domains, making it a valuable tool for researchers, data scientists, and developers alike. As AI and machine learning continue to evolve, tools like LangGraph will play a crucial role in harnessing the vast amount of information available on the web.

Thank you for reading! If you have any questions or would like to share your experiences, please leave a comment below. If you found this article helpful, please clap and share it with others.

*Follow me on Medium for more articles on AI, Python, and data engineering.*

--

KevinLuo

Familiar with many kinds of data processing and with software and tools for turning data into BI or AI. I mainly use Python and R, with occasional C++. You can also find me on Instagram. Graduate of the AIA 9th-cohort managers' class, aspiring to become a great podcaster!