In today's digital landscape, content repurposing has become essential for maximizing reach and engagement. One effective strategy is transforming long-form content, such as blog posts, into engaging Twitter threads. However, manually creating these threads can be time-consuming and challenging. In this article, we'll explore how to build an application that automates blog-to-Twitter-thread creation using Google's Gemini-2.0 LLM, ChromaDB, and Streamlit.
Learning Objectives
- Automate blog-to-Twitter-thread transformation using Google's Gemini-2.0, ChromaDB, and Streamlit for efficient content repurposing.
- Gain hands-on experience building an automated blog-to-Twitter-thread pipeline with embedding models and AI-driven prompt engineering.
- Understand the capabilities of Google's Gemini-2.0 LLM for automated content transformation.
- Explore the integration of ChromaDB for efficient semantic text retrieval.
- Build a Streamlit-based web application for seamless PDF-to-Twitter-thread conversion.
- Gain hands-on experience with embedding models and prompt engineering for content generation.
This article was published as a part of the Data Science Blogathon.
What is Gemini-2.0?
Gemini-2.0 is Google's latest multimodal Large Language Model (LLM), representing a significant advancement in AI capabilities. It is now available as the gemini-2.0-flash-exp API in Vertex AI Studio. It offers improved performance in areas such as:
- Multimodal understanding, coding, complex instruction following, and function calling in natural language.
- Context-aware content creation.
- Complex reasoning and analysis.
- Native image generation, image editing, and controllable text-to-speech generation.
- Low-latency responses with the Flash variant.
For our project, we are specifically using the gemini-2.0-flash-exp model API, which is optimized for quick responses while maintaining high-quality output.
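Before building the full pipeline, here is a minimal, self-contained sketch (not part of the project code) of calling this model through the langchain-google-genai integration we install later; it assumes a valid GOOGLE_API_KEY is already set in the environment, and the prompt string is arbitrary.

from langchain_google_genai import ChatGoogleGenerativeAI

# Assumes GOOGLE_API_KEY is set in the environment
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp", temperature=0.7)
response = llm.invoke("Summarize retrieval-augmented generation in two sentences.")
print(response.content)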
What is the ChromaDB Vector Database?
ChromaDB is an open-source embedding database that excels at storing and retrieving vector embeddings. It is a high-performance database designed for efficiently storing, searching, and managing embeddings generated by AI models. It enables similarity search by indexing vectors and comparing them based on their proximity to other vectors in multidimensional space. Its key strengths include:
- Efficient similarity search capabilities
- Easy integration with popular embedding models
- Local storage and persistence
- Flexible querying options
- Lightweight deployment
In our application, ChromaDB is the backbone for storing and retrieving relevant chunks of text based on semantic similarity, enabling more contextual and accurate thread generation.
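To get a feel for what ChromaDB does on its own, here is a minimal standalone sketch using its default in-memory client and built-in embedding function; the collection name and documents are illustrative, not taken from the project.

import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) to persist
collection = client.create_collection("demo_chunks")

# Store a couple of text chunks; ChromaDB embeds them with its default model
collection.add(
    documents=[
        "Gemini-2.0 is Google's latest multimodal LLM.",
        "ChromaDB stores and searches vector embeddings.",
    ],
    ids=["chunk-1", "chunk-2"],
)

# Retrieve the chunk most similar to the query
results = collection.query(query_texts=["Which database handles embeddings?"], n_results=1)
print(results["documents"][0])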
What is Streamlit UI?
Streamlit is an open-source Python library designed for quickly building interactive, data-driven web applications for AI/ML projects. Its focus on simplicity enables developers to create visually appealing and functional apps with minimal effort.
Key Features:
- Ease of Use: Developers can turn Python scripts into web apps with a few lines of code.
- Widgets: It offers a wide range of input widgets (sliders, dropdowns, text inputs) to make applications interactive.
- Data Visualization: It supports integration with popular Python libraries such as Matplotlib, Plotly, and Altair for dynamic visualizations.
- Real-time Updates: Apps automatically rerun when code or inputs change, providing a seamless user experience.
- No Web Development Required: It removes the need to learn HTML, CSS, or JavaScript.
Applications of Streamlit
Streamlit is widely used for building dashboards, exploratory data analysis tools, and AI/ML application prototypes. Its simplicity and interactivity make it ideal for rapid prototyping and for sharing insights with non-technical stakeholders. We are using Streamlit to design the interface for our application.
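As a quick illustration of that simplicity, the following few lines are a complete (if trivial) Streamlit app; save it as hello_app.py and start it with streamlit run hello_app.py (the file name and widget labels are arbitrary).

import streamlit as st

st.title("Hello, Streamlit")
name = st.text_input("Your name", value="world")
if st.button("Greet"):
    st.success(f"Hello, {name}!")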
Motivation for Tweet Generation Automation
The primary motivations behind automating tweet thread generation include:
- Time efficiency: Reducing the manual effort required to create engaging Twitter threads.
- Consistency: Maintaining a consistent voice and format across all threads.
- Scalability: Processing multiple articles quickly and efficiently.
- Enhanced engagement: Leveraging AI to create more compelling and shareable content.
- Content optimization: Using data-driven approaches to structure threads effectively.
Project Environment Setup Using Conda
To set up the project environment, follow these steps:
# create a new conda env
conda create -n tweet-gen python=3.11
conda activate tweet-gen
Install the required packages:
pip install langchain langchain-community langchain-google-genai
pip install chromadb streamlit python-dotenv pypdf pydantic
Now create a project folder named BlogToTweet (or whatever you prefer).
Also, create a .env file in your project root. Get your GOOGLE_API_KEY from here and put it in the .env file:
GOOGLE_API_KEY="<your API KEY>"
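Optionally, you can verify that the key is picked up with a tiny check script run from the project root (the script name is up to you):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory
print("GOOGLE_API_KEY loaded:", os.getenv("GOOGLE_API_KEY") is not None)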
We are all set to dive into the main implementation part.
Project Implementation
In our project, there are four important files, each with its own role to keep development clean:
- services: contains all the important services.
- models: contains all the important Pydantic data models.
- main: for testing the automation in the terminal.
- app: for the Streamlit UI implementation.
Implementing Models
We will start by implementing the Pydantic data models in the models.py file. What is Pydantic? Read this.
from typing import Optional, List
from pydantic import BaseModel

class ArticleContent(BaseModel):
    title: str
    content: str
    author: Optional[str]
    url: str

class TwitterThread(BaseModel):
    tweets: List[str]
    hashtags: List[str]
These are simple yet important models that give the article content and the generated tweets a consistent structure.
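For example, the models can be constructed like this (all values below are purely illustrative):

article = ArticleContent(
    title="Building LLM-Powered Apps",
    content="Full extracted text of the article...",
    author=None,  # Optional field
    url="data/build_llm_powered_app.pdf",
)
thread = TwitterThread(
    tweets=["1/5 Here's why this matters...", "2/5 The key idea is..."],
    hashtags=["#AI", "#LLM"],
)
print(article.title, len(thread.tweets))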
Implementing Services
The ContentRepurposer class handles the core functionality of the application. Here is the skeletal structure of that class.
# services.py
import os
from dotenv import load_dotenv
from typing import List

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

from models import ArticleContent, TwitterThread

class ContentRepurposer:
    def __init__(self):
        pass

    def process_pdf(self, pdf_path: str) -> ArticleContent:
        pass

    def get_relevant_chunks(self, query: str, k: int = 3) -> List[str]:
        pass

    def generate_twitter_thread(self, article: ArticleContent) -> TwitterThread:
        pass

    def process_article(self, pdf_path: str) -> TwitterThread:
        pass
In the __init__ method, we set up all the important components of the class.
def __init__(self):
    from pydantic import SecretStr

    google_api_key = os.getenv("GOOGLE_API_KEY")
    if google_api_key is None:
        raise ValueError("GOOGLE_API_KEY environment variable is not set")
    _google_api_key = SecretStr(google_api_key)

    # Initialize Gemini model and embeddings
    self.embeddings = GoogleGenerativeAIEmbeddings(
        model="models/embedding-001",
    )
    self.llm = ChatGoogleGenerativeAI(
        model="gemini-2.0-flash-exp",
        temperature=0.7)

    # Initialize text splitter
    self.text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", " ", ""]
    )
Here, we use Pydantic's SecretStr for secure handling of the API key. To embed our articles we use the GoogleGenerativeAIEmbeddings class with the embedding-001 model, and to create the tweets from the article we use the ChatGoogleGenerativeAI class with the latest gemini-2.0-flash-exp model. RecursiveCharacterTextSplitter is used to split a large document into parts; here we split the document with a chunk_size of 1000 characters and a 200-character overlap.
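To see what those splitter settings mean in practice, here is a small standalone sketch (the sample text is made up):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # maximum characters per chunk
    chunk_overlap=200,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", " ", ""],
)
sample_text = ("Paragraph about LLMs.\n\n" + "More details. " * 40) * 5
chunks = splitter.split_text(sample_text)
print(len(chunks), "chunks; first chunk length:", len(chunks[0]))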
Processing the PDF
The system processes PDFs using PyPDFLoader from LangChain and implements text chunking.
def process_pdf(self, pdf_path: str) -> ArticleContent:
    """Process a local PDF and create embeddings"""
    # Load PDF
    loader = PyPDFLoader(pdf_path)
    pages = loader.load()

    # Extract text
    text = " ".join(page.page_content for page in pages)

    # Split text into chunks
    chunks = self.text_splitter.split_text(text)

    # Create and store embeddings in Chroma
    self.vectordb = Chroma.from_texts(
        texts=chunks,
        embedding=self.embeddings,
        persist_directory="./data/chroma_db"
    )

    # Extract title and author
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    title = lines[0] if lines else "Untitled"
    author = lines[1] if len(lines) > 1 else None

    return ArticleContent(
        title=title,
        content=text,
        author=author,
        url=pdf_path
    )
In the code above, we implement the PDF processing functionality of the application (a quick usage sketch follows the list below).
- Load and Extract PDF Text: PyPDFLoader reads the PDF file and extracts the text content from all pages, concatenating it into a single string.
- Split Text into Chunks: The text is divided into smaller chunks using the text_splitter for better processing and embedding creation.
- Generate Embeddings: Chroma creates vector embeddings from the text chunks and stores them in a persistent database directory.
- Extract Title and Author: The first non-empty line is used as the title, and the second as the author.
- Return Article Content: An ArticleContent object is constructed containing the title, full text, author, and file path.
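A quick usage sketch of this method on its own (the PDF path below is hypothetical):

repurposer = ContentRepurposer()
article = repurposer.process_pdf("data/sample_article.pdf")  # hypothetical path
print(article.title, "|", article.author)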
Getting the Relevant Chunks
def get_relevant_chunks(self, query: str, k: int = 3) -> List[str]:
    """Retrieve relevant chunks from the vector database"""
    results = self.vectordb.similarity_search(query, k=k)
    return [doc.page_content for doc in results]
This method retrieves the top k (default 3) most relevant text chunks from the vector database based on their similarity to the given query.
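Continuing the previous sketch, once process_pdf has populated the vector store you could query it like this (the query string is arbitrary):

chunks = repurposer.get_relevant_chunks("key takeaways of the article", k=2)
for chunk in chunks:
    print(chunk[:80], "...")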
Generating the Tweet Thread from the Article
This method is the most important, because here we bring the generative AI model, the embeddings, and the prompt together to generate a thread from the user's PDF file.
def generate_twitter_thread(self, article: ArticleContent) -> TwitterThread:
    """Generate Twitter thread using Gemini"""
    # First, get the most relevant chunks for different aspects
    intro_chunks = self.get_relevant_chunks("introduction and main points")
    technical_chunks = self.get_relevant_chunks("technical details and implementation")
    conclusion_chunks = self.get_relevant_chunks("conclusion and key takeaways")

    thread_prompt = PromptTemplate(
        input_variables=["title", "intro", "technical", "conclusion"],
        template="""
        Write an engaging Twitter thread (8-10 tweets) summarizing this technical article in an approachable and human-like style.

        Title: {title}

        Introduction Context:
        {intro}

        Technical Details:
        {technical}

        Key Takeaways:
        {conclusion}

        Guidelines:
        1. Start with a hook that grabs attention (e.g., a surprising fact, bold statement, or thought-provoking question).
        2. Use a conversational tone and explain complex details simply, without jargon.
        3. Include concise tweets under 280 characters, following the 1/n numbering format.
        4. Break down the key insights logically, and make each tweet build curiosity for the next one.
        5. Include relevant examples, analogies, or comparisons to aid understanding.
        6. End the thread with a strong conclusion and a call to action (e.g., "Read the full article," "Follow for more insights").
        7. Make it relatable, educational, and engaging.

        Output format:
        - A numbered list of tweets, with each tweet on a new line.
        - After the tweets, suggest 3-5 hashtags that summarize the thread, starting with #.
        """
    )

    chain = LLMChain(llm=self.llm, prompt=thread_prompt)
    result = chain.run({
        "title": article.title,
        "intro": "\n".join(intro_chunks),
        "technical": "\n".join(technical_chunks),
        "conclusion": "\n".join(conclusion_chunks)
    })

    # Parse the result into tweets and hashtags
    lines = result.split("\n")
    tweets = [line.strip() for line in lines if line.strip() and not line.strip().startswith("#")]
    hashtags = [tag.strip() for tag in lines if tag.strip().startswith("#")]

    # Ensure we have at least one tweet and hashtag
    if not tweets:
        tweets = ["Thread about " + article.title]
    if not hashtags:
        hashtags = ["#AI", "#TechNews"]

    return TwitterThread(tweets=tweets, hashtags=hashtags)
Let's understand what is happening in the above code, step by step (a short usage sketch follows the list).
- Retrieve Relevant Chunks: The method first extracts relevant chunks of text for the introduction, technical details, and conclusion using the get_relevant_chunks method.
- Prepare a Prompt: A PromptTemplate is created with instructions to write an engaging Twitter thread summarizing the article, including details on tone, structure, and formatting guidelines.
- Run the LLM Chain: An LLMChain combines the LLM and the prompt to generate a thread based on the article's title and the extracted chunks.
- Parse Results: The generated output is split into tweets and hashtags, ensuring proper formatting and extracting the required components.
- Return Twitter Thread: The method returns a TwitterThread object containing the formatted tweets and hashtags.
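Putting the last two methods together, usage looks like this (continuing the earlier sketch):

thread = repurposer.generate_twitter_thread(article)
print(thread.tweets[0])
print(" ".join(thread.hashtags))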
Processing the Article
This method processes a PDF file to extract its content, generates a Twitter thread summarizing it, and finally returns the TwitterThread.
def process_article(self, pdf_path: str) -> TwitterThread:
    """Main method to process an article and generate content"""
    try:
        article = self.process_pdf(pdf_path)
        thread = self.generate_twitter_thread(article)
        return thread
    except Exception as e:
        print(f"Error processing article: {str(e)}")
        raise
Up to this point, we have implemented all the code required for this project. Now there are two ways we can proceed further:
- Implementing the main file for testing, and
- Implementing the Streamlit application for the web interface.
If you don't want to test the application in terminal mode, you can skip the main file implementation and go directly to the Streamlit application implementation.
Implementing the Main File for Testing
Now, we put all the modules together to test the application.
import os
from dotenv import load_dotenv
from services import ContentRepurposer

def main():
    # Load environment variables
    load_dotenv()
    google_api_key = os.getenv("GOOGLE_API_KEY")
    if not google_api_key:
        raise ValueError("GOOGLE_API_KEY environment variable not found")

    # Initialize repurposer
    repurposer = ContentRepurposer()

    # Path to your local PDF
    # pdf_path = "data/guide_to_jax.pdf"
    pdf_path = "data/build_llm_powered_app.pdf"

    try:
        thread = repurposer.process_article(pdf_path)

        print("Generated Twitter Thread:")
        for i, tweet in enumerate(thread.tweets, 1):
            print(f"\nTweet {i}/{len(thread.tweets)}:")
            print(tweet)

        print("\nSuggested Hashtags:")
        print(" ".join(thread.hashtags))
    except Exception as e:
        print(f"Failed to process article: {str(e)}")

if __name__ == "__main__":
    main()
Here, you can see that the script simply imports all the modules, checks that GOOGLE_API_KEY is available, initializes the ContentRepurposer class, and then, inside the try block, creates a thread by calling the process_article() method on the repurposer object. Finally, it prints the tweets to the terminal and handles any exceptions.
To test the application, create a folder named data in your project root and put your downloaded PDF there. To download an article from Analytics Vidhya, go to any article, click the download button, and save it.
Now, in your terminal:
python main.py
Example Blog 1 Output
Example Blog 2 Output
I think you get an idea of how useful the application is! Let's make it more aesthetically pleasing.
Implementing the Streamlit App
Now we will do much the same as above, in a more UI-centric way.
Importing Libraries and Environment Configuration
import os
import streamlit as st
from dotenv import load_dotenv
from services import ContentRepurposer
import pyperclip
from pathlib import Path

# Load environment variables
load_dotenv()

# Set page configuration
st.set_page_config(page_title="Content Repurposer", page_icon="🐦", layout="wide")
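Note that pyperclip (used for the copy-to-clipboard buttons) was not included in the earlier pip install command; if you don't have it yet, install it before running the app:

pip install pyperclip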
Custom CSS
# Custom CSS
st.markdown(
    """
    <style>
    .tweet-box {
        background-color: #181211;
        border: 1px solid #e1e8ed;
        border-radius: 10px;
        padding: 15px;
        margin: 10px 0;
    }
    .copy-button {
        background-color: #1DA1F2;
        color: white;
        border: none;
        border-radius: 5px;
        padding: 5px 10px;
        cursor: pointer;
    }
    .main-header {
        color: #1DA1F2;
        text-align: center;
    }
    .hashtag {
        color: #1DA1F2;
        background-color: #E8F5FE;
        padding: 5px 10px;
        border-radius: 15px;
        margin: 5px;
        display: inline-block;
    }
    </style>
    """,
    unsafe_allow_html=True,
)
Here, we have defined some CSS styling for the web page elements (tweet boxes, copy buttons, hashtags). Is CSS confusing to you? Visit W3Schools.
Some Important Functions
def create_temp_pdf(uploaded_file):
    """Create a temporary PDF file from uploaded content"""
    temp_dir = Path("temp")
    temp_dir.mkdir(exist_ok=True)

    temp_path = temp_dir / "uploaded_pdf.pdf"
    with open(temp_path, "wb") as f:
        f.write(uploaded_file.getvalue())
    return str(temp_path)

def initialize_session_state():
    """Initialize session state variables"""
    if "tweets" not in st.session_state:
        st.session_state.tweets = None
    if "hashtags" not in st.session_state:
        st.session_state.hashtags = None

def copy_text_and_show_success(text, success_key):
    """Copy text to clipboard and show a success message"""
    try:
        pyperclip.copy(text)
        st.success("Copied to clipboard!", icon="✅")
    except Exception as e:
        st.error(f"Failed to copy: {str(e)}")
Here, the create_temp_pdf() function creates a temp directory in the project folder and stores the uploaded PDF there for the rest of the process.
The initialize_session_state() function checks whether the tweets and hashtags already exist in the Streamlit session state.
The copy_text_and_show_success() function uses the Pyperclip library to copy the tweets or hashtags to the clipboard and shows a success message when the copy worked.
Main Function
def main():
    initialize_session_state()

    # Header
    st.markdown(
        '<h1 class="main-header">📄 Content to Twitter Thread 🐦</h1>',
        unsafe_allow_html=True,
    )

    # Create two columns for layout
    col1, col2 = st.columns([1, 1])

    with col1:
        st.markdown("### Upload PDF")
        uploaded_file = st.file_uploader("Drop your PDF here", type=["pdf"])

        if uploaded_file:
            st.success("PDF uploaded successfully!")

            if st.button("Generate Twitter Thread", key="generate"):
                with st.spinner("Generating Twitter thread..."):
                    try:
                        # Get Google API key
                        google_api_key = os.getenv("GOOGLE_API_KEY")
                        if not google_api_key:
                            st.error(
                                "Google API key not found. Please check your .env file."
                            )
                            return

                        # Save uploaded file
                        pdf_path = create_temp_pdf(uploaded_file)

                        # Process PDF and generate thread
                        repurposer = ContentRepurposer()
                        thread = repurposer.process_article(pdf_path)

                        # Store results in session state
                        st.session_state.tweets = thread.tweets
                        st.session_state.hashtags = thread.hashtags

                        # Clean up temporary file
                        os.remove(pdf_path)
                    except Exception as e:
                        st.error(f"Error generating thread: {str(e)}")

    with col2:
        if st.session_state.tweets:
            st.markdown("### Generated Twitter Thread")

            # Copy entire thread section
            st.markdown("#### Copy Full Thread")
            all_tweets = "\n\n".join(st.session_state.tweets)
            if st.button("📋 Copy Entire Thread"):
                copy_text_and_show_success(all_tweets, "thread")

            # Display individual tweets
            st.markdown("#### Individual Tweets")
            for i, tweet in enumerate(st.session_state.tweets, 1):
                tweet_col1, tweet_col2 = st.columns([4, 1])
                with tweet_col1:
                    st.markdown(
                        f"""
                        <div class="tweet-box">
                            <p>{tweet}</p>
                        </div>
                        """,
                        unsafe_allow_html=True,
                    )
                with tweet_col2:
                    if st.button("📋", key=f"tweet_{i}"):
                        copy_text_and_show_success(tweet, f"tweet_{i}")

            # Display hashtags
            if st.session_state.hashtags:
                st.markdown("### Suggested Hashtags")

                # Display hashtags with copy button
                hashtags_text = " ".join(st.session_state.hashtags)
                hashtags_col1, hashtags_col2 = st.columns([4, 1])
                with hashtags_col1:
                    hashtags_html = " ".join(
                        [
                            f'<span class="hashtag">{hashtag}</span>'
                            for hashtag in st.session_state.hashtags
                        ]
                    )
                    st.markdown(hashtags_html, unsafe_allow_html=True)
                with hashtags_col2:
                    if st.button("📋 Copy Tags"):
                        copy_text_and_show_success(hashtags_text, "hashtags")

if __name__ == "__main__":
    main()
If you read this code closely, you will see that Streamlit creates two columns: one for the PDF uploader and the other for showing the generated tweets.
In the first column, we do much the same as in the earlier main.py, with some extra markdown, plus buttons for uploading the PDF and generating the thread using Streamlit widgets.
In the second column, Streamlit iterates over the generated thread, puts each tweet in a tweet box, and creates a copy button for each individual tweet; at the end, it shows all the hashtags along with their own copy button.
Now the fun part!
Open your terminal and type:
streamlit run app.py
If everything is done right, it will open a Streamlit application in your default browser.
Now, drag and drop your downloaded PDF onto the upload box (it will automatically upload the PDF to the system), and click the Generate Twitter Thread button to generate tweets.
You can copy the full thread or individual tweets using the respective copy buttons.
I hope doing hands-on projects like this helps you learn many practical concepts about Generative AI, Python libraries, and programming. Happy coding, stay healthy.
All the code used in this article is here.
Conclusion
This project demonstrates the power of combining modern AI technologies to automate content repurposing. By leveraging Gemini-2.0 and ChromaDB, we have created a system that not only saves time but also maintains high-quality output. The modular architecture ensures easy maintenance and extensibility, while the Streamlit interface makes the tool accessible to non-technical users.
Key Takeaways
- The project demonstrates successful integration of cutting-edge AI tools for practical content automation.
- The architecture's modularity allows for easy maintenance and future enhancements, making it a sustainable solution for content repurposing.
- The Streamlit interface makes the tool accessible to content creators without technical expertise, bridging the gap between complex AI technology and practical usage.
- The implementation can handle various content types and volumes, making it suitable for both individual content creators and large organizations.
Frequently Asked Questions
Q. How does the system handle long articles?
A. The system uses RecursiveCharacterTextSplitter to break long articles into manageable chunks, which are then embedded and stored in ChromaDB. When generating threads, it retrieves the most relevant chunks using similarity search.
Q. What temperature setting is used for thread generation?
A. We used a temperature of 0.7, which provides a balance between creativity and coherence. You can adjust this setting based on your specific needs, with higher values (>0.7) producing more creative output and lower values (<0.7) producing more focused content.
Q. How does the system keep tweets within Twitter's character limit?
A. The prompt template explicitly specifies the 280-character limit, and the LLM is instructed to respect this constraint. You can add additional validation to ensure compliance programmatically.
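For instance, a minimal post-processing sketch (assuming a TwitterThread object like the one returned by process_article) could trim over-long tweets:

def enforce_tweet_length(tweets, limit=280):
    """Trim any tweet that exceeds the limit, adding an ellipsis."""
    return [t if len(t) <= limit else t[: limit - 1] + "…" for t in tweets]

thread.tweets = enforce_tweet_length(thread.tweets)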
The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.