Document Summarization API

Welcome to the Document Summarization API, a Flask-based backend paired with a Vite-powered front end. The project generates concise summaries from PDFs and images using three NLP models: Pegasus, BERT, and LegalBERT.

A model selection step picks a summarizer suited to general or legal documents. The responsive front end handles file uploads, summary customization, and results display, and is built with modern JavaScript tooling.

Features

Multi-Model Summarization:
Uses Pegasus (abstractive), BERT (extractive), and LegalBERT (legal extractive) models.
Smart Model Selection:
TF-IDF-based logic to auto-select the best summarization approach.
File Type Support:
Accepts PDFs, PNGs, and JPEGs; extracts text with pdfplumber and pytesseract.
Customizable Summaries:
Choose between normal or long summary lengths for Pegasus.
Fine-Tuning Ready:
Supports training on CNN/DailyMail and BillSum datasets.
Modern Front End:
Built with Vite and [React/Vue/Vanilla JS] for fast, responsive UX.
RESTful API:
Seamless backend integration and scalability.
Modular Architecture:
Clean codebase for easy extension and maintenance.
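
The README does not spell out how the TF-IDF-based selection works, so the following is only a minimal sketch of one way such a step could look. The legal vocabulary, the background corpus, the threshold, and the function names are all hypothetical. The sketch only separates legal from general text and returns "pegasus" for the general case; how the project chooses between Pegasus and BERT is not described here.

```python
import math
import re
from collections import Counter

# Hypothetical legal vocabulary; the project's real term list is not shown in this README.
LEGAL_TERMS = {"plaintiff", "defendant", "statute", "court", "pursuant", "herein", "amended"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_weights(doc_tokens, background_docs):
    """Return {term: tf * idf} for one document against a background corpus."""
    tf = Counter(doc_tokens)
    n_docs = len(background_docs) + 1  # background documents plus the document itself
    weights = {}
    for term, count in tf.items():
        df = 1 + sum(1 for doc in background_docs if term in doc)
        idf = math.log(n_docs / df) + 1.0  # smoothed so common terms keep a small weight
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def select_model(text, background_texts, threshold=0.15):
    """Pick 'legalbert' when legal terms carry enough of the TF-IDF mass."""
    tokens = tokenize(text)
    if not tokens:
        return "pegasus"  # assumed fallback for empty input
    background = [set(tokenize(t)) for t in background_texts]
    weights = tfidf_weights(tokens, background)
    legal_mass = sum(w for term, w in weights.items() if term in LEGAL_TERMS)
    return "legalbert" if legal_mass / sum(weights.values()) >= threshold else "pegasus"
```

Weighting by TF-IDF rather than raw counts keeps very common words from diluting the signal, which is presumably why the project uses it for this decision.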

Prerequisites

Python 3.8+
Node.js 18+ and npm
Tesseract OCR
- Ubuntu: sudo apt-get install tesseract-ocr
- macOS: brew install tesseract
- Windows: install a build from Tesseract at UB Mannheim
Python dependencies listed in requirements.txt
A modern web browser (Chrome, Firefox)
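
Before installing, it can help to confirm that the command-line prerequisites are visible on the PATH. The script below is a small sketch that is not part of the repository; the function name is hypothetical, and it does not check the Node.js version (18+), only that the executables can be found.

```python
import shutil
import sys


def check_prerequisites():
    """Report which README prerequisites are visible from this interpreter."""
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "tesseract": shutil.which("tesseract") is not None,
        "node": shutil.which("node") is not None,
        "npm": shutil.which("npm") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```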

Note on Datasets and Fine-Tuned Models

Datasets:
Not included in the repo. Download them and place them at the paths specified in config.py:
- CNN/DailyMail: Hugging Face
- BillSum: BillSum GitHub
Fine-Tuned Models:
The fine_tuned_models/ folder is not included. If it is missing, the models are trained on first run, provided the datasets are in place.

⚠️ GPU strongly recommended for fine-tuning.
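
The "train on first run if missing" behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in models/model_loader.py: load_fn and train_fn are hypothetical callables standing in for the project's real loading and fine-tuning routines.

```python
from pathlib import Path


def missing_datasets(paths):
    """Return the dataset paths that do not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]


def load_or_train(model_dir, dataset_paths, load_fn, train_fn):
    """Load a fine-tuned model if its folder has content; otherwise train one.

    load_fn(model_dir) and train_fn(dataset_paths, model_dir) are hypothetical
    callables; the real project may structure this differently.
    """
    model_dir = Path(model_dir)
    if model_dir.is_dir() and any(model_dir.iterdir()):
        return load_fn(model_dir)
    missing = missing_datasets(dataset_paths)
    if missing:
        # Fail early instead of starting a long training run without data
        raise FileNotFoundError(f"Datasets not found: {missing}")
    model_dir.mkdir(parents=True, exist_ok=True)
    return train_fn(dataset_paths, model_dir)
```

Checking the dataset paths before training gives a clear error when config.py still points at placeholder locations.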

Installation

1. Clone the Repository

git clone https://github.com/your-username/document-summarization-api.git
cd document-summarization-api

2. Set Up the Backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Install Tesseract OCR

Follow the instructions under Prerequisites for your platform.

4. Set Up the Vite Front End

cd frontend
npm install
npm run dev

5. Prepare Datasets

Download the datasets and update the paths in config.py:

CNN_DAILYMAIL_PATH = "path/to/cnn_dailymail/train.csv"
BILLSUM_PATH = "path/to/billsum/us_train_data_final_OFFICIAL.jsonl"

6. Run the Application

In the project root:

python app.py

The backend listens on http://0.0.0.0:5000 (reachable locally at http://localhost:5000). Make sure the Vite front end is running in a separate terminal:

cd frontend
npm run dev

🧪 Usage

🔹 Front-End Interface

Access: http://localhost:5173
Upload: Drag-and-drop or select a PDF/PNG/JPEG file (max 100 MB)
Select: Summary length (normal or long)
Result: View the summary and the model used

Features:

Responsive design
Drag-and-drop support
Real-time loading indicator
Clear summary display with model info

API Endpoint

POST /summarize

Request Content-Type: multipart/form-data

Fields:

file: PDF, PNG, or JPEG

length (optional): normal or long for Pegasus
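
As an illustration of the checks these fields imply, here is a hedged sketch of request validation. It is not taken from app.py: the function name is hypothetical, the default of "normal" when length is omitted is an assumption, and treating the 100 MB upload limit as binary megabytes is also an assumption.

```python
import os

ALLOWED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # 100 MB limit; binary megabytes are an assumption
ALLOWED_LENGTHS = {"normal", "long"}


def validate_request(filename, size_bytes, length=None):
    """Return (ok, message, normalized_length) for a /summarize upload."""
    ext = os.path.splitext(filename or "")[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"Unsupported file type: {ext or 'none'}", None
    if size_bytes > MAX_UPLOAD_BYTES:
        return False, "File exceeds the 100 MB limit", None
    length = (length or "normal").lower()  # assumes 'normal' is the default
    if length not in ALLOWED_LENGTHS:
        return False, "length must be 'normal' or 'long'", None
    return True, "ok", length
```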

Example request:

curl -X POST -F "file=@document.pdf" -F "length=normal" http://localhost:5000/summarize

Example response:

{
  "model_used": "pegasus",
  "summary": "This is a concise summary of the uploaded document."
}
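
The same request can be sent from Python using only the standard library. The sketch below builds the multipart/form-data body by hand; the helper names are hypothetical, and it assumes the backend is running locally on port 5000 as in the curl example.

```python
import json
import mimetypes
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\nContent-Type: {file_type}\r\n\r\n'.encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def summarize(path, length="normal", url="http://localhost:5000/summarize"):
    """POST a file to the /summarize endpoint and return the parsed JSON."""
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"length": length}, "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling summarize("document.pdf") should return a dictionary with the model_used and summary keys shown in the response above, provided the server is up.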

Project Structure

document-summarization-api/
├── app.py # Main Flask backend
├── config.py # Config (dataset paths, constants)
├── frontend/ # Vite-powered front end
│   ├── src/ # Source files (React/Vue/Vanilla components)
│   ├── package.json # Front-end dependencies
│   └── vite.config.js # Vite configuration
├── models/
│   ├── model_loader.py # Load/fine-tune models
│   └── summarizer.py # Summarization logic
├── utils/
│   ├── text_extraction.py # PDF/image text extraction
│   └── model_selector.py # Model selection logic
├── uploads/ # Temp uploaded files (auto-created)
├── fine_tuned_models/ # Trained models (auto-created)
├── requirements.txt # Python backend dependencies
└── README.md
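
To show how the modules in this layout might fit together, here is a rough sketch of the extract, select, summarize flow. The function names and signatures are assumptions for illustration and may not match text_extraction.py, model_selector.py, or summarizer.py. The pdfplumber and pytesseract calls are the libraries' standard entry points, imported lazily so the sketch loads without them installed.

```python
import os


def extract_text(path):
    """Extract text from a PDF (pdfplumber) or an image (pytesseract)."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        import pdfplumber  # imported lazily so the module loads without it

        with pdfplumber.open(path) as pdf:
            # extract_text() can return None for pages without a text layer
            return "\n".join(page.extract_text() or "" for page in pdf.pages)
    if ext in {".png", ".jpg", ".jpeg"}:
        import pytesseract
        from PIL import Image

        return pytesseract.image_to_string(Image.open(path))
    raise ValueError(f"Unsupported file type: {ext}")


def run_pipeline(path, length, extract, select, summarizers):
    """Wire the three stages together; every callable is injected.

    select(text) returns a model name, and summarizers maps that name to a
    callable(text, length). These signatures are hypothetical.
    """
    text = extract(path)
    if not text.strip():
        raise ValueError("No text could be extracted from the file")
    model_name = select(text)
    summary = summarizers[model_name](text, length)
    return {"model_used": model_name, "summary": summary}
```

Injecting the stages as callables keeps the sketch testable without loading any model, and the returned dictionary mirrors the JSON shape of the /summarize response.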

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

For questions or feedback, reach out via:

GitHub Issues

Email: praful101nayak@gmail.com
