Reach out to me via LinkedIn, the portfolio contact form, or mail@pascal-nehlsen.de

PDF Download/Scan Metadata Tool

This repository contains a Python tool that downloads all PDF documents linked from a given webpage, extracts their metadata, and saves the result to a CSV file.

Only for Testing Purposes

This tool is intended for educational and authorized penetration testing purposes only. Unauthorized use of this tool against systems that you do not have explicit permission to test is illegal and unethical.

Table of Contents

Features
Getting Started
- Prerequisites
- Installation
Usage
- Command-Line Options
- Examples
Output CSV
Error Handling

Features

This tool offers the following features:

Download PDFs: Automatically downloads all PDF files found on a given webpage.
Extract Metadata: Extracts metadata from each PDF, including the following fields:
- Title
- Author
- Creator
- Created (Creation Date)
- Modified (Modification Date)
- Subject
- Keywords
- Description
- Producer
- PDF Version
Save to CSV: The extracted metadata is saved into a CSV file, where each row represents one PDF document.
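
The first feature, finding the PDF files on a page, can be sketched roughly as follows. This is an illustration, not the tool's actual code: it uses the standard-library `html.parser` as a stand-in for beautifulsoup4 so that it runs without extra packages, and the names `PdfLinkParser` and `find_pdf_links` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.links:
            self.links.append(absolute)


def find_pdf_links(html, base_url):
    """Return the de-duplicated PDF links found in an HTML string."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links


page = '<a href="/docs/report.pdf">Report</a> <a href="about.html">About</a>'
print(find_pdf_links(page, "https://example.com/index.html"))
```

Matching on the file extension is a simplifying assumption; a PDF served from a URL without a `.pdf` suffix would be missed by this sketch.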

Getting Started

Prerequisites

Before running the script, make sure you have the following installed:

Python 3.7 or higher
Python libraries:
- requests
- beautifulsoup4
- PyPDF2

You can install these dependencies using pip:

pip install requests beautifulsoup4 PyPDF2
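
To confirm that the prerequisites are in place before running the tool, a short check along these lines may help. It is a sketch that is not part of the repository; the helper name `missing_modules` is hypothetical.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names can differ from pip package names:
# the beautifulsoup4 package is imported as bs4.
required = ["requests", "bs4", "PyPDF2"]

if sys.version_info < (3, 7):
    print("Python 3.7 or higher is required")

missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies found")
```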

Installation

Clone the Repository:

git clone https://github.com/yourusername/metascan.git
cd metascan

Usage

The tool is run from the command line and is configured through two options.

Command-Line Options

Option	Description	Required
`-u`	The URL of the webpage to scan for PDF files	Yes
`-n`	The name of the output CSV file	Yes
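
For illustration, the two options could be declared with `argparse` as shown below. This is a sketch under assumptions: only the flags `-u` and `-n` and their required status come from the table above, while the function name `build_parser` and the attribute names `url` and `name` are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with the two required options from the table."""
    parser = argparse.ArgumentParser(
        description="Download PDFs from a webpage and export their metadata to CSV."
    )
    parser.add_argument("-u", dest="url", required=True,
                        help="URL of the webpage to scan for PDF files")
    parser.add_argument("-n", dest="name", required=True,
                        help="Name of the output CSV file")
    return parser


# Parse an explicit argument list here; a real script would call parse_args()
# without arguments to read sys.argv.
args = build_parser().parse_args(["-u", "https://example.com", "-n", "output.csv"])
print(args.url, args.name)
```

With both options marked as required, `argparse` exits with a usage message if either one is omitted.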

Examples

Download PDFs and Extract Metadata from a Webpage

To download all PDFs from a given webpage and extract their metadata:

python metascan.py -u https://example.com -n output.csv

This command will scan https://example.com for PDF files, download them, and save their metadata to a file called output.csv.

Specifying a Different Output File

You can specify a different name for the output CSV file:

python metascan.py -u https://example.com -n my_metadata.csv

This will save the metadata to my_metadata.csv instead of output.csv.

Output CSV

The output CSV file contains the metadata for each PDF file in a structured format. Each row corresponds to one PDF, and there is one column for each metadata field listed under Features.

Example CSV Output

The contents of the generated CSV file look like the table below:

Title	Author	Creator	Created	Modified	Subject	Keywords	Description	Producer	PDF Version
Sample PDF	John Doe	PDF Tool	2022-01-01	2022-01-05	Report	Data	Sample file	Adobe	1.7
Example Doc	Jane Roe	PDFGen	2023-05-03	2023-05-10	Invoice	Billing	Invoice file	LibreOffice	1.6
Test PDF	None	None	2020-12-12	2020-12-13	Manual	None	User manual	Foxit	1.4
Report 2022	Mark Smith	ReportGen	2022-02-15	2022-02-16	Annual	Report	Yearly Report	Adobe	1.7

The entries in the CSV file are separated by semicolons (;).
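
A semicolon-separated file of this shape can be produced with the standard `csv` module, roughly as follows. The column names are taken from the example table; the function name `write_metadata_csv` and the choice to write the literal text `None` for missing fields (mirroring the example rows) are assumptions.

```python
import csv
import io

FIELDS = ["Title", "Author", "Creator", "Created", "Modified",
          "Subject", "Keywords", "Description", "Producer", "PDF Version"]


def write_metadata_csv(rows, stream):
    """Write one row per PDF, using ';' as the delimiter."""
    # restval fills in any field that a row dictionary does not contain.
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter=";", restval="None")
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; for a real file,
# open it with open(path, "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_metadata_csv(
    [{"Title": "Test PDF", "Created": "2020-12-12",
      "Producer": "Foxit", "PDF Version": "1.4"}],
    buffer,
)
print(buffer.getvalue())
```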

Error Handling

If a PDF file cannot be read or its metadata is incomplete, the tool logs the error and moves on to the next file instead of aborting the run. It also prints a message in the terminal identifying each file that had issues.

Example error message:

Error reading metadata from pdf_downloads/document.pdf: EOF marker not found
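
The log-and-continue behavior described above can be sketched as a loop that catches errors per file. This is not the tool's actual code: `collect_metadata` is a hypothetical name, and the reader is passed in as a function so that the example runs without PyPDF2; `fake_reader` only simulates a failure.

```python
def collect_metadata(paths, read_metadata):
    """Read metadata for each path; report failures and keep going."""
    results = []
    failed = []
    for path in paths:
        try:
            results.append(read_metadata(path))
        except Exception as exc:
            # One unreadable file must not stop the whole run.
            print(f"Error reading metadata from {path}: {exc}")
            failed.append(path)
    return results, failed


def fake_reader(path):
    """Stand-in for a real PDF reader; fails for one specific file."""
    if path.endswith("document.pdf"):
        raise ValueError("EOF marker not found")
    return {"Title": path}


rows, failed = collect_metadata(
    ["pdf_downloads/ok.pdf", "pdf_downloads/document.pdf"], fake_reader
)
print(rows)
print(failed)
```

Catching `Exception` broadly is a deliberate simplification here; a real implementation might catch only the reader's own error types.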
