Link: GitHub Repository
Reach out to me via LinkedIn, the Portfolio Contact Form, or mail@pascal-nehlsen.de
PDF Download/Scan Metadata Tool
This repository contains a Python tool that automates the process of downloading all PDF documents from a given webpage and extracting their metadata. The extracted metadata is then saved into a CSV file.
This tool is intended for educational and authorized penetration testing purposes only. Unauthorized use of this tool against systems that you do not have explicit permission to test is illegal and unethical.
Features
This tool offers the following features:
- Download PDFs: Automatically downloads all PDF files found on a given webpage.
- Extract Metadata: Extracts metadata from each PDF, including the following fields:
  - Title
  - Author
  - Creator
  - Created (Creation Date)
  - Modified (Modification Date)
  - Subject
  - Keywords
  - Description
  - Producer
  - PDF Version
- Save to CSV: The extracted metadata is saved into a CSV file, where each row represents one PDF document.
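As a rough illustration of the "Extract Metadata" step, the sketch below turns a PDF Info dictionary into one row with the fields listed above. It uses only the standard library and takes the dictionary as a plain `dict`; in the actual tool that dictionary would presumably come from PyPDF2, but how the script does this is not shown here. The names `FIELD_KEYS`, `pdf_date_to_iso`, `pdf_version`, and `build_row` are hypothetical, and the `/Description` key is an assumption, since the standard Info dictionary has no such entry.

```python
import re

# Column name -> PDF Info-dictionary key.
# "/Description" is an assumption: it is not a standard Info key.
FIELD_KEYS = {
    "Title": "/Title",
    "Author": "/Author",
    "Creator": "/Creator",
    "Created": "/CreationDate",
    "Modified": "/ModDate",
    "Subject": "/Subject",
    "Keywords": "/Keywords",
    "Description": "/Description",
    "Producer": "/Producer",
}


def pdf_date_to_iso(value):
    """Convert a PDF date such as "D:20220101093000+01'00'" to "2022-01-01"."""
    match = re.match(r"(?:D:)?(\d{4})(\d{2})?(\d{2})?", value or "")
    if not match:
        return None
    year = match.group(1)
    month = match.group(2) or "01"  # month and day are optional in PDF dates
    day = match.group(3) or "01"
    return f"{year}-{month}-{day}"


def pdf_version(header_bytes):
    """Read the version from the file header, e.g. b"%PDF-1.7" -> "1.7"."""
    match = re.match(rb"%PDF-(\d\.\d)", header_bytes)
    return match.group(1).decode("ascii") if match else None


def build_row(info, header_bytes):
    """Build one metadata row; missing fields stay None."""
    row = {}
    for column, key in FIELD_KEYS.items():
        value = info.get(key)
        row[column] = str(value) if value is not None else None
    for column in ("Created", "Modified"):
        row[column] = pdf_date_to_iso(row[column])
    row["PDF Version"] = pdf_version(header_bytes)
    return row
```

Calling `build_row(info, first_bytes_of_file)` for each downloaded file would give one dictionary per PDF, ready to be written out as a CSV row.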
Getting Started
Prerequisites
Before running the script, make sure you have the following installed:
- Python 3.7 or higher
- Python libraries:
  - requests
  - beautifulsoup4
  - PyPDF2
You can install these dependencies using pip:
pip install requests beautifulsoup4 PyPDF2
Installation
Clone the Repository:
git clone https://github.com/yourusername/metascan.git
cd metascan
Usage
This tool is run from the command line and takes two options.
Command-Line Options
| Option | Description | Required |
|---|---|---|
| -u | The URL of the webpage to scan for PDF files | x |
| -n | The name of the output CSV file | x |
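For readers curious how these two options might be parsed, here is a minimal sketch using `argparse` from the standard library. It is not taken from the script itself: the function name `parse_args` and the attribute names `url` and `name` are hypothetical. Treating both options as required is an assumption based on the Required column above.

```python
import argparse


def parse_args(argv=None):
    """Parse the -u and -n options described in the table above."""
    parser = argparse.ArgumentParser(
        description="Download PDFs from a webpage and export their metadata to CSV."
    )
    # Both options are treated as required here (assumption, see lead-in).
    parser.add_argument("-u", dest="url", required=True,
                        help="URL of the webpage to scan for PDF files")
    parser.add_argument("-n", dest="name", required=True,
                        help="Name of the output CSV file")
    return parser.parse_args(argv)
```

With this sketch, `parse_args(["-u", "https://example.com", "-n", "output.csv"])` returns a namespace whose `url` and `name` attributes hold the two values.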
Examples
Download PDFs and Extract Metadata from a Webpage
To download all PDFs from a given webpage and extract their metadata:
python metascan.py -u https://example.com -n output.csv
- This command will scan https://example.com for PDF files, download them, and save their metadata to a file called output.csv.
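The scanning step can be pictured roughly as follows. This sketch collects links ending in `.pdf` from a page's HTML and resolves them against the page URL. It deliberately uses the standard library's `html.parser` as a stand-in so that it runs without extra packages; the tool itself lists beautifulsoup4 as a dependency and may work differently. `PdfLinkParser` and `find_pdf_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check only the path, so query strings such as ?v=2 do not matter.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html, base_url):
    """Return the unique PDF links found in html, in document order."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls
```

Each returned URL could then be fetched and stored locally before its metadata is read.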
Specifying a Different Output File
You can specify a different name for the output CSV file:
python metascan.py -u https://example.com -n my_metadata.csv
- This will save the metadata to my_metadata.csv instead of the default output.csv.
Output CSV
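The output CSV file contains the metadata for each PDF file in a structured format. Each row corresponds to one PDF, and the columns are the metadata fields listed under Features: Title, Author, Creator, Created, Modified, Subject, Keywords, Description, Producer, and PDF Version.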
Example CSV Output
The CSV file generated by the tool will look like the table below, with semicolons (;) separating the values:
| Title | Author | Creator | Created | Modified | Subject | Keywords | Description | Producer | PDF Version |
|---|---|---|---|---|---|---|---|---|---|
| Sample PDF | John Doe | PDF Tool | 2022-01-01 | 2022-01-05 | Report | Data | Sample file | Adobe | 1.7 |
| Example Doc | Jane Roe | PDFGen | 2023-05-03 | 2023-05-10 | Invoice | Billing | Invoice file | LibreOffice | 1.6 |
| Test PDF | None | None | 2020-12-12 | 2020-12-13 | Manual | None | User manual | Foxit | 1.4 |
| Report 2022 | Mark Smith | ReportGen | 2022-02-15 | 2022-02-16 | Annual | Report | Yearly Report | Adobe | 1.7 |
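To make the semicolon-separated layout concrete, the following sketch writes rows in this column order with Python's `csv` module. `COLUMNS` and `write_rows` are hypothetical names. Writing missing values as the literal text `None` is an assumption made to mirror the sample table above, not confirmed behavior of the tool.

```python
import csv

COLUMNS = [
    "Title", "Author", "Creator", "Created", "Modified",
    "Subject", "Keywords", "Description", "Producer", "PDF Version",
]


def write_rows(rows, stream):
    """Write a header line and one semicolon-separated line per PDF."""
    writer = csv.DictWriter(
        stream, fieldnames=COLUMNS, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # str(None) yields "None", mirroring the sample table (assumption).
        writer.writerow({column: str(row.get(column)) for column in COLUMNS})
```

When writing to disk, the stream would typically be opened with `open(path, "w", newline="", encoding="utf-8")` so that the `csv` module controls the line endings.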
Error Handling
If a PDF file cannot be read or its metadata is incomplete, the tool logs the error and moves on to the next file without stopping the run. It also prints a message in the terminal indicating which files had issues.
Example error message:
Error reading metadata from pdf_downloads/document.pdf: EOF marker not found
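The skip-and-continue behavior described in this section can be sketched as a loop that wraps each file in its own `try`/`except`. The function `scan_files` and its `extract` callback are hypothetical. Catching the broad `Exception` is a simplification for this sketch; a real implementation would likely catch narrower error types.

```python
def scan_files(paths, extract):
    """Run extract(path) on every file; report failures and keep going."""
    rows, failed = [], []
    for path in paths:
        try:
            rows.append(extract(path))
        except Exception as exc:  # broad on purpose in this sketch
            print(f"Error reading metadata from {path}: {exc}")
            failed.append(path)
    return rows, failed
```

Returning the list of failed paths alongside the rows makes it easy to summarize which files had issues once the run finishes.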