
PDF Download / Scan Metadata Tool

This repository contains a Python tool that automates the process of downloading all PDF documents from a given webpage and extracting their metadata. The extracted metadata is then saved into a CSV file.

Only for Testing Purposes

This tool is intended for educational and authorized penetration testing purposes only. Unauthorized use of this tool against systems that you do not have explicit permission to test is illegal and unethical.

Features

This tool offers the following features:

  • Download PDFs: Automatically downloads all PDF files found on a given webpage.
  • Extract Metadata: Extracts metadata from each PDF, including the following fields:
    • Title
    • Author
    • Creator
    • Created (Creation Date)
    • Modified (Modification Date)
    • Subject
    • Keywords
    • Description
    • Producer
    • PDF Version
  • Save to CSV: The extracted metadata is saved into a CSV file, where each row represents one PDF document.
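The first feature, finding the PDF links on a page, can be sketched roughly as follows. This is an illustration, not the code in metascan.py: the tool lists beautifulsoup4 as a dependency, while this sketch swaps in the standard library's html.parser so it runs without extra packages. The names PdfLinkParser and find_pdf_links are hypothetical, and treating "link path ends in .pdf" as the detection rule is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collects absolute URLs of <a href> targets whose path ends in .pdf."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so a query string does not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            if absolute not in self.pdf_urls:
                self.pdf_urls.append(absolute)


def find_pdf_links(html, base_url):
    """Hypothetical helper: return the de-duplicated PDF URLs found in html."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_urls


# Assumed sample page content, for illustration only.
sample_html = """
<a href="/docs/report.pdf">Report</a>
<a href="files/Manual.PDF?download=1">Manual</a>
<a href="/about.html">About</a>
"""
for url in find_pdf_links(sample_html, "https://example.com/page/"):
    print(url)
```

In a real run, the HTML would come from fetching the page given with -u, and each URL found would then be downloaded before its metadata is read.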

Getting Started

Prerequisites

Before running the script, make sure you have the following installed:

  • Python 3.7 or higher
  • Python libraries:
    • requests
    • beautifulsoup4
    • PyPDF2

You can install these dependencies using pip:

pip install requests beautifulsoup4 PyPDF2

Installation

Clone the Repository:

git clone https://github.com/yourusername/metascan.git
cd metascan

Usage

This tool is run from the command line and takes two options.

Command-Line Options

| Option | Description | Required |
|--------|-------------|----------|
| -u | The URL of the webpage to scan for PDF files | x |
| -n | The name of the output CSV file | x |
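For readers curious how two such options might be declared, here is a minimal argparse sketch. It assumes both options are required, as the table indicates. The function name build_parser, the attribute names url and name, and the help strings are hypothetical and may not match metascan.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the -u and -n options described above."""
    parser = argparse.ArgumentParser(
        prog="metascan.py",
        description="Download PDFs from a webpage and save their metadata to CSV.",
    )
    # Both options are marked required here; that is an assumption
    # based on the options table.
    parser.add_argument("-u", dest="url", required=True,
                        help="URL of the webpage to scan for PDF files")
    parser.add_argument("-n", dest="name", required=True,
                        help="name of the output CSV file")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["-u", "https://example.com", "-n", "output.csv"])
print(args.url, args.name)
```

With required=True, argparse exits with a usage message if either option is missing.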

Examples

Download PDFs and Extract Metadata from a Webpage

To download all PDFs from a given webpage and extract their metadata:

python metascan.py -u https://example.com -n output.csv
  • This command will scan https://example.com for PDF files, download them, and save their metadata to a file called output.csv.

Specifying a Different Output File

You can specify a different name for the output CSV file:

python metascan.py -u https://example.com -n my_metadata.csv
  • This will save the metadata to my_metadata.csv instead of output.csv.

Output CSV

The output CSV file contains the metadata for each PDF file in a structured format. Each row corresponds to one PDF, and the columns are the ten metadata fields listed under Features, from Title through PDF Version.

Example CSV Output

In the generated CSV file the values are separated by semicolons (;). Shown as a table, the content looks like this:

| Title | Author | Creator | Created | Modified | Subject | Keywords | Description | Producer | PDF Version |
|-------|--------|---------|---------|----------|---------|----------|-------------|----------|-------------|
| Sample PDF | John Doe | PDF Tool | 2022-01-01 | 2022-01-05 | Report | Data | Sample file | Adobe | 1.7 |
| Example Doc | Jane Roe | PDFGen | 2023-05-03 | 2023-05-10 | Invoice | Billing | Invoice file | LibreOffice | 1.6 |
| Test PDF | None | None | 2020-12-12 | 2020-12-13 | Manual | None | User manual | Foxit | 1.4 |
| Report 2022 | Mark Smith | ReportGen | 2022-02-15 | 2022-02-16 | Annual | Report | Yearly Report | Adobe | 1.7 |
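A semicolon-separated file of this shape can be produced with Python's csv module. The sketch below is one possible approach, not necessarily how the tool writes its output: the FIELDS list follows the columns described above, while the function name write_metadata_csv and the choice to fill missing fields with the text None are assumptions.

```python
import csv
import io

# Column order taken from the fields listed in this README.
FIELDS = ["Title", "Author", "Creator", "Created", "Modified",
          "Subject", "Keywords", "Description", "Producer", "PDF Version"]


def write_metadata_csv(rows, stream):
    """Hypothetical helper: write one semicolon-separated row per PDF."""
    # restval fills in fields that are absent from a row's dictionary.
    writer = csv.DictWriter(stream, fieldnames=FIELDS, delimiter=";",
                            restval="None")
    writer.writeheader()
    writer.writerows(rows)


# One row modeled on the example above, with Author, Creator and Keywords missing.
rows = [{
    "Title": "Test PDF", "Created": "2020-12-12", "Modified": "2020-12-13",
    "Subject": "Manual", "Description": "User manual",
    "Producer": "Foxit", "PDF Version": "1.4",
}]

buffer = io.StringIO()  # in-memory stand-in for the output file
write_metadata_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline="", encoding="utf-8"); the newline="" argument is what the csv module documentation recommends for files handed to a writer.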

Error Handling

If a PDF file cannot be read or its metadata is incomplete, the tool logs the error and moves on to the next file instead of stopping the whole run. It also prints a message in the terminal naming each file that had issues.

Example error message:

Error reading metadata from pdf_downloads/document.pdf: EOF marker not found
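The "log and continue" behavior described in this section can be sketched with a try/except inside the per-file loop. This is a hedged illustration only: collect_metadata and fake_reader are hypothetical names, and fake_reader stands in for whatever PDF-reading call the tool actually makes.

```python
def collect_metadata(paths, read_metadata):
    """Read metadata for each path, skipping files that raise an error.

    Returns (rows, failed): the metadata that could be read and the
    paths that could not.
    """
    rows, failed = [], []
    for path in paths:
        try:
            rows.append(read_metadata(path))
        except Exception as exc:
            # Report the problem and keep going with the next file.
            print(f"Error reading metadata from {path}: {exc}")
            failed.append(path)
    return rows, failed


def fake_reader(path):
    """Hypothetical stand-in for a real PDF reader, used only for this demo."""
    if path.endswith("broken.pdf"):
        raise ValueError("EOF marker not found")
    return {"Title": path}


rows, failed = collect_metadata(
    ["pdf_downloads/a.pdf", "pdf_downloads/broken.pdf", "pdf_downloads/c.pdf"],
    fake_reader,
)
print(len(rows), "read,", len(failed), "failed")
```

Catching the exception per file, rather than around the whole loop, is what lets one damaged PDF be skipped without losing the results for the others.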