How to merge PDF documents using Python
It’s easy to merge PDF documents using Python code, if you use the right tools!
Nowadays PDF format is the most popular file format to share documents on the internet. Many software is available to manage PDF documents, but when you need to perform some easy task on them, it is not so easy to find the best one to use. One example is merging multiple PDF files into one file.
In this guide, we will show how you can merge many PDF documents into one file using a simple Python script. In particular, to merge two PDF documents in Python we need to:
- Install PyPDF2;
- retreive all the input PDFs inside a folder and subfolder;
- write the Python code to merge PDF documents
Let’s now see in the following sections how to do it in more details.
Prerequisites
First, we need to make sure that our development environment meets all the minimum requirements we need. In particular, we should already have installed:
- Python >= 3.x
- Pip (Python’s package installer)
After we met all the minimum requirements, we can move on to the next steps.
Install PyPDF2 using Pip
In Python, there are different libraries that allow us to manipulate PDF documents, and some of them are pretty complex to use. In this guide, we use PyPDF2, which is a simple Python library that we can use also to merge multiple PDF documents.
We can use Pip, the Python’s package installer, to install PyPDF2. To do so, we simply need to run the following command:
python3 -m pip install PyPdf2==1.26.0
Merging Pdf Documents inside a Subfolder
When you have to merge lots of PDF documents, it is really annoying to pass all the absolute paths of the files. So, to save time in coding the paths we can just put all the PDF files in a parent directory, and then Python will handle automatically all the paths of the PDF files for us.
To do so, make sure your project structure looks like this:
\main.py
\parent_folder
\folder_1
file_1.pdf
file_2.pdf
\folder_2
file_3.pdf
\folder_3
\folder_4
file_4.pdf
file_5.pdf
file_6.pdf
Note that all the PDF files should be inside the \parent_folder
.
Create the Python script to merge PDF documents
After setting up the environment, it’s finally time to code our script.
Python’s code to extract all the files under the Parent_folder
To get all the paths of the PDF files under the parent folder we can just use the following code:
import os
# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
target_files = []
for path, subdirs, files in os.walk(parent_folder):
for name in files:
target_files.append(os.path.join(path, name))
return target_files
# get a list of all the paths of the pdf
extracted_files = fetch_all_files('./parent_folder')
Basically, we use the fetch_all_files
function to recursively find all PDF files located into the parent_folder
and it’s subfolders.
Python’s code to Merge the PDF documents
Now that we have the list of all the PDF files paths, we can use the following code to merge them into a unique PDF file:
from PyPDF2 import PdfFileMerger
# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
merger = PdfFileMerger()
for pdf in extracted_files:
merger.append(pdf)
merger.write(out_path)
merger.close()
merge_pdf('./final.pdf', extracted_files)
Complete code to merge PDFs in Python
Simple as that! We can now put all the final code in a ./main.py
file:
#main.py
import os
from PyPDF2 import PdfFileMerger
# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
target_files = []
for path, subdirs, files in os.walk(parent_folder):
for name in files:
target_files.append(os.path.join(path, name))
return target_files
# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
merger = PdfFileMerger()
for pdf in extracted_files:
merger.append(pdf)
merger.write(out_path)
merger.close()
# get a list of all the paths of the pdf
parent_folder_path = './parent_folder'
outup_pdf_path = './final.pdf'
extracted_files = fetch_all_files(parent_folder_path)
merge_pdf(outup_pdf_path, extracted_files)
To run the script, you just need to type python3 main.py
in your terminal!
Conclusions
In this article, we have learned to merge different PDF files into a unique PDF using Python code. We used the simple Python library PyPDF2 to manipulate the PDFs and we wrote some code to collect multiple PDFs paths under different subfolders.
For more details about what you can do with PyPDF2, you can check its official documentation: