Issue 014
Jan 29, 2025 · 13 minute read
Here's the thing about PDFs: they're complex beasts that take quite a bit of thinking to parse properly. They come in all shapes and sizes, and they can contain many different types of data and formatting. 90% of the time we just want to extract the text from the file, but even that isn't always easy. As for the remaining 10%, we won't be covering that in this blog post.
If you've been in the Elixir world for long enough, you'll probably have tried to parse a PDF file and realised that it's not as easy as it seems. A quick look at the Elixir Forum will show you that there is no simple way to do it.
Most people will tell you to upload the file to S3 and use a Lambda to handle the contents. Offloading to AWS Lambda might seem elegant at first ("Look, Ma, no dependencies!"), but it comes with its own baggage.
These aren't ideal solutions - and software engineering is already made more complicated than it needs to be at times - we don't need to add more complexity to the mix.
We need a robust, native solution that plays nicely with the BEAM. So how do we do that?
Elixir is my favourite language, but it can't do everything. Web services, background jobs, and more are easy, but sometimes we need a little help from our friends closer to the hardware for tasks Elixir doesn't have a native solution for. That's where Rust and NIFs come in!
Rust is a systems programming language that is fast, safe, and easy to use. It's a great language for writing code that needs to be performant and reliable.
But Rust isn't just fast - it's "zero-cost abstractions" fast. What does that mean? You get high-level, ergonomic code that compiles down to something as efficient as hand-written C. For PDF parsing, where you're dealing with complex file formats and potentially large documents, this performance is a game-changer.
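As a rough illustration of what "zero-cost" means, here is a minimal sketch (the function names are made up for this example and are not part of the PDF parser). Both functions compute the same result. In an optimised release build the compiler will typically turn the iterator chain into machine code comparable to the hand-written loop, although that is an optimisation you should measure rather than a guarantee.

```rust
/// Sum of the squares of the even numbers, written as an iterator chain.
fn sum_even_squares_iter(values: &[u64]) -> u64 {
    values
        .iter()
        .filter(|v| **v % 2 == 0)
        .map(|v| v * v)
        .sum()
}

/// The same computation written as an explicit index loop, C style.
fn sum_even_squares_loop(values: &[u64]) -> u64 {
    let mut total = 0;
    let mut i = 0;
    while i < values.len() {
        if values[i] % 2 == 0 {
            total += values[i] * values[i];
        }
        i += 1;
    }
    total
}
```

The iterator version reads like a description of the problem, while the loop spells out the bookkeeping by hand.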
What is a NIF? A NIF (Native Implemented Function) is the BEAM's mechanism for letting processes call native code directly. NIFs are traditionally written in C, but with Rustler we can write them in Rust and call them from Elixir like any other function, giving us the performance benefits of Rust without sacrificing the ease of use of Elixir.
For this blog post, we're going to be using the Extractous library, which provides fast and efficient unstructured data extraction in Rust. Paired with NIFs in Elixir, it gives us a powerful setup for parsing PDFs.
First things first, ensure you have Elixir and Rust installed on your machine.
Let's begin by creating a new LiveView Elixir application that will allow users to upload a PDF file and see a breakdown of the contents. We won't be needing any database functionality for this, so we can use the --no-ecto flag to skip the database setup.
mix phx.new elixir_pdf --no-ecto
We'll also need to add the rustler dependency to our mix.exs file so we can call Rust code from Elixir.
defp deps do
  [
    # Keep the dependencies generated by phx.new and add rustler alongside them
    {:rustler, "~> 0.27.0"}
  ]
end
Once we pull down our dependencies using mix deps.get, we can use mix rustler.new to generate our new Rust project inside our codebase.
If you head to lib/elixir_pdf/<name_of_your_rust_project>.ex, you'll see that a default NIF implementation has already been generated for us, but we'll be implementing our own in the next step. I've named my Rust project rustreader for this example.
defmodule RustReader do
  use Rustler, otp_app: :elixir_pdf, crate: "rustreader"

  # Define the function that will be implemented in Rust
  def extract_pdf(_path), do: :erlang.nif_error(:nif_not_loaded)
end
Now, let's grab the extractous library and add it to our native/rustreader/Cargo.toml file so we can use it in our Rust code.
[dependencies]
rustler = "0.36.0"
extractous = "0.2.0"
With this in place, we can run cargo build from the native/rustreader directory to build our Rust code. This will also pull down the extractous library and any other dependencies.
Next we need to actually write some Rust code to implement the extract_pdf function in our native/rustreader/src/lib.rs file.
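use extractous::Extractor;
use rustler::NifResult;

// Extraction can take far longer than the roughly 1ms a regular NIF should run for,
// so this function is scheduled on a dirty CPU scheduler.
#[rustler::nif(schedule = "DirtyCpu")]
fn extract_pdf(path: String) -> NifResult<(String, String)> {
    let extractor = Extractor::new();

    match extractor.extract_file_to_string(&path) {
        // Note: `{:?}` is Rust's Debug format. For a simple map of strings it
        // looks like JSON, but it is not guaranteed to be valid JSON.
        Ok((content, metadata)) => Ok((content, format!("{:?}", metadata))),
        Err(e) => Err(rustler::Error::Term(Box::new(format!("Extraction failed: {}", e)))),
    }
}

rustler::init!("Elixir.RustReader", [extract_pdf]);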
This code will define a new instance of the Extractor struct and use it to extract the contents of the PDF file. We'll then return the contents and the metadata as a tuple.
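One thing to watch in the code above: format!("{:?}", metadata) produces Rust's Debug output. For a plain map of strings that output resembles JSON, but Debug escapes some characters (control characters, for example) in ways that are not valid JSON, so decoding it as JSON later can fail. Below is a minimal, standard-library-only sketch of encoding the metadata as real JSON. It assumes the metadata is a HashMap<String, Vec<String>>, and the helper names escape_json and metadata_to_json are hypothetical and not part of Extractous or Rustler. In a real project a crate such as serde_json would be the more usual choice.

```rust
use std::collections::HashMap;

/// Escape a string so it can sit inside a JSON string literal.
fn escape_json(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Any other control character must use the \uXXXX form in JSON.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Encode a metadata map (assumed shape: key -> list of values) as a JSON
/// object whose values are arrays of strings. Keys are sorted so the
/// output is deterministic.
fn metadata_to_json(metadata: &HashMap<String, Vec<String>>) -> String {
    let mut keys: Vec<&String> = metadata.keys().collect();
    keys.sort();

    let fields: Vec<String> = keys
        .iter()
        .map(|key| {
            let values: Vec<String> = metadata[*key]
                .iter()
                .map(|value| format!("\"{}\"", escape_json(value)))
                .collect();
            format!("\"{}\":[{}]", escape_json(key), values.join(","))
        })
        .collect();

    format!("{{{}}}", fields.join(","))
}
```

With a helper like this, the NIF could return metadata_to_json(&metadata) in place of the Debug string, which keeps the Elixir side free to decode it with a normal JSON parser.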
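The magic of the rustler::init! macro is that it generates the code needed to register our Rust functions as NIFs on the Elixir module named in its first argument, here Elixir.RustReader, so they can be called from Elixir.
Astute observers will note our use of the DirtyCpu schedule. A regular NIF is expected to return within about a millisecond, because it occupies a BEAM scheduler thread for as long as it runs. Marking the function as DirtyCpu tells Rustler and the BEAM to run it on a separate pool of dirty CPU schedulers, so a slow extraction doesn't stall the rest of the system. A function scheduled this way is known as a dirty NIF, and Rustler makes it far simpler to set up than a manual implementation in C.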
Now we need to write some simple LiveView Elixir code to allow users to upload a PDF file and then call our Rust function from the server.
defmodule ElixirPdfWeb.HomeLive do
  use ElixirPdfWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    {:ok,
     socket
     |> assign(:uploaded_files, [])
     |> allow_upload(:pdf,
       accept: ~w(.pdf),
       max_entries: 1,
       # 10MB limit
       max_file_size: 10_000_000,
       chunk_size: 64_000
     )}
  end

  @impl true
  def handle_event("validate", _params, socket) do
    {:noreply, socket}
  end

  # Handles the "×" button in the template, which sends "cancel-upload"
  @impl true
  def handle_event("cancel-upload", %{"ref" => ref}, socket) do
    {:noreply, cancel_upload(socket, :pdf, ref)}
  end

  @impl true
  def handle_event("save", _params, socket) do
    # Copy each uploaded temp file somewhere that outlives the request
    uploaded_files =
      consume_uploaded_entries(socket, :pdf, fn %{path: path}, _entry ->
        dest = Path.join(["priv", "static", "uploads", Path.basename(path)])
        File.cp!(path, dest)
        {:ok, dest}
      end)

    pdf_document =
      uploaded_files
      |> hd()

    {:noreply,
     socket
     |> assign(:pdf_document, pdf_document)
     |> update(:uploaded_files, &(&1 ++ uploaded_files))}
  end
end
Alongside this we need to add a little bit of code to our router.ex file so that requests to the root path are served by our LiveView.
scope "/", ElixirPdfWeb do
  pipe_through :browser

  live "/", HomeLive
end
We also need a simple LiveView template to allow users to upload a PDF file and see the results.
<div class="mx-auto max-w-2xl py-8">
  <div class="flex flex-col items-center justify-center">
    <h1 class="text-2xl font-bold mb-8">Upload PDF</h1>
    <form phx-submit="save" phx-change="validate" class="w-full">
      <div class="flex flex-col items-center space-y-4 w-full" phx-drop-target={@uploads.pdf.ref}>
        <div class="w-full border-2 border-dashed border-gray-300 rounded-lg p-12 text-center hover:border-gray-400 transition-colors">
          <div class="space-y-2">
            <div class="text-gray-600">
              Drag and drop your PDF here or
              <label class="cursor-pointer text-blue-500 hover:text-blue-600">
                browse <.live_file_input upload={@uploads.pdf} class="hidden" />
              </label>
            </div>
            <p class="text-xs text-gray-500">PDF files only, up to 10MB</p>
          </div>
        </div>
        <%= for entry <- @uploads.pdf.entries do %>
          <div class="w-full">
            <div class="flex items-center justify-between p-4 bg-gray-50 rounded">
              <div class="flex items-center space-x-2">
                <span class="font-medium">{entry.client_name}</span>
                <span class="text-sm text-gray-500">
                  ({entry.client_size}B)
                </span>
              </div>
              <button
                type="button"
                class="text-red-500 hover:text-red-700"
                phx-click="cancel-upload"
                phx-value-ref={entry.ref}
              >
                ×
              </button>
            </div>
            <%= for err <- upload_errors(@uploads.pdf, entry) do %>
              <div class="text-red-500 text-sm">
                {err}
              </div>
            <% end %>
          </div>
        <% end %>
        <%= if length(@uploads.pdf.entries) > 0 do %>
          <button
            type="submit"
            class="px-4 py-2 bg-blue-500 text-white rounded hover:bg-blue-600 transition-colors"
          >
            Upload
          </button>
        <% end %>
      </div>
    </form>
  </div>
</div>
So we can upload a PDF file - but let's call our Rust function and see what it returns.
In our handle_event function, we can call our Rust function as simply as this:
@impl true
def handle_event("save", _params, socket) do
  uploaded_files =
    consume_uploaded_entries(socket, :pdf, fn %{path: path}, _entry ->
      dest = Path.join(["priv", "static", "uploads", Path.basename(path)])
      File.cp!(path, dest)
      {:ok, dest}
    end)

  pdf_document =
    uploaded_files
    |> hd()
    # This is where the magic happens!
    |> RustReader.extract_pdf()

  {:noreply,
   socket
   |> assign(:pdf_document, pdf_document)
   |> update(:uploaded_files, &(&1 ++ uploaded_files))}
end
We're grabbing the first uploaded file and calling our Rust function. The result is a tuple containing the contents of the PDF and the metadata.
Let's try it out with the LiveView Cookbook PDF.
Success! We've now got a PDF parser that's fast, efficient, and written in Rust.
But that's quite hard to read, so we're not done yet. Let's make this a little nicer to work with.
Let's create a new module to handle the Jason encoding of the metadata.
defmodule ElixirPdf.PdfDocument do
  @derive {Jason.Encoder, only: [:content, :metadata]}
  defstruct [:content, :metadata]

  def from_rustler({content, metadata_json}) do
    with {:ok, metadata} <- Jason.decode(metadata_json) do
      %__MODULE__{
        content: String.trim(content),
        metadata: metadata
      }
    end
  end
end
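This decodes the metadata string returned by our NIF into an Elixir map and wraps it in a struct alongside the trimmed content, which makes it much easier to work with. If the metadata can't be decoded, from_rustler returns the {:error, reason} tuple from Jason.decode instead. Deriving Jason.Encoder means the struct can also be encoded back to JSON whenever we need it.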
All we have to do is pipe the result of our Rust function through this module and we're done!
...
RustReader.extract_pdf(@pdf_document)
|> ElixirPdf.PdfDocument.from_rustler()
...
Now when we upload a PDF file, we'll see the metadata results in a much more readable format.
Much better!
This approach is simple and effective - it's fast, efficient, and leverages the strengths of both Elixir and Rust to provide a robust solution for PDF parsing.
We're only talking about PDF files here but extractous supports a wide range of file types - so keep that in mind if you need to extract data from other file types.
Keep in mind that this is a native extension and so you'll need to build the Rust code before deploying your application. This can be done in a CI/CD pipeline or manually.
If you're using Docker, you can update the Dockerfile to build the Rust code as part of the build process and update config/prod.exs to tell Rustler to skip compilation and load the compiled NIF from where it was built in the Docker image.
Check out the Fly.io blog post for more information on how to deploy an Elixir application with Rust NIFs.
Some shoutouts are in order - firstly this blog post from Fly.io's Phoenix Files outlining how to use Rust with NIFs in Elixir. It was a key inspiration for this approach and gave me the idea to use Rust in the first place. Also check out Fly in general for some great Elixir hosting options - I use them for all my Elixir applications.
Also a shoutout for the excellent Extractous library which provides fast and efficient unstructured data extraction in Rust - it's also 25x faster than the very popular unstructured-io library.
Finally, a shoutout to the Rustler library for providing a simple way to call Rust code from Elixir!
All the code for this blog post can be found here on my GitHub if you want to clone it and run it yourself!
I hope you found this post useful. Subscribe to my Substack below for similar content, and follow me on Twitter and Bluesky for more Elixir (and general programming) tips.
If you're building a Phoenix project, I'd also encourage you to take a look at my open-source component library Bloom to help you out even further or check out my new voice to notes application to automatically tag and sync your voice to your calendar.