Issue 014

Parsing PDFs (and more) in Elixir using Rust

A Love Story Between Two Amazing Languages 🦀💜

Jan 29, 2025 · 13 minute read

Here's the thing about PDFs: they're complex beasts that take real thought to parse properly. They come in all shapes and sizes, and they can contain many different types of data and formatting. 90% of the time we just want to extract the text from the file, and even that isn't always easy. The remaining 10% we won't be covering in this blog post.

If you've been in the Elixir world for long enough, you've probably tried to parse a PDF file and realised that it's not as easy as it seems. A quick look at the Elixir Forum will show you that there is no simple way to do it.

Most people will tell you to upload the file to S3 and use a Lambda to handle the contents. Offloading to AWS Lambda might seem elegant at first ("Look, Ma, no dependencies!"), but it comes with its own baggage:

You're adding network latency to what should be a simple operation
AWS costs can spiral if you're processing lots of PDFs
You're now dependent on external services for core functionality
Debugging becomes a distributed systems problem

This isn't an ideal solution. Software engineering is already more complicated than it needs to be at times, and we don't need to add more complexity to the mix.

We need a robust, native solution that plays nicely with the BEAM. So how do we do that?

Enter the crabs!

Elixir is my favourite language, but it can't do everything. Web services, background jobs, and more are easy, but for tasks where Elixir has no native solution we sometimes need a little help from our friends closer to the hardware. That's where Rust and NIFs come in!

Rust is a systems programming language that is fast, safe, and easy to use. It's a great language for writing code that needs to be performant and reliable.

But Rust isn't just fast - it's "zero-cost abstractions" fast. What does that mean? You get high-level, ergonomic code that compiles down to something as efficient as hand-written C. For PDF parsing, where you're dealing with complex file formats and potentially large documents, this performance is a game-changer.
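To make "zero-cost abstractions" a little more concrete, here is a small sketch that has nothing to do with PDFs. Both functions (the names are made up for this illustration) compute the same value; the first uses iterator adapters, the second a hand-written loop. In optimised builds the compiler will typically turn both into very similar machine code, though that is an optimisation you can usually rely on rather than a hard guarantee.

```rust
/// Sum of the squares of the even numbers, written with iterator adapters.
pub fn sum_even_squares_iter(values: &[u64]) -> u64 {
    values
        .iter()
        .filter(|&&v| v % 2 == 0) // keep only the even values
        .map(|&v| v * v)          // square each one
        .sum()                    // add them up
}

/// The same computation written as an explicit loop.
pub fn sum_even_squares_loop(values: &[u64]) -> u64 {
    let mut total = 0;
    for &v in values {
        if v % 2 == 0 {
            total += v * v;
        }
    }
    total
}
```

Calling either function with &[1, 2, 3, 4] gives 20 (4 + 16). The iterator version reads like a description of the result, and you don't pay for that readability at runtime.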

What is a NIF? A NIF (Native Implemented Function) is a function written in native code (traditionally C, here Rust via Rustler) that the BEAM loads and lets processes call as if it were an ordinary Elixir function. That gives you the performance benefits of Rust without sacrificing the ease of use of Elixir.

For this blog post, we're going to be using the Extractous library, which provides fast and efficient unstructured data extraction in Rust. Paired with a NIF on the Elixir side, it gives us a powerful setup for parsing PDFs.

The Setup

First things first, ensure you have Elixir and Rust installed on your machine.

Let's begin by creating a new LiveView Elixir application that will allow users to upload a PDF file and see a breakdown of the contents. We won't be needing any database functionality for this so we can use the --no-ecto flag to skip the database setup.

mix phx.new elixir_pdf --no-ecto

We'll also need to add the rustler dependency to our mix.exs file so we can call Rust code from Elixir.

defp deps do
  [
    # Keep this in step with the rustler crate version in Cargo.toml
    {:rustler, "~> 0.36.0"}
  ]
end

Once we pull down our dependencies using mix deps.get, we can use mix rustler.new to generate our new Rust project in our code.

The generated crate comes with a default NIF implementation, but we'll be writing our own in the next step. On the Elixir side, the module at lib/elixir_pdf/<name_of_your_rust_project>.ex loads the crate and declares a stub for each NIF. I've named my Rust project rustreader for this example.

defmodule RustReader do
  use Rustler, otp_app: :elixir_pdf, crate: "rustreader"

  # Define the function that will be implemented in Rust
  def extract_pdf(_path), do: :erlang.nif_error(:nif_not_loaded)
end

Now, let's grab the extractous library and add it to our native/rustreader/Cargo.toml file so we can use it in our Rust code.

[dependencies]
rustler = "0.36.0"
extractous = "0.2.0"

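With this in place, we can run cargo build from the native/rustreader directory to build our Rust code. This will also pull down the extractous library and any other dependencies. (Rustler also compiles the crate for you whenever you run mix compile.)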

The fun part - writing some code

Next we need to actually write some Rust code to implement the extract_pdf function in our native/rustreader/src/lib.rs file.

use extractous::Extractor;
use rustler::NifResult;

#[rustler::nif(schedule = "DirtyCpu")]
fn extract_pdf(path: String) -> NifResult<(String, String)> {
    let extractor = Extractor::new();

    match extractor.extract_file_to_string(&path) {
        Ok((content, metadata)) => Ok((content, format!("{:?}", metadata))),
        Err(e) => Err(rustler::Error::Term(Box::new(format!("Extraction failed: {}", e))))
    }
}

rustler::init!("Elixir.RustReader", [extract_pdf]);

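This code creates a new instance of the Extractor struct and uses it to extract the contents of the PDF file. We then return the contents and the metadata as a tuple of two strings. Note that the metadata is rendered with Rust's Debug formatter ({:?}), so it arrives in Elixir as a single string rather than a structured value.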
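It's worth seeing what that Debug-formatted metadata string looks like, because we'll want to parse it on the Elixir side later. The sketch below is a stand-alone illustration, not part of the NIF: it assumes the metadata is a map from strings to lists of strings, uses a BTreeMap so the key order is predictable, and the helper names are invented for the example.

```rust
use std::collections::BTreeMap;

/// Renders metadata the same way the NIF does: with the `Debug` formatter.
/// The map type here is an assumption made for this illustration.
pub fn debug_render(metadata: &BTreeMap<String, Vec<String>>) -> String {
    format!("{:?}", metadata)
}

/// Builds a one-entry metadata map to experiment with.
pub fn single_entry(key: &str, value: &str) -> BTreeMap<String, Vec<String>> {
    let mut metadata = BTreeMap::new();
    metadata.insert(key.to_string(), vec![value.to_string()]);
    metadata
}
```

For plain values the output happens to be valid JSON: debug_render(&single_entry("title", "LiveView Cookbook")) gives {"title": ["LiveView Cookbook"]}. That is a coincidence of the Debug format rather than a guarantee. A value containing a control character such as U+0007 is rendered with Rust's \u{7} escape, which a JSON parser will reject. If you need something sturdier, serialising the metadata with a JSON crate such as serde_json (not part of this post's setup) would be the safer route.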

The magic of the rustler::init! macro is that it will automatically generate the necessary code to call the Rust function from Elixir.

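You may have noticed the schedule = "DirtyCpu" attribute. The BEAM expects a regular NIF to return within about a millisecond, because a NIF that runs longer ties up one of the schedulers that every other process depends on. Marking the function as DirtyCpu tells Rustler to register it as a dirty NIF, which the BEAM runs on a separate pool of dirty CPU schedulers so that a long extraction doesn't stall the rest of the system. With Rustler this is a one-line annotation, which is considerably simpler than wiring it up by hand in C.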

Now we need to write some simple LiveView code to allow users to upload a PDF file and then call our Rust function from the server.

defmodule ElixirPdfWeb.HomeLive do
  use ElixirPdfWeb, :live_view

  @impl true
  def mount(_params, _session, socket) do
    {:ok,
     socket
     |> assign(:uploaded_files, [])
     |> allow_upload(:pdf,
       accept: ~w(.pdf),
       max_entries: 1,
       # 10MB limit
       max_file_size: 10_000_000,
       chunk_size: 64_000
     )}
  end

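  @impl true
  def handle_event("validate", _params, socket) do
    {:noreply, socket}
  end

  # Handles the "cancel-upload" button in the template
  @impl true
  def handle_event("cancel-upload", %{"ref" => ref}, socket) do
    {:noreply, cancel_upload(socket, :pdf, ref)}
  end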

  @impl true
  def handle_event("save", _params, socket) do
    uploaded_files =
      consume_uploaded_entries(socket, :pdf, fn %{path: path}, _entry ->
        dest = Path.join(["priv", "static", "uploads", Path.basename(path)])
        # Make sure the uploads directory exists before copying into it
        File.mkdir_p!(Path.dirname(dest))
        File.cp!(path, dest)
        {:ok, dest}
      end)

    pdf_document =
      uploaded_files
      |> hd()

    {:noreply,
     socket
     |> assign(:pdf_document, pdf_document)
     |> update(:uploaded_files, &(&1 ++ uploaded_files))}
  end
end

Alongside this we need to add a little bit of code to our router.ex file so that requests to the root path are routed to our new LiveView.

scope "/", ElixirPdfWeb do
  pipe_through :browser

  live "/", HomeLive
end

We also need a simple LiveView template to allow users to upload a PDF file and see the results.

<div class="mx-auto max-w-2xl py-8">
  <div class="flex flex-col items-center justify-center">
    <h1 class="text-2xl font-bold mb-8">Upload PDF</h1>

    <form phx-submit="save" phx-change="validate" class="w-full">
      <div class="flex flex-col items-center space-y-4 w-full" phx-drop-target={@uploads.pdf.ref}>
        <div class="w-full border-2 border-dashed border-gray-300 rounded-lg p-12 text-center hover:border-gray-400 transition-colors">
          <div class="space-y-2">
            <div class="text-gray-600">
              Drag and drop your PDF here or
              <label class="cursor-pointer text-blue-500 hover:text-blue-600">
                browse <.live_file_input upload={@uploads.pdf} class="hidden" />
              </label>
            </div>
            <p class="text-xs text-gray-500">PDF files only, up to 10MB</p>
          </div>
        </div>

        <%= for entry <- @uploads.pdf.entries do %>
          <div class="w-full">
            <div class="flex items-center justify-between p-4 bg-gray-50 rounded">
              <div class="flex items-center space-x-2">
                <span class="font-medium">{entry.client_name}</span>
                <span class="text-sm text-gray-500">
                  ({entry.client_size}B)
                </span>
              </div>

              <button
                type="button"
                class="text-red-500 hover:text-red-700"
                phx-click="cancel-upload"
                phx-value-ref={entry.ref}
              >
                &times;
              </button>
            </div>

            <%= for err <- upload_errors(@uploads.pdf, entry) do %>
              <div class="text-red-500 text-sm">
                {err}
              </div>
            <% end %>
          </div>
        <% end %>

        <%= if length(@uploads.pdf.entries) > 0 do %>
          <button
            type="submit"
            class="px-4 py-2 bg-blue-500 text-white rounded hover:bg-blue-600 transition-colors"
          >
            Upload
          </button>
        <% end %>
      </div>
    </form>
  </div>
</div>

Putting it all together

So we can upload a PDF file - but let's call our Rust function and see what it returns.

In our handle_event function, we can call our Rust function as simply as this:

  @impl true
  def handle_event("save", _params, socket) do
    uploaded_files =
      consume_uploaded_entries(socket, :pdf, fn %{path: path}, _entry ->
        dest = Path.join(["priv", "static", "uploads", Path.basename(path)])
        # Make sure the uploads directory exists before copying into it
        File.mkdir_p!(Path.dirname(dest))
        File.cp!(path, dest)
        {:ok, dest}
      end)

    pdf_document =
      uploaded_files
      |> hd()
      # This is where the magic happens!
      |> RustReader.extract_pdf()

    {:noreply,
     socket
     |> assign(:pdf_document, pdf_document)
     |> update(:uploaded_files, &(&1 ++ uploaded_files))}
  end

We're grabbing the first uploaded file and calling our Rust function. The result is a tuple containing the contents of the PDF and the metadata.

Let's try it out with the LiveView Cookbook PDF.

[Image: Unstructured]

Success! We've now got a PDF parser that's fast, efficient, and written in Rust.

But that's quite hard to read, so we're not done yet. Let's make this a little nicer to work with.

Let's create a new module to handle decoding the metadata with Jason.

defmodule ElixirPdf.PdfDocument do
  @derive {Jason.Encoder, only: [:content, :metadata]}
  defstruct [:content, :metadata]

  def from_rustler({content, metadata_json}) do
    with {:ok, metadata} <- Jason.decode(metadata_json) do
      %__MODULE__{
        content: String.trim(content),
        metadata: metadata
      }
    end
  end
end

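from_rustler/1 decodes the metadata string returned by the NIF into a map and wraps it, together with the trimmed content, in a struct that's easier to work with. Deriving Jason.Encoder means the struct can also be encoded back to JSON for display. If Jason.decode/1 can't parse the metadata string, the with expression returns the {:error, reason} tuple unchanged, so callers should be prepared for that.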

All we have to do is pipe the result of our Rust function through this module and we're done!

...
uploaded_files
|> hd()
|> RustReader.extract_pdf()
|> ElixirPdf.PdfDocument.from_rustler()
...

Now when we upload a PDF file, we'll see the metadata results in a much more readable format.

[Image: Structured]

Much better!

This approach is simple and effective - it's fast, efficient, and leverages the strengths of both Elixir and Rust to provide a robust solution for PDF parsing.

We're only talking about PDF files here but extractous supports a wide range of file types - so keep that in mind if you need to extract data from other file types.

What about deployment?

Keep in mind that this is a native extension and so you'll need to build the Rust code before deploying your application. This can be done in a CI/CD pipeline or manually.

If you're using Docker, you can update the Dockerfile to build the Rust code as part of the build process and update config/prod.exs to tell Rustler to skip compilation and load the compiled NIF from where it was built in the Docker image.

Check out the Fly.io blog post for more information on how to deploy an Elixir application with Rust NIFs.

Shoutouts

Some shoutouts are in order - firstly this blog post from Fly.io's Phoenix Files outlining how to use Rust with NIFs in Elixir. It was a key inspiration for this approach and gave me the idea to use Rust in the first place. Also check out Fly in general for some great Elixir hosting options - I use them for all my Elixir applications.

Also a shoutout for the excellent Extractous library which provides fast and efficient unstructured data extraction in Rust - it's also 25x faster than the very popular unstructured-io library.

Finally, a shoutout to the Rustler library for providing a simple way to call Rust code from Elixir!

All the code for this blog post can be found here on my Github if you want to clone it and run it yourself!

I hope you found this post useful, subscribe to my Substack below for similar content and follow me on Twitter and Bluesky for more Elixir (and general programming) tips.

If you're building a Phoenix project, I'd also encourage you to take a look at my open-source component library Bloom to help you out even further or check out my new voice to notes application to automatically tag and sync your voice to your calendar.
