Read PDF file using Javascript

Last Updated : 23/09/2022

Posted By :- vikas_jk

In previous posts we explained how to read Excel using Javascript and how to read CSV using Javascript. This post provides a working example that reads the content of a PDF file in JavaScript, using PDF.js to extract the text.

Read PDF text using JavaScript

As stated above, we will use pdf.js to read the PDF file. The example uses pdf.js version 1.10, which is stable and easy to use. These are the steps we will take to read the PDF contents:

First, the PDF file contents are converted into a byte array (a Uint8Array backed by an ArrayBuffer)
The array is passed to PDF.js, which loads the document using getDocument()
Each page is retrieved using getPage()
The text of each page is extracted using getTextContent(), which exposes it as textContent.items

Let's begin by adding the required JavaScript files and creating the HTML needed to browse for a PDF file:

    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.min.js" ></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.worker.min.js" ></script>
    <input type="file" id="file-id" name="file_name" onchange="ExtractText();">


    <!-- a container for the output -->
    <div id="output"></div>

Once a file has been browsed and selected, the onchange handler calls the JS function ExtractText().

Here is the complete JavaScript code that will be used:

        var datass = '';
        var DataArr = [];
        PDFJS.workerSrc = '';

        function ExtractText() {
            var input = document.getElementById("file-id");
            var fReader = new FileReader();
            fReader.readAsDataURL(input.files[0]);
            // console.log(input.files[0]);
            fReader.onloadend = function (event) {
                convertDataURIToBinary(event.target.result);
            }
        }

        var BASE64_MARKER = ';base64,';

        function convertDataURIToBinary(dataURI) {

            var base64Index = dataURI.indexOf(BASE64_MARKER) + BASE64_MARKER.length;
            var base64 = dataURI.substring(base64Index);
            var raw = window.atob(base64);
            var rawLength = raw.length;
            var array = new Uint8Array(new ArrayBuffer(rawLength));

            for (var i = 0; i < rawLength; i++) {
                array[i] = raw.charCodeAt(i);
            }
            pdfAsArray(array)

        }

        function getPageText(pageNum, PDFDocumentInstance) {
            // Return a Promise that resolves once the text of the page has been retrieved.
            // Returning the chain directly lets getPage/getTextContent errors reach the caller.
            return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
                // The main trick to obtain the text of the PDF page: use the getTextContent method
                return pdfPage.getTextContent();
            }).then(function (textContent) {
                var textItems = textContent.items;
                var finalString = "";

                // Concatenate the string of each item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    finalString += textItems[i].str + " ";
                }

                // Resolve with the text retrieved from the page
                return finalString;
            });
        }

        function pdfAsArray(pdfData) {

            PDFJS.getDocument(pdfData).then(function (pdf) {

                // Create an array that will contain one promise per page
                var pagesPromises = [];

                // PDF.js page numbers start at 1
                for (var pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
                    // Store the promise of getPageText that returns the text of a page
                    pagesPromises.push(getPageText(pageNumber, pdf));
                }

                // Wait for all the promises; returning the chain routes page errors to the handler below
                return Promise.all(pagesPromises).then(function (pagesText) {

                    // pagesText holds one string per page,
                    // e.g ["Text content page 1", "Text content page 2", "Text content page 3" ... ]
                    console.log(pagesText);
                    var div = document.getElementById('output');

                    for (var pageNum = 0; pageNum < pagesText.length; pageNum++) {
                        var outputStr = "<br/><br/>Page " + (pageNum + 1) + " contents <br/> <br/>";

                        div.innerHTML += (outputStr + pagesText[pageNum]);
                    }
                });

            }).catch(function (reason) {
                // PDF loading or text extraction error
                console.error(reason);
            });
        }
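
The data URI conversion step can be tried on its own. Below is a minimal sketch of the same decoding, assuming an environment where atob() and btoa() are available (browsers, and recent Node.js versions). The name dataUriToBytes and the sample string are made up for illustration and are not part of PDF.js.

```javascript
// Hypothetical helper (not part of PDF.js): decode a base64 data URI into bytes.
function dataUriToBytes(dataURI) {
    var marker = ';base64,';
    var markerIndex = dataURI.indexOf(marker);
    if (markerIndex === -1) {
        throw new Error('Expected a base64-encoded data URI');
    }
    // atob() returns a binary string: one character per byte
    var raw = atob(dataURI.substring(markerIndex + marker.length));
    var bytes = new Uint8Array(raw.length);
    for (var i = 0; i < raw.length; i++) {
        bytes[i] = raw.charCodeAt(i);
    }
    return bytes;
}

// Usage with made-up data: "%PDF-1.4" is how a PDF file header typically starts.
var sample = 'data:application/pdf;base64,' + btoa('%PDF-1.4');
var bytes = dataUriToBytes(sample);
console.log(bytes.length, String.fromCharCode(bytes[0], bytes[1], bytes[2], bytes[3])); // 8 %PDF
```

If the base64 round trip is not needed, FileReader.readAsArrayBuffer() gives the file contents as an ArrayBuffer directly, which can be wrapped with new Uint8Array(...) before it is passed on.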

This is the sample PDF which we will use to test this example. It has 2 pages, as shown in the image below.

[Image: the two pages of the sample PDF file]

Most parts of the code are explained in its comments.
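
The page loop can also be exercised without a browser. The sketch below is a simplified variant of the extraction logic, not the exact code of this article: it relies only on numPages, getPage() and getTextContent(), and it runs against a stand-in document object. The names joinTextItems, getAllPagesText and makeFakePdf, and the sample strings, are assumptions made for illustration.

```javascript
// Join the `str` field of every text item of a page into one string.
function joinTextItems(textContent) {
    return textContent.items.map(function (item) { return item.str; }).join(' ');
}

// Resolve with an array holding the text of every page, in page order.
// `pdf` only needs numPages and getPage(n); pages only need getTextContent().
function getAllPagesText(pdf) {
    var pagePromises = [];
    for (var pageNum = 1; pageNum <= pdf.numPages; pageNum++) {
        pagePromises.push(
            pdf.getPage(pageNum)
                .then(function (page) { return page.getTextContent(); })
                .then(joinTextItems)
        );
    }
    return Promise.all(pagePromises);
}

// Stand-in document (hypothetical data) so the sketch runs without PDF.js.
function makeFakePdf(pages) {
    return {
        numPages: pages.length,
        getPage: function (n) {
            return Promise.resolve({
                getTextContent: function () {
                    return Promise.resolve({
                        items: pages[n - 1].map(function (s) { return { str: s }; })
                    });
                }
            });
        }
    };
}

var fakePdf = makeFakePdf([['Hello', 'world'], ['Second', 'page']]);
var pagesTextPromise = getAllPagesText(fakePdf);
pagesTextPromise.then(function (pagesText) {
    console.log(pagesText); // [ 'Hello world', 'Second page' ]
});
```

Unlike the article's loop, join(' ') does not leave a trailing space after the last item of a page.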

The complete HTML/JavaScript looks like this:

<html>
<body>

    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.worker.min.js"></script>
    <input type="file" id="file-id" name="file_name" onchange="ExtractText();">


    <!-- a container for the output -->
    <div id="output"></div>

    <script>
        var datass = '';
        var DataArr = [];
        PDFJS.workerSrc = '';

        function ExtractText() {
            var input = document.getElementById("file-id");
            var fReader = new FileReader();
            fReader.readAsDataURL(input.files[0]);
            // console.log(input.files[0]);
            fReader.onloadend = function (event) {
                convertDataURIToBinary(event.target.result);
            }
        }

        var BASE64_MARKER = ';base64,';

        function convertDataURIToBinary(dataURI) {

            var base64Index = dataURI.indexOf(BASE64_MARKER) + BASE64_MARKER.length;
            var base64 = dataURI.substring(base64Index);
            var raw = window.atob(base64);
            var rawLength = raw.length;
            var array = new Uint8Array(new ArrayBuffer(rawLength));

            for (var i = 0; i < rawLength; i++) {
                array[i] = raw.charCodeAt(i);
            }
            pdfAsArray(array)

        }

        function getPageText(pageNum, PDFDocumentInstance) {
            // Return a Promise that resolves once the text of the page has been retrieved.
            // Returning the chain directly lets getPage/getTextContent errors reach the caller.
            return PDFDocumentInstance.getPage(pageNum).then(function (pdfPage) {
                // The main trick to obtain the text of the PDF page: use the getTextContent method
                return pdfPage.getTextContent();
            }).then(function (textContent) {
                var textItems = textContent.items;
                var finalString = "";

                // Concatenate the string of each item to the final string
                for (var i = 0; i < textItems.length; i++) {
                    finalString += textItems[i].str + " ";
                }

                // Resolve with the text retrieved from the page
                return finalString;
            });
        }

        function pdfAsArray(pdfData) {

            PDFJS.getDocument(pdfData).then(function (pdf) {

                // Create an array that will contain one promise per page
                var pagesPromises = [];

                // PDF.js page numbers start at 1
                for (var pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
                    // Store the promise of getPageText that returns the text of a page
                    pagesPromises.push(getPageText(pageNumber, pdf));
                }

                // Wait for all the promises; returning the chain routes page errors to the handler below
                return Promise.all(pagesPromises).then(function (pagesText) {

                    // pagesText holds one string per page,
                    // e.g ["Text content page 1", "Text content page 2", "Text content page 3" ... ]
                    console.log(pagesText);
                    var div = document.getElementById('output');

                    for (var pageNum = 0; pageNum < pagesText.length; pageNum++) {
                        var outputStr = "<br/><br/>Page " + (pageNum + 1) + " contents <br/> <br/>";

                        div.innerHTML += (outputStr + pagesText[pageNum]);
                    }
                });

            }).catch(function (reason) {
                // PDF loading or text extraction error
                console.error(reason);
            });
        }

    </script>
</body>

</html>

Once this is done, open the page in a browser and select a PDF file. The text of each page is written to the output div under a "Page N contents" heading.

Complete Fiddle sample

As the example output shows, we were able to extract the contents of the PDF using JavaScript and display all of its text.
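
One thing to keep in mind: the extracted text is written to the page through innerHTML, so characters such as < or & inside the PDF text are interpreted as HTML. A minimal sketch of escaping the text first is shown here. The helper names escapeHtml and renderPage are hypothetical and are not part of the code above or of PDF.js.

```javascript
// Hypothetical helper: replace the characters that are special in HTML.
function escapeHtml(text) {
    return String(text)
        .replace(/&/g, '&amp;')   // must run first so later entities are not re-escaped
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#39;');
}

// Build the same "Page N contents" markup as the article, with escaped page text.
function renderPage(pageIndex, pageText) {
    return '<br/><br/>Page ' + (pageIndex + 1) + ' contents <br/> <br/>' + escapeHtml(pageText);
}

console.log(renderPage(0, 'a < b & "c"'));
// <br/><br/>Page 1 contents <br/> <br/>a &lt; b &amp; &quot;c&quot;
```

In the article's loop this would mean writing div.innerHTML += renderPage(pageNum, pagesText[pageNum]); instead of concatenating the raw page text.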

Comments

ViniciusSantos

Thanks.

3/9/2022 12:47:14 AM

vikas_jk

Hello, Thanks for your query.

This is the part of code which is fetching Text

 for (var pageNum = 0; pageNum < pagesText.length; pageNum++) {
                        console.log(pagesText[pageNum]);
                        outputStr = "";
                        outputStr = "<br/><br/>Page " + (pageNum + 1) + " contents <br/> <br/>";

                        var div = document.getElementById('output');

                        div.innerHTML += (outputStr + pagesText[pageNum]);

                    }

Modify this line of code according to your requirements.

div.innerHTML += (outputStr + pagesText[pageNum]); // pagesText[pageNum] holds the text of one page.

Thanks.

3/9/2022 10:54:24 AM

RickHellewell

Interesting code and concepts. Your source files work well (the Fiddle helped).

But not sure what to change if I want to read a PDF on my site. Setting this line doesn't work:

var pdffile='test.pdf'; // file in same folder as php file

fReader.readAsDataURL(pdffile);

gives me an error in the console

Uncaught ReferenceError: require is not defined

<anonymous>

https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.6.347/pdf.worker.entry.min.js:1

How can I use a PDF on the server to read, rather than a local file?

Thanks...

7/1/2022 10:54:38 PM
Edited at :- 7/1/2022 10:54:57 PM

vikas_jk

Can you double-check that you have referenced all the pdfjs library files correctly?

    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.6.347/pdf.worker.entry.min.js" ></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/1.10.100/pdf.worker.min.js"></script>

Once you have done this, please check if you are able to get file data in your JS Code.

Just check file data or contents

            var input = document.getElementById("file-id");
            var fReader = new FileReader();
            fReader.readAsDataURL(input.files[0]);

Since the file is already on the server, I think you will need to provide the complete path.

var pdfFile = "../server-directory/appdirectory/pdfFile.pdf"

I believe the file is not being referenced correctly on the server.

7/2/2022 11:21:52 AM

RickHellewell

I have doublechecked all the includes, they are as specified.

On page load, the error ('require' is not defined) is shown in the console:

Uncaught ReferenceError: require is not defined

<anonymous> https://cdnjs.cloudflare.com/ajax/libs/pdf.js/2.6.347/pdf.worker.entry.min.js:1

This happens even before the file is read, although a local PDF file is processed properly, even with that error.

I changed the line to access a server file:

Set up the $pdf_file before the script:

$pdf_file = __DIR__.'/pdfs/winchester.pdf';

Change inside the script to use the full pdf file defined above

var pdffile = "<?php echo $pdf_file;?>";
console.log(pdffile);

The console.log shows the full path of the pdf file, which is verified to exist at that full path.

Using a button to start the conversion, I still get the above error (button code shown below)

<button id='readthefile' name='readthefile' onclick='ExtractText();'>read the file</button>

The code works for local file, but not for a file on the server.

7/2/2022 10:27:37 PM
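
On the question above about reading a PDF that already sits on the server: FileReader.readAsDataURL() only accepts a File or Blob object, so passing it a path string fails, and a PHP filesystem path built from __DIR__ is not a URL that the browser can request. The "require is not defined" message is attributed by the console output to the pdf.worker.entry.min.js include, and since local files are still processed despite it, it appears to be a separate issue. A minimal sketch of one way to load a server file, assuming the PDF is reachable over HTTP at a URL relative to the page, is to download the bytes with fetch() and pass them to the pdfAsArray() function from the article. The URL 'pdfs/winchester.pdf', the name fetchPdfBytes and the fetchFn parameter are assumptions for illustration.

```javascript
// Hypothetical helper: download a PDF by URL and resolve with its bytes.
// `fetchFn` defaults to the global fetch; it is a parameter only so that
// the sketch can be exercised without a network.
function fetchPdfBytes(url, fetchFn) {
    var doFetch = fetchFn || fetch;
    return doFetch(url).then(function (response) {
        if (!response.ok) {
            throw new Error('Could not load ' + url + ' (HTTP ' + response.status + ')');
        }
        return response.arrayBuffer();
    }).then(function (buffer) {
        return new Uint8Array(buffer);
    });
}

// Stub fetch (made-up data) standing in for the server.
function fakeFetch(url) {
    var bytes = new Uint8Array([0x25, 0x50, 0x44, 0x46]); // "%PDF"
    return Promise.resolve({
        ok: true,
        status: 200,
        arrayBuffer: function () { return Promise.resolve(bytes.buffer); }
    });
}

var serverBytesPromise = fetchPdfBytes('pdfs/winchester.pdf', fakeFetch);
serverBytesPromise.then(function (bytes) {
    console.log(bytes.length); // 4
    // In the page, the bytes would be handed over with: pdfAsArray(bytes);
});
```

In the page itself the second argument would be omitted so that the browser's own fetch() is used. PDFJS.getDocument() also accepts a URL string directly, which avoids the manual download when no other processing of the bytes is needed.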
