HTML to PDF using iTextSharp Library In Dot Net


Sample code for converting HTML to PDF with the iTextSharp library, including support for image tags in the HTML.


I have been looking around for a perfect example to do this for me but could not find anything which correctly solved my requirement. My requirement is pretty simple:

Create a PDF document from HTML code (basically a blog post). The HTML code also contains images (img tags) with relative URLs.

I did find valuable information in various places, such as Kuujinbo and Hamang.net, and was finally able to create a PDF using the following ASP.NET and C# code. I hope it will help you too.

Pre-Requisite

Download iTextSharp.dll and copy it into the bin folder of your website. I have used version 5.1.1.

Issues and Solutions

The new iTextSharp library is pretty good at converting HTML code to PDF by itself, but it lacks one major function: mapping the relative URLs of images to absolute ones. Hence, the HTMLWorker class will throw an exception if your HTML code contains relative URLs.

There are two ways of solving this issue.

  1. Use the IImageProvider interface to fetch the images referenced in the HTML code and then place them in the PDF. The issue with this approach is that the image does not carry over styles such as the height or width that you specified in the HTML code. The second option addresses this issue too.
  2. Parse the HTML code and replace the relative URLs with absolute URLs before writing it to the PDF file. This solution preserves the height and width specified for the image in the HTML code. This is the approach that I prefer. Nevertheless, I have provided code for both options below so that you can make the final call.
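
The step shared by both options is turning a relative image URL into an absolute one. A minimal sketch of that step using the .NET System.Uri class; the page URL and image paths are made-up examples, and UrlDemo and ToAbsolute are hypothetical names that are not part of iTextSharp:

```csharp
using System;

public static class UrlDemo
{
    // Resolves src against the URL of the page that contains the img tag.
    // A src that is already absolute is returned as an absolute URL as well.
    public static string ToAbsolute(Uri pageUrl, string src)
    {
        return new Uri(pageUrl, src).AbsoluteUri;
    }

    public static void Main()
    {
        // Hypothetical page URL, used only for illustration
        Uri page = new Uri("http://www.example.com/blog/post.aspx");

        // Root-relative path: http://www.example.com/images/logo.png
        Console.WriteLine(ToAbsolute(page, "/images/logo.png"));

        // Path relative to the page: http://www.example.com/blog/images/logo.png
        Console.WriteLine(ToAbsolute(page, "images/logo.png"));

        // Already absolute: http://cdn.example.com/a.png
        Console.WriteLine(ToAbsolute(page, "http://cdn.example.com/a.png"));
    }
}
```

Resolving against the page URL rather than only the host name also covers paths that do not start with a slash.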

Basic Set Up

Add a new page to your website, i.e. add PostToPDF_AM22.aspx and PostToPDF_AM22.aspx.cs.

Code for Option 1 (using IImageProvider):

Write the following code in PostToPDF_AM22.aspx.cs:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;

    //For converting HTML TO PDF- START
    using iTextSharp.text;
    using iTextSharp.text.html;
    using iTextSharp.text.pdf;
    using iTextSharp.text.xml;
    using iTextSharp.text.html.simpleparser;
    using System.IO;
    using System.util;
    using System.Text.RegularExpressions;
    //For converting HTML TO PDF- END

    public partial class PostToPDF_AM22 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            //Get the HTML code from your database or wherever you have stored it and store
            //it in HTMLCode variable.
            string HTMLCode = string.Empty;
            ConvertHTMLToPDF(HTMLCode);
        }

        protected void ConvertHTMLToPDF(string HTMLCode)
        {
            //Create PDF document
            Document doc = new Document(PageSize.A4);
            string pdfPath = Server.MapPath("~/App_Data/HTMLToPDF.pdf");

            using (FileStream output = new FileStream(pdfPath, FileMode.Create))
            {
                PdfWriter.GetInstance(doc, output);
                doc.Open();

                //Register the image handler so that the parser asks it for every img tag
                var interfaceProps = new Dictionary<string, Object>();
                var ih = new ImageHander() { BaseUri = Request.Url };
                interfaceProps.Add(HTMLWorker.IMG_PROVIDER, ih);

                //Pass interfaceProps to the parser; without it the handler is never called
                foreach (IElement element in HTMLWorker.ParseToList(
                    new StringReader(HTMLCode), null, interfaceProps))
                {
                    doc.Add(element);
                }

                //Closing the document finishes the PDF file on disk
                doc.Close();
            }

            Response.End();
        }

        //handle Image relative and absolute URLs
        public class ImageHander : IImageProvider
        {
            //URL of the page that relative image paths are resolved against
            public Uri BaseUri;

            public iTextSharp.text.Image GetImage(string src,
                IDictionary<string, string> h,
                ChainedProperties cprops,
                IDocListener doc)
            {
                //An absolute src (http or https) is kept as it is;
                //a relative src is resolved against BaseUri.
                string imgPath = new Uri(BaseUri, src).AbsoluteUri;

                return iTextSharp.text.Image.GetInstance(imgPath);
            }
        }
    }

Code for Option 2 (using parser to convert relative URLs):

Write the following code in PostToPDF_AM22.aspx.cs:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;

    //For converting HTML TO PDF- START
    using iTextSharp.text;
    using iTextSharp.text.html;
    using iTextSharp.text.pdf;
    using iTextSharp.text.xml;
    using iTextSharp.text.html.simpleparser;
    using System.IO;
    using System.util;
    using System.Text.RegularExpressions;
    //For converting HTML TO PDF- END

    public partial class PostToPDF_AM22 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            //Get the HTML code from your database or wherever you have stored it and store
            //it in HTMLCode variable.
            string HTMLCode = string.Empty;
            ConvertHTMLToPDF(HTMLCode);
        }

        protected void ConvertHTMLToPDF(string HTMLCode)
        {
            //Replace the relative image URLs in the content with absolute ones
            string tempPostContent = getImage(HTMLCode);

            //Create PDF document
            Document doc = new Document(PageSize.A4);
            HTMLWorker parser = new HTMLWorker(doc);
            string pdfPath = Server.MapPath("~/App_Data/HTMLToPDF.pdf");

            using (FileStream output = new FileStream(pdfPath, FileMode.Create))
            using (StringReader reader = new StringReader(tempPostContent))
            {
                PdfWriter.GetInstance(doc, output);
                doc.Open();

                try
                {
                    //Parse Html and dump the result in PDF file
                    parser.Parse(reader);
                }
                catch (Exception ex)
                {
                    //Display parser errors in PDF.
                    Paragraph paragraph = new Paragraph("Error! " + ex.Message);
                    Chunk text = paragraph.Chunks[0] as Chunk;
                    if (text != null)
                    {
                        text.Font.Color = BaseColor.RED;
                    }
                    doc.Add(paragraph);
                }
                finally
                {
                    doc.Close();
                }
            }
        }

        //Change the relative URLs to absolute URLs for every image in the HTML code.
        public string getImage(string input)
        {
            if (input == null)
                return string.Empty;

            Uri pageUrl = HttpContext.Current.Request.Url;

            //Group 1: the img tag up to and including "src=", group 2: the quote
            //character, group 3: the URL between the quotes.
            string pattern = @"(<img\b[^>]*?\bsrc\s*=\s*)(['""])(.+?)\2";

            return Regex.Replace(input, pattern, delegate(Match m)
            {
                Uri absolute;
                if (!Uri.TryCreate(pageUrl, m.Groups[3].Value, out absolute))
                {
                    //Leave the tag untouched if the URL cannot be parsed
                    return m.Value;
                }

                //Absolute URLs stay absolute; relative ones are resolved against
                //the current request. The case of the path is preserved.
                string quote = m.Groups[2].Value;
                return m.Groups[1].Value + quote + absolute.AbsoluteUri + quote;
            }, RegexOptions.IgnoreCase);
        }

        //Returns the value of the src attribute in the given tag, or an empty string.
        string getSrc(string input)
        {
            Match mImg = Regex.Match(input, @"src\s*=\s*(['""])(.+?)\1", RegexOptions.IgnoreCase);
            return mImg.Success ? mImg.Groups[2].Value : string.Empty;
        }
    }

Explanation:

In both the above cases, there is a function ConvertHTMLToPDF which takes HTML as input and converts it to PDF format.

The result is stored in a PDF file named HTMLToPDF.pdf in your web site’s App_Data folder.

Remember that you need to write the code that fetches the HTML code from your database or any other file in the Page_Load event of the above page. Pass this HTML code to the converter function and it will create the PDF file for you.

If you face any issue, do write in the comments and I will try to help you ASAP.


  • shekhar

    doc.Close() throws error… “doc does not contain a page”.

  • shekhar

    Also the class HTMLWorker is obsolete…

  • Hi Shekhar,

    Yes, you may be right. This article and its code are pretty old now and the iTextSharp library may have changed since then.
    We have not updated the code for the newer versions.

    You can try replacing the HTMLWorker class with the relevant new class from iTextSharp.

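  • doc.Close() throws this error when nothing was added to the document, for example when HTMLCode is empty. (In the sample above, Page_Load passes string.Empty until you load your own HTML.)

    You can add a statement like
    if (!string.IsNullOrEmpty(HTMLCode))
    {
    ConvertHTMLToPDF(HTMLCode);
    }
    to counter that scenario.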

  • Jvniolz

    Hi Anil,
    Sorry, I have a problem. I’m using only this code in my project:

    protected void ConvertHTMLToPDF(string HTMLCode)
    {
    HttpContext context = HttpContext.Current;
    System.IO.StringWriter stringWrite = new StringWriter();
    System.Web.UI.HtmlTextWriter htmlWrite = new HtmlTextWriter(stringWrite);
    StringReader reader = new StringReader(HTMLCode);
    //Create the PDF document
    Document doc = new Document(PageSize.A4);
    HTMLWorker parser = new HTMLWorker(doc);
    PdfWriter.GetInstance(doc, new FileStream("C://HTMLToPDF.pdf", FileMode.Create));
    doc.Open();

    But when I try to open the file I get this error message: “The file is damaged and couldn’t be repaired”. I can see the file in “C://” but I can’t open it. Can you please help ASAP? I am trying to generate a PDF file when I click an ImageButton.
    P.S.: Sorry for my English, it may not be the best :D!
    Thanks.

  • Hi,

    I see 2 issues with your code:

    1. Are you running this code as a web application or a Windows application? If it is a web application, you have to build the path from your site’s root with the Server.MapPath function, and the application needs write permission on that folder.

    Use the following:
    PdfWriter.GetInstance(doc, new FileStream(Server.MapPath("~") + "/HTMLToPDF.pdf", FileMode.Create));

    2. FileMode.Create creates the file or overwrites an existing one, so the file mode itself is not the problem. A PDF that cannot be opened usually means that the document was never finished: the code you posted stops after doc.Open(), so make sure that you add the content and call doc.Close() before you try to open the file.

    Let me know if you need more information.
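
    The difference between the two file modes can be checked without iTextSharp. A small sketch using only System.IO; the file name FileModeDemo.pdf and the class name are hypothetical and chosen for illustration:

    ```csharp
    using System;
    using System.IO;

    public static class FileModeDemo
    {
        public static void Main()
        {
            string path = Path.Combine(Path.GetTempPath(), "FileModeDemo.pdf");
            File.WriteAllText(path, "old content");

            // FileMode.Create truncates and overwrites an existing file
            using (FileStream fs = new FileStream(path, FileMode.Create))
            {
                Console.WriteLine(fs.Length); // 0
            }

            try
            {
                // FileMode.CreateNew refuses to touch an existing file
                using (new FileStream(path, FileMode.CreateNew)) { }
            }
            catch (IOException)
            {
                Console.WriteLine("CreateNew failed because the file exists");
            }

            File.Delete(path);
        }
    }
    ```

    The using blocks also close the stream, which releases the lock on the file so that other programs can open it.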

  • Eduardo Campos Castillo

    Thanks for your answer.

    I am using a web application. I have two questions:

    1. I need to add some control so that the client can choose where to save the file. Is that possible? Can you give an example, please?

    2. Do I need to change anything in my code to solve point two? Please help me.

    P.S.: If you can help me by Skype, WhatsApp or any other way, please tell me.

    Thanks so much.

  • Eduardo Campos Castillo

    Hi anil,

    I tried the new code but I cannot open the file now, because I get this error: “The file is open or is being used by another application”. I can see the file and it has 0 bytes. Can you help me, please?

    Thanx again.

  • Eduardo Campos Castillo

    I have an additional question: I need to add some control or code so that the client can choose where to save the file. Is that possible? Can you give an example, please? Thanks.

  • Eduardo Campos Castillo

    Hi Anil! I got the solution! Thanks! Can you help me with this: is there any method to get the HTML code of the page where I am now? I need to add a button to download all the information that the client is viewing on a specific page.

    Please help me again! Thanks!

  • You need to look into the HttpWebResponse class to fetch the HTML code of a web page. Read about this class and you will find relevant code on the internet as well.

    Check out this link:
    http://msdn.microsoft.com/en-us/library/system.net.httpwebresponse.getresponsestream(v=vs.110).aspx
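
    A minimal sketch of that approach, assuming the page is publicly reachable and encoded as UTF-8; HtmlFetcher, FetchHtml and the example URL are hypothetical names used for illustration:

    ```csharp
    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    public static class HtmlFetcher
    {
        // Downloads the HTML of the given URL and returns it as a string.
        public static string FetchHtml(string url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);

            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }

        public static void Main()
        {
            // Placeholder URL; replace it with the page you want to convert
            string html = FetchHtml("http://www.example.com/");
            Console.WriteLine(html.Length);
        }
    }
    ```

    Note that this is a new request made by the server, so it does not carry the visitor’s session or login, and the HTML may differ from what the client sees in the browser.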

  • Eduardo Campos Castillo

    Hi Anil, can you please help me again?

    I could export a PDF and get the HTML source of my web page, but I have a problem. My variable HTMLCode holds the HTML source as a string value (I am sure about it because I saved the value of HTMLCode into a txt file, changed the extension to html, and got the web page when I opened the file). But when I export the PDF, I only get this text in my PDF:
    RepJunior.cb {FONT-FAMILY: “WASP 39 MC”;FONT-SIZE:24pt;FONT-WEIGHT:normal}

    Can you please help me solve this problem? I cannot understand why, if I have the correct HTML source, I only get this text in my PDF when I pass the HTML code to the reader.

    I tested my project with simple HTML source, like a header and some text, and I got a PDF with this information. Please help, and tell me if you need anything more.

    thanx again.

  • This looks like a problem with the HTML parser of the iTextSharp library.
    The embedded CSS within the HTML code is what causes the iTextSharp parser to fail.

    Try this:
    Remove all the embedded CSS from the HTML code and then run it again. I am sure it will create the PDF.
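
    One way to try this is to strip the CSS with regular expressions before passing the HTML to the parser. This is only a rough sketch: CssStripper and StripCss are hypothetical names, and regular expressions can miss unusual markup, so treat it as a starting point.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class CssStripper
    {
        public static string StripCss(string html)
        {
            if (string.IsNullOrEmpty(html))
                return string.Empty;

            // Remove <style>...</style> blocks, including their content
            string result = Regex.Replace(html, @"<style\b[^>]*>.*?</style\s*>", string.Empty,
                RegexOptions.IgnoreCase | RegexOptions.Singleline);

            // Remove inline style="..." and style='...' attributes
            result = Regex.Replace(result, @"\sstyle\s*=\s*(""[^""]*""|'[^']*')", string.Empty,
                RegexOptions.IgnoreCase);

            return result;
        }

        public static void Main()
        {
            string html = "<style>.cb {font-size:24pt}</style><p style=\"color:red\">Hello</p>";
            Console.WriteLine(StripCss(html)); // <p>Hello</p>
        }
    }
    ```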

  • Rupesh

    Hi Anil,

    Is there any way to convert an HTML document with embedded CSS to PDF, using the iTextSharp HTMLWorker or any other option?

    Please update me, it’s urgent.

    Thanks,

  • Hi Rupesh,
    I did try parsing the CSS but it does not work. In fact, I did not use this method of converting HTML to PDF in my project after writing the code.
    It did work for the HTML part but does not apply the CSS at all.

    The other approach that I used was to use the browser’s own print function. Once the user clicks the print button on a page, the browser provides the option of saving it as a PDF, which maintains all the CSS decoration. This is implemented using a specific print.css file and media queries. You can see it in action in the above article (click the print button on the left-hand side).

    Of course, this method works only for an individual web page. If your requirement is to automate it on the server side, I really do not know if there is any other open source and free library available at this time.

    Let me know if you need more information.

  • C0D3R

    Hi Anil, I have been searching for a proper HTML to PDF solution for quite a long time and now I have found you. I need to implement it for a Windows application. How do I proceed?

  • You can start by adding the iTextSharp DLL to your Windows application as a reference and then use its classes as shown in the above code.

    Let me know if you face any issue there.

  • Mukesh Sagar

    How can I display an image stored as a base64 data URI and save it in the PDF file?

  • Hi Mukesh,
    You probably need code (or a library) that can read the base64 image data and convert it into an image. Then you can embed it in the PDF, maybe using the above code.
    I have not tried this myself.
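
    The decoding part can be done with the .NET Framework itself. A minimal sketch, assuming the src value has the usual data:[type];base64,[data] form; DataUriDemo and DecodeDataUri are hypothetical names:

    ```csharp
    using System;

    public static class DataUriDemo
    {
        // Returns the decoded bytes of a base64 data URI, or null if src is not one.
        public static byte[] DecodeDataUri(string src)
        {
            const string marker = ";base64,";
            if (src == null || !src.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                return null;

            int index = src.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
            if (index < 0)
                return null;

            return Convert.FromBase64String(src.Substring(index + marker.Length));
        }

        public static void Main()
        {
            // "SGVsbG8=" is the base64 form of the ASCII text "Hello"
            byte[] bytes = DecodeDataUri("data:text/plain;base64,SGVsbG8=");
            Console.WriteLine(bytes.Length); // 5
        }
    }
    ```

    Inside an IImageProvider such as the ImageHander class above, the resulting byte array could then be passed to iTextSharp.text.Image.GetInstance(byte[]) instead of a URL.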

  • elaine

    Hello Anil, I’m trying to convert an ASP.NET web page to PDF using iTextSharp. I just get “HTMLWorker obsolete” and it does not work. Is there any workaround? My iTextSharp version is 5.5.3.

  • I think the HTMLWorker class has been marked as obsolete in the newer versions.

    You can use the XMLWorker class instead. I will have to update the above article for the new class. Will plan to do it soon.
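
    Until then, here is a rough sketch of the XMLWorker route. It assumes that the separate itextsharp.xmlworker assembly is referenced next to iTextSharp 5.5.x and that the input is well-formed XHTML; XmlWorkerDemo and ConvertXhtmlToPdf are hypothetical names, and I have not tested it against every version.

    ```csharp
    using System.IO;
    using iTextSharp.text;
    using iTextSharp.text.pdf;
    using iTextSharp.tool.xml;

    public static class XmlWorkerDemo
    {
        // Converts a well-formed XHTML string into a PDF file at pdfPath.
        public static void ConvertXhtmlToPdf(string xhtml, string pdfPath)
        {
            using (FileStream output = new FileStream(pdfPath, FileMode.Create))
            {
                Document doc = new Document(PageSize.A4);
                PdfWriter writer = PdfWriter.GetInstance(doc, output);
                doc.Open();

                using (StringReader reader = new StringReader(xhtml))
                {
                    // XMLWorkerHelper takes the place of HTMLWorker.Parse
                    XMLWorkerHelper.GetInstance().ParseXHtml(writer, doc, reader);
                }

                // Closing the document finishes the PDF file
                doc.Close();
            }
        }
    }
    ```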

  • rupesh

    Hi Anil,
    I am facing a problem parsing an XML data file using XSLT, because of an absolute path in the XML data.
    eg:
    test.xml

    C:\temp\image1.jpg
    image2.jpg

    test.xslt

    (here I don’t know how to read the image1.jpg path)

    Could you please help me with how to proceed?
    If anyone has an idea, please reply.

    Thanks,

  • I am not an expert in XSLT, and in fact I have never worked with it.
    But your question sounds like matching a regex against a file path.

    If I am correct, then you should use (.*) for matching the path.

  • Sampath Kumar

    Anil Thank you So much brother………………………..

  • Sampath Kumar

    perfect with
    Option 2 (using parser to convert relative URL’s):

  • You are welcome

  • Sampath Kumar

    Hi,Anil,

    A few days back the code worked fine.
    Today, when I run the same code, it throws an exception when it enters
    parser.Parse(reader);

    The error message is: The remote server returned an error: (404) Not Found.

    Any suggestion or solution for this error, please?
    Thank you.

  • Hi Sampath,

    The error is about the HTML code that you are trying to fetch and not your HTML to PDF code. A 404 error is a web error which says that the link (or URL) being requested is NOT available. Since the exception is raised inside parser.Parse, it may also come from an image URL in the HTML that no longer exists, because the images are downloaded while the PDF is being built.

    You should check the web page link and the image links that you are trying to fetch.

  • Sampath Kumar

    Tq Anil

  • sanjiv

    Hi Anil,

    I have following code:

    string HTMLCode = "Link Nameis a link to another nifty siteThis is a HeaderThis is a Medium HeaderSend me mail at support@yourcompany.com. This is a new paragraph! This is a new paragraph!This is a new sentence without a paragraph break, in bold italics.";

    I am getting error: Error!Object reference not set to an instance of an object.
    Please help me.

    Regards,
    Sanjiv

  • Hi Sanjiv,
    I can’t solve this error by just looking at the HTML string (you have shared only the string and nothing else).

    The error simply says that one of the objects (a variable) in your code has not been properly instantiated.

  • Thanuja

    When I use XMLWorker instead of the HTML parser, I get a “document has no pages” error.

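  • XMLWorker expects well-formed XHTML, for example closed tags and quoted attributes. If your HTML code is not well-formed, the parser may not be able to work with it, and a document without any content gives the “no pages” error.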

  • santhosh doddamallappa shaku

    The HTML has images, but after parser.Parse the PDF doesn’t have the images in it.

  • santhosh doddamallappa shaku

    Hi Anil,

    When debugging, I can see the image in the tempInput variable, but after parsing, the PDF is missing the image. Please help.

This story was originally published at AM22 Tech. This printer-friendly version is made available for your personal and non-commercial use only.
  
© Copyright 2017   AM22 Tech