HTML to PDF with iTextSharp

I looked around for a complete example of this but could not find anything that met my requirement, which is pretty simple:

Create a PDF document from HTML code (basically a blog post). The HTML code also contains images (img tags) with relative URLs.

I did find valuable information in various places, such as Kuujinbo and Hamang.net, and was finally able to create a PDF using the following ASP.NET and C# code. I hope it helps you too.

Prerequisite

Download iTextSharp.dll and copy it into the bin folder of your website. I have used version 5.1.1.

Issues and Solutions

The new iTextSharp library is pretty good at converting HTML code to PDF by itself, but it lacks one major function: mapping relative image URLs to absolute ones. Hence, the HTMLWorker class throws an exception if your HTML code contains relative URLs.

There are two ways of solving this issue.

  1. Use the IImageProvider interface to fetch the images referenced in the HTML code and then paste them into the PDF. The issue with this approach is that the image does not carry its styles, such as the height or width that you specified in the HTML code. The second option addresses this issue too.
  2. Parse the HTML code and replace the relative URLs with absolute URLs before writing it to the PDF file. This solution preserves the height and width specified for the image in the HTML code. This is the approach that I prefer. Nevertheless, I have provided code for both options below so that you can make the final call.
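
Both options boil down to the same step: turning an img src into an absolute URL. As a minimal sketch of that step on its own, assuming you are happy to resolve the src against the URL of the page that contains it, the standard System.Uri class can do the work. The UrlResolver class and ToAbsolute method are hypothetical names for illustration; they are not part of iTextSharp.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class UrlResolver
{
    // Resolves an img src against the URL of the page that contains it,
    // following the same rules a browser applies to relative links.
    public static string ToAbsolute(Uri pageUrl, string src)
    {
        Uri absolute;

        // Uri.TryCreate(baseUri, relativeUri, out result) keeps src as it is when
        // it is already absolute and combines it with pageUrl when it is relative.
        if (Uri.TryCreate(pageUrl, src, out absolute))
        {
            return absolute.AbsoluteUri;
        }

        // Leave anything that cannot be parsed untouched.
        return src;
    }
}
```

With a page URL of http://example.com/blog/post.aspx (a placeholder), "/images/a.jpg" resolves against the site root, "images/b.jpg" resolves against the /blog/ folder, and a src that already starts with https:// comes back unchanged. The listings below build the URL by string concatenation instead, which only handles src values that start with a slash.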

Basic Set Up

Add a new page to your website, i.e. add PostToPDF_AM22.aspx and PostToPDF_AM22.aspx.cs.

Code for Option 1 (using IImageProvider):

Write the following code in PostToPDF_AM22.aspx.cs:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;

    //For converting HTML TO PDF- START
    using iTextSharp.text;
    using iTextSharp.text.html;
    using iTextSharp.text.pdf;
    using iTextSharp.text.xml;
    using iTextSharp.text.html.simpleparser;
    using System.IO;
    using System.util;
    using System.Text.RegularExpressions;
    //For converting HTML TO PDF- END

    public partial class PostToPDF_AM22 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            //Get the HTML code from your database or wherever you have stored it and store
            //it in HTMLCode variable.
            string HTMLCode = string.Empty;
            ConvertHTMLToPDF(HTMLCode);
        }

        protected void ConvertHTMLToPDF(string HTMLCode)
        {
            //Create PDF document
            Document doc = new Document(PageSize.A4);
            PdfWriter.GetInstance(doc, new FileStream(Server.MapPath("~") + "/App_Data/HTMLToPDF.pdf",
                FileMode.Create));
            doc.Open();

            //Register the image provider that HTMLWorker calls for every img tag
            var interfaceProps = new Dictionary<string, Object>();
            var ih = new ImageHander() { BaseUri = Request.Url.ToString() };
            interfaceProps.Add(HTMLWorker.IMG_PROVIDER, ih);

            //Pass interfaceProps to the parser; without it ImageHander is never used
            foreach (IElement element in HTMLWorker.ParseToList(
                new StringReader(HTMLCode), null, interfaceProps))
            {
                doc.Add(element);
            }
            doc.Close();
            Response.End();
        }

        //handle Image relative and absolute URL's
        public class ImageHander : IImageProvider
        {
            public string BaseUri;

            public iTextSharp.text.Image GetImage(string src,
                IDictionary<string, string> h,
                ChainedProperties cprops,
                IDocListener doc)
            {
                string imgPath = string.Empty;

                //Anything that does not start with http:// or https:// is treated as a
                //root-relative URL (e.g. /images/a.jpg) and prefixed with scheme and host
                if (!src.StartsWith("http://", StringComparison.OrdinalIgnoreCase) &&
                    !src.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
                {
                    imgPath = HttpContext.Current.Request.Url.Scheme + "://" +
                        HttpContext.Current.Request.Url.Authority + src;
                }
                else
                {
                    imgPath = src;
                }

                return iTextSharp.text.Image.GetInstance(imgPath);
            }
        }
    }

Code for Option 2 (using parser to convert relative URLs):

Write the following code in PostToPDF_AM22.aspx.cs:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Web;
    using System.Web.UI;
    using System.Web.UI.WebControls;

    //For converting HTML TO PDF- START
    using iTextSharp.text;
    using iTextSharp.text.html;
    using iTextSharp.text.pdf;
    using iTextSharp.text.xml;
    using iTextSharp.text.html.simpleparser;
    using System.IO;
    using System.util;
    using System.Text.RegularExpressions;
    //For converting HTML TO PDF- END

    public partial class PostToPDF_AM22 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {
            //Get the HTML code from your database or wherever you have stored it and store
            //it in HTMLCode variable.
            string HTMLCode = string.Empty;
            ConvertHTMLToPDF(HTMLCode);
        }

        protected void ConvertHTMLToPDF(string HTMLCode)
        {
            //Replace the relative image URLs with absolute ones before parsing
            string tempPostContent = getImage(HTMLCode);

            StringReader reader = new StringReader(tempPostContent);

            //Create PDF document
            Document doc = new Document(PageSize.A4);
            HTMLWorker parser = new HTMLWorker(doc);
            PdfWriter.GetInstance(doc, new FileStream(Server.MapPath("~") + "/App_Data/HTMLToPDF.pdf",
                FileMode.Create));
            doc.Open();

            try
            {
                //Parse Html and dump the result in PDF file
                parser.Parse(reader);
            }
            catch (Exception ex)
            {
                //Display parser errors in PDF.
                Paragraph paragraph = new Paragraph("Error! " + ex.Message);
                Chunk text = paragraph.Chunks[0] as Chunk;
                if (text != null)
                {
                    text.Font.Color = BaseColor.RED;
                }
                doc.Add(paragraph);
            }
            finally
            {
                doc.Close();
            }
        }

        public string getImage(string input)
        {
            if (input == null)
                return string.Empty;
            string tempInput = input;
            string pattern = @"<img[^>]*>";
            HttpContext context = HttpContext.Current;

            //Change the relative URL's to absolute URL's for an image, if any in the HTML code.
            //RightToLeft returns the last match first, so the edits below do not shift
            //the indexes of the matches that are still to be processed.
            foreach (Match m in Regex.Matches(input, pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline |
                RegexOptions.RightToLeft))
            {
                string tempM = m.Value;
                string pattern1 = "src=['\"](.+?)['\"]";
                Regex reImg = new Regex(pattern1, RegexOptions.IgnoreCase | RegexOptions.Multiline);
                Match mImg = reImg.Match(tempM);
                if (!mImg.Success)
                    continue;

                //Group 1 is the URL without its quotes; its case is preserved
                string src = mImg.Groups[1].Value;

                //Absolute URLs are left alone
                if (src.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                    src.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
                    continue;

                //Insert new URL in img tag
                string newSrc = "src=\"" + context.Request.Url.Scheme + "://" +
                    context.Request.Url.Authority + src + "\"";
                tempM = tempM.Remove(mImg.Index, mImg.Length).Insert(mImg.Index, newSrc);

                //insert new url img tag in whole html code
                tempInput = tempInput.Remove(m.Index, m.Length).Insert(m.Index, tempM);
            }
            return tempInput;
        }

        //Returns the src value of a single img tag (helper, not called above)
        string getSrc(string input)
        {
            string pattern = "src=['\"](.+?)['\"]";
            Regex reImg = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);
            Match mImg = reImg.Match(input);
            if (mImg.Success)
            {
                //Group 1 is the URL without the surrounding quotes
                return mImg.Groups[1].Value;
            }

            return string.Empty;
        }
    }
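
If you would rather not track match indexes by hand, the same rewrite can be sketched with Regex.Replace and a match evaluator. This is only an alternative sketch, not the code I used for the listing above: ImgSrcRewriter and MakeAbsolute are hypothetical names, the page URL is passed in as a parameter instead of being read from HttpContext, and a regular expression is still not a full HTML parser, so unusual markup may slip through.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class ImgSrcRewriter
{
    // "pre" is everything from <img up to the opening quote, "q" is the quote
    // character and "url" is the src value. \k<q> requires the same closing quote.
    private static readonly Regex ImgSrc = new Regex(
        @"(?<pre><img\b[^>]*?\ssrc\s*=\s*)(?<q>['""])(?<url>.*?)\k<q>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the HTML with every img src resolved against pageUrl.
    public static string MakeAbsolute(string html, Uri pageUrl)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        return ImgSrc.Replace(html, delegate(Match m)
        {
            Uri absolute;

            // Absolute URLs come back unchanged; relative ones are combined with pageUrl.
            if (!Uri.TryCreate(pageUrl, m.Groups["url"].Value, out absolute))
            {
                return m.Value; // leave the tag as it was
            }

            string quote = m.Groups["q"].Value;
            return m.Groups["pre"].Value + quote + absolute.AbsoluteUri + quote;
        });
    }
}
```

Inside the page you could call it as ImgSrcRewriter.MakeAbsolute(HTMLCode, Request.Url). Only the src value changes, so attributes such as height and width stay exactly as they were written.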

Explanation:

In both of the cases above, the function ConvertHTMLToPDF takes HTML as input and converts it to PDF format.

The result is stored in a PDF file named HTMLToPDF.pdf in your web site’s App_Data folder.

Remember that you need to write the code that fetches the HTML code from your database or any other file in the Page_Load event of the above page. Pass this HTML code to the converter function and it will create the PDF file for you.

If you face any issues, write in the comments and I will try to help you ASAP.

Software Engineer. inHouseRecipes Food Critic, Editor, Director, Anchor. Food Photographer


This story was originally published at AM22 Tech. This printer-friendly version is made available for your personal and non-commercial use only.

© Copyright 2014   AM22 Tech