Using HtmlAgilityPack to GET and POST web forms

In this post I show how to make HTTP GET and POST requests from C# code without working directly with the messy HttpWebRequest and HttpWebResponse objects. My example uses an ASPX web form that has two input fields: Login and Password.


Let's assume this form is hosted at the URL http://my.site.com/login.aspx. A browser issues an HTTP GET request to this URL and receives the form back in an HTML document. After users enter their login and password and submit the form, the browser issues an HTTP POST request to the server, passing along the values entered in these two fields. But that's not all it sends. It also sends two other kinds of information:

1) Any cookies that were set by the website on the GET request. For instance, the ASP.NET_SessionId cookie which is used by the ASP.NET application to keep track of user sessions.

2) Any hidden input elements that were part of the form. For instance, the __VIEWSTATE.

Failing to send this information may cause the application to throw an error.
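
To make the two items above concrete, here is a minimal sketch of what the body of such a POST could look like. It uses only System.Net.WebUtility from the base class library. The names FormBodySketch and BuildBody are hypothetical, and the field names and __VIEWSTATE value are made up for illustration; this is not the BrowserSession code shown later in the post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;

public static class FormBodySketch
{
    // Joins name/value pairs into an application/x-www-form-urlencoded body.
    public static string BuildBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        return string.Join("&", fields.Select(f =>
            WebUtility.UrlEncode(f.Key) + "=" + WebUtility.UrlEncode(f.Value)));
    }

    public static void Main()
    {
        var fields = new List<KeyValuePair<string, string>>
        {
            // Hidden field the server rendered into the form (value is made up).
            new KeyValuePair<string, string>("__VIEWSTATE", "dDwtMTA4"),
            // Values typed in by the user.
            new KeyValuePair<string, string>("loginTextBox", "username"),
            new KeyValuePair<string, string>("passwordTextBox", "password"),
        };

        // The session cookie is not part of the body; it travels in the Cookie request header.
        Console.WriteLine(BuildBody(fields));
    }
}
```

The hidden field goes into the body alongside the user's input, while cookies such as ASP.NET_SessionId are sent separately as a request header.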

I wanted to mimic this same browser behavior in a C# application. So I created a class called BrowserSession.
public class BrowserSession { } 
The way I use this class is very similar to how I would use a browser:
BrowserSession b = new BrowserSession();
b.Get("http://my.site.com/login.aspx");
b.FormElements["loginTextBox"] = "username";
b.FormElements["passwordTextBox"] = "password";
string response = b.Post("http://my.site.com/login.aspx");
The BrowserSession class encapsulates the setting of all the cookies and hidden input elements, as well as the actual GET and POST requests to the server.

Here's the complete code for the BrowserSession class.
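// Namespaces used by the two classes in this post
using System.Collections.Generic;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public class BrowserSession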
{
    private bool _isPost;
    private HtmlDocument _htmlDoc;

    /// <summary>
    /// System.Net.CookieCollection. Provides a collection container for instances of Cookie class 
    /// </summary>
    public CookieCollection Cookies { get; set; }

    /// <summary>
    /// Provide a key-value-pair collection of form elements 
    /// </summary>
    public FormElementCollection FormElements { get; set; }

    /// <summary>
    /// Makes an HTTP GET request to the given URL
    /// </summary>
    public string Get(string url)
    {
        _isPost = false;
        CreateWebRequestObject().Load(url);
        return _htmlDoc.DocumentNode.InnerHtml;
    }

    /// <summary>
    /// Makes an HTTP POST request to the given URL
    /// </summary>
    public string Post(string url)
    {
        _isPost = true;
        CreateWebRequestObject().Load(url, "POST");
        return _htmlDoc.DocumentNode.InnerHtml;
    }

    /// <summary>
    /// Creates the HtmlWeb object and initializes all event handlers. 
    /// </summary>
    private HtmlWeb CreateWebRequestObject()
    {
        HtmlWeb web = new HtmlWeb();
        web.UseCookies = true;
        web.PreRequest = new HtmlWeb.PreRequestHandler(OnPreRequest);
        web.PostResponse = new HtmlWeb.PostResponseHandler(OnAfterResponse);
        web.PreHandleDocument = new HtmlWeb.PreHandleDocumentHandler(OnPreHandleDocument);
        return web;
    }

    /// <summary>
    /// Event handler for HtmlWeb.PreRequestHandler. Occurs before an HTTP request is executed.
    /// </summary>
    protected bool OnPreRequest(HttpWebRequest request)
    {
        AddCookiesTo(request);               // Add cookies that were saved from previous requests
        if (_isPost) AddPostDataTo(request); // We only need to add post data on a POST request
        return true;
    }

    /// <summary>
    /// Event handler for HtmlWeb.PostResponseHandler. Occurs after an HTTP response is received
    /// </summary>
    protected void OnAfterResponse(HttpWebRequest request, HttpWebResponse response)
    {
        SaveCookiesFrom(response); // Save cookies for subsequent requests
    }

    /// <summary>
    /// Event handler for HtmlWeb.PreHandleDocumentHandler. Occurs before an HTML document is handled
    /// </summary>
    protected void OnPreHandleDocument(HtmlDocument document)
    {
        SaveHtmlDocument(document);
    }

    /// <summary>
    /// Assembles the Post data and attaches to the request object
    /// </summary>
    private void AddPostDataTo(HttpWebRequest request)
    {
        string payload = FormElements.AssemblePostPayload();
        byte[] buff = Encoding.UTF8.GetBytes(payload);
        request.ContentLength = buff.Length;
        request.ContentType = "application/x-www-form-urlencoded";
        // Dispose the request stream so the body is flushed and the stream is released
        using (System.IO.Stream reqStream = request.GetRequestStream())
        {
            reqStream.Write(buff, 0, buff.Length);
        }
    }

    /// <summary>
    /// Add cookies to the request object
    /// </summary>
    private void AddCookiesTo(HttpWebRequest request)
    {
        if (Cookies != null && Cookies.Count > 0)
        {
            request.CookieContainer.Add(Cookies);
        }
    }

    /// <summary>
    /// Saves cookies from the response object to the local CookieCollection object
    /// </summary>
    private void SaveCookiesFrom(HttpWebResponse response)
    {
        if (response.Cookies.Count > 0)
        {
            if (Cookies == null)  Cookies = new CookieCollection(); 
            Cookies.Add(response.Cookies);
        }
    }

    /// <summary>
    /// Saves the form elements collection by parsing the HTML document
    /// </summary>
    private void SaveHtmlDocument(HtmlDocument document)
    {
        _htmlDoc = document;
        FormElements = new FormElementCollection(_htmlDoc);
    }
}
For the HTTP requests to the server I am using HtmlAgilityPack. HtmlAgilityPack is a .NET code library that parses HTML files and builds a read/write DOM that supports plain XPath or XSLT. It also has a class, HtmlWeb, that fetches HTML documents from the web by providing a nice wrapper around the HttpWebRequest and HttpWebResponse objects.
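
As a small illustration of the XPath support, the sketch below loads a made-up HTML fragment and lists its hidden inputs. HiddenFieldSketch and the markup are hypothetical; the calls used (LoadHtml, SelectNodes, GetAttributeValue) are part of HtmlAgilityPack, but behavior may vary slightly between versions of the library.

```csharp
using System;
using HtmlAgilityPack;

public static class HiddenFieldSketch
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up markup standing in for a downloaded page.
        doc.LoadHtml(
            "<html><body>" +
            "<input type='hidden' name='__VIEWSTATE' value='dDwtMTA4' />" +
            "<input type='text' name='loginTextBox' />" +
            "</body></html>");

        // SelectNodes returns null, not an empty list, when nothing matches.
        HtmlNodeCollection hidden = doc.DocumentNode.SelectNodes("//input[@type='hidden']");
        if (hidden == null) return;

        foreach (HtmlNode node in hidden)
        {
            Console.WriteLine(node.GetAttributeValue("name", "") + " = " +
                              node.GetAttributeValue("value", ""));
        }
    }
}
```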

I also have a helper class, FormElementCollection, which is a key-value-pair collection of all the form elements in the current HTML document.
/// <summary>
/// Represents a combined list and collection of Form Elements.
/// </summary>
public class FormElementCollection : Dictionary<string, string>
{
    /// <summary>
    /// Constructor. Parses the HtmlDocument to get all form input elements. 
    /// </summary>
    public FormElementCollection(HtmlDocument htmlDoc)
    {
        var inputs = htmlDoc.DocumentNode.Descendants("input");
        foreach (var element in inputs)
        {
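            string name = element.GetAttributeValue("name", "");
            string value = element.GetAttributeValue("value", "");
            // Skip unnamed inputs. The indexer avoids the ArgumentException that Add
            // throws when a name repeats (e.g. radio buttons); the last value wins.
            if (name.Length > 0) this[name] = value;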
        }
    }

    /// <summary>
    /// Assembles all form elements and values to POST. Also URL-encodes the names and values.
    /// </summary>
    public string AssemblePostPayload()
    {
        StringBuilder sb = new StringBuilder();
        foreach (var element in this)
        {
            if (sb.Length > 0) sb.Append("&");
            // Encode the key too: ASP.NET control names often contain '$'
            sb.Append(System.Web.HttpUtility.UrlEncode(element.Key));
            sb.Append("=");
            sb.Append(System.Web.HttpUtility.UrlEncode(element.Value));
        }
        return sb.ToString(); // Empty string when there are no elements
    }
}
Notice that FormElementCollection only collects the <input> tags. A form can contain other elements, such as <select> and <textarea>. I have left them out of this example, but the class can easily be extended to include them.
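
One possible shape for that extension is sketched below. It is an assumption-laden illustration, not the code from Part 2: ExtraFormFieldsSketch and Collect are hypothetical names, multi-selects are ignored, and an <option> without a value attribute yields an empty string here, whereas a browser would submit the option's text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ExtraFormFieldsSketch
{
    // Collects name/value pairs for <textarea> and single-choice <select> elements.
    public static Dictionary<string, string> Collect(HtmlDocument doc)
    {
        var fields = new Dictionary<string, string>();

        foreach (HtmlNode area in doc.DocumentNode.Descendants("textarea"))
        {
            string name = area.GetAttributeValue("name", "");
            // The text between the tags is the value; decode entities such as &amp;
            if (name.Length > 0) fields[name] = HtmlEntity.DeEntitize(area.InnerText);
        }

        foreach (HtmlNode select in doc.DocumentNode.Descendants("select"))
        {
            string name = select.GetAttributeValue("name", "");
            if (name.Length == 0) continue;

            List<HtmlNode> options = select.Descendants("option").ToList();
            // Prefer the option marked "selected"; otherwise fall back to the first one.
            HtmlNode chosen = options.FirstOrDefault(o => o.Attributes["selected"] != null)
                              ?? options.FirstOrDefault();
            if (chosen != null) fields[name] = chosen.GetAttributeValue("value", "");
        }

        return fields;
    }
}
```

The resulting pairs could then be copied into a FormElementCollection before calling Post, since that class is just a Dictionary<string, string>.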

[update 4/19/2010]: I wrote a Part 2 of this article to show how to extend this class to include checkboxes, radio buttons, dropdowns and textareas.

So there is my simple C# browser. I use it primarily to do simple integration tests of my web applications.

10 comments:

box said...

Interesting. :)

Does the BrowserSession class support HTTPS?

box said...

I got an error like this:

"Using the generic type 'System.Collections.Generic.Dictionary' requires '2' type arguments"

I guess the "public class FormElementCollection : Dictionary" line isn't right.

Rohit said...

Yes. I did test it with HTTPS and it works. And sorry about the error. I forgot to escape angle brackets for the Dictionary declaration, and they didn't show up in the post. I have updated the post to fix it. Thanks! :)

box said...

Thank you very much for sharing this class with other people. :)

Oleg said...

Thank you very much! It saved me a lot of time! Thanks a lot!!

Anonymous said...

You are the best

Anonymous said...

Hi,

Nice class, how do I do a post under Windows Phone 7.

I've modified the code to work with the Phone.

However, doing the POST is not working. There is no "POST" parameter.

I would also like to know how to use your library to click a button. Is that possible?

Also can I call a script.

Not sure if cookies are gonna work either :-(

Nick

Anonymous said...

Nice post.

I tried to login with "https://premier.ticketek.com.au/membership/login.aspx"

It does not allow me to log in. Can you please check it? It is urgent.

my id is adeelahmad@msn.com

Anonymous said...

What do I do on subsequent page requests? How do I reuse the cookies that were received after the first call using ASP.NET? That is, I make a call to a remote login page and get logged in, and the remote login page sends back cookies. How do I save them and keep sending them on subsequent requests without logging in again? I can't figure out how or where to save the remote cookies.

I am a programmer based in Seattle, WA. This is a space where I put notes from my programming experience, reading and training. You can contact me at rohit@rohit.cc.