C# Spider with HttpWebRequest

I was recently tasked with writing code to send HTTP POST requests to a website backed by a CGI application. The CGI application in turn sets a configuration file that sends text streams to be displayed on TVs at the Kennedy Space Center. This component was built in support of the Weather Warning Application.

Weather warnings are a big deal at KSC because the center is vast and many people work outdoors for a good part of the day. Long story short, the code below shows a simple web request call. I make an initial call to a login page and retain the cookie to maintain state on all subsequent requests. After that, I continue hitting the required configuration HTML form pages, passing in the required form field values via the strPost variable. I hope this code helps anyone struggling with this.

FYI: All params are brought in from web.config.
(Notice the ConfigurationSettings.AppSettings["Key"]; calls)

The value for each key looks something like this:
loginId=foo&password=bar
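
The example value above is already safe to send as-is. If a field value ever contains characters such as & or =, it needs to be percent-encoded before it goes into strPost. The following is a minimal sketch of one way to build such a string; FormEncoder and Encode are hypothetical names that are not part of the original code, and Uri.EscapeDataString writes a space as %20 rather than +, which I am assuming the receiving server accepts.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of the original post): builds an
// application/x-www-form-urlencoded body from name/value pairs.
public static class FormEncoder
{
    public static string Encode(IEnumerable<KeyValuePair<string, string>> fields)
    {
        StringBuilder sb = new StringBuilder();
        foreach (KeyValuePair<string, string> field in fields)
        {
            // Separate pairs with '&', but do not emit a leading '&'.
            if (sb.Length > 0)
            {
                sb.Append('&');
            }

            // Percent-encode both the name and the value so that reserved
            // characters inside them cannot be mistaken for separators.
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value));
        }
        return sb.ToString();
    }
}
```

Passing the pairs loginId/foo and password/bar to Encode produces the same string shown above, so the result could be stored in web.config or assigned to strPost directly.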

public void setTvCrawlDisplay(String tvText, DateTime tvExpiration)
{
   //JFB - Calculate minutes for warning display.    
   DateTime currentDateTime = DateTime.Now;

   //JFB - Difference in days, hours, and minutes.    
   TimeSpan tvTimeSpan = tvExpiration - currentDateTime;
      
   //JFB - Difference in minutes.    
   int differenceInMinutes = (int) tvTimeSpan.TotalMinutes;   
   
   String url = ConfigurationSettings.AppSettings["URL"];
   String strPost = ConfigurationSettings.AppSettings["LoginPassword"];
   StreamWriter myWriter = null;
   CookieContainer myContainer = new CookieContainer();
   
   //Request #1 (the login)    
   HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(url);
   objRequest.Method = "POST";
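   //ContentLength is a byte count, so measure the encoded body (StreamWriter writes UTF-8 by default).
   objRequest.ContentLength = System.Text.Encoding.UTF8.GetByteCount(strPost);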
   objRequest.ContentType = "application/x-www-form-urlencoded";         
   objRequest.CookieContainer = new CookieContainer();

   try
   {
      myWriter = new StreamWriter(objRequest.GetRequestStream());
      myWriter.Write(strPost);
   }
   catch (Exception e)
   {
      Console.WriteLine(e.Message);
   }
   finally
   {
      //myWriter is still null if GetRequestStream() threw.
      if (myWriter != null)
      {
         myWriter.Close();
      }
   }
      
   HttpWebResponse objResponse = (HttpWebResponse)objRequest.GetResponse();
   //retain the cookies    
   foreach (Cookie cook in objResponse.Cookies)
   {
      myContainer.Add(cook);
   }
   
   //Check out the html.    
   using (StreamReader sr =
          new StreamReader(objResponse.GetResponseStream()) )
   {
      String test = sr.ReadToEnd();

      // Close and clean up the StreamReader       
      sr.Close();
   }
   
   //Request #2 (select the proper submenu)    
   objRequest = (HttpWebRequest)WebRequest.Create(url);
   strPost = ConfigurationSettings.AppSettings["EditMenu"];
   objRequest.Method = "POST";
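   //ContentLength is a byte count, so measure the encoded body (StreamWriter writes UTF-8 by default).
   objRequest.ContentLength = System.Text.Encoding.UTF8.GetByteCount(strPost);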
   objRequest.ContentType = "application/x-www-form-urlencoded";         
   objRequest.CookieContainer = myContainer;

   try
   {
      myWriter = new StreamWriter(objRequest.GetRequestStream());
      myWriter.Write(strPost);
   }
   catch (Exception e)
   {
      Console.WriteLine(e.Message);
   }
   finally
   {
      myWriter.Close();
   }
      
   objResponse = (HttpWebResponse)objRequest.GetResponse();
   
   //Check out the html.
   
   using (StreamReader sr =
          new StreamReader(objResponse.GetResponseStream()) )
   {
      String test = sr.ReadToEnd();

      // Close and clean up the StreamReader       
      sr.Close();
   }
}
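
Both requests above repeat the same steps: create the request, write the form body, and read the response. Below is a minimal sketch of how those steps could be folded into one reusable method. FormPoster, BuildFormPost, and PostForm are hypothetical names that are not part of the original code. The sketch also assumes that a single CookieContainer is attached to every request, in which case the framework stores the cookies from each response in that container and no manual copying is needed.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Hypothetical helper (not part of the original post): wraps the
// "build request, write body, read response" steps used above.
public static class FormPoster
{
    // Builds the POST request without sending it. The caller's cookie
    // container is attached directly, so cookies set by the server are
    // stored in it and replayed on later requests that share it.
    public static HttpWebRequest BuildFormPost(string url, byte[] body, CookieContainer cookies)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = body.Length;   // length in bytes, not characters
        request.CookieContainer = cookies;
        return request;
    }

    // Sends the form and returns the response body as text.
    public static string PostForm(string url, string formData, CookieContainer cookies)
    {
        byte[] body = Encoding.UTF8.GetBytes(formData);
        HttpWebRequest request = BuildFormPost(url, body, cookies);

        // Write the encoded form fields as the request body.
        using (Stream requestStream = request.GetRequestStream())
        {
            requestStream.Write(body, 0, body.Length);
        }

        // Disposing the response releases the underlying connection.
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With this in place, the login and the submenu request would each become one call to FormPoster.PostForm with the same url and the same CookieContainer instance, passing the LoginPassword value first and the EditMenu value second.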

Comments
Thank you very much. This code helped me.
# Posted By Rajiv | 3/4/07 11:39 PM
That's what this blog is here for. Glad to be of assistance, Rajiv, and thank you for the kind response.
# Posted By Jeff Bouley | 3/5/07 9:05 AM
I used your code but my (HttpWebResponse)webReq.GetResponse() goes in hibernation. I waited for 5 mins, does not recover. If I issue it through fiddler, the things work fine. Why would that happen? Where do I see the request body?
# Posted By Ameya | 4/20/07 12:09 AM
Ameya, the content of the request is passed to the "test" var at the end. I'm not sure why you cannot step through the code and see this. Odd about fiddler allowing it to work...
# Posted By Jeff Bouley | 5/7/07 6:36 PM
Hi,
This is Shailesh. I found your site when I searched Google for help with the WebRequest method. I am trying to use your C# WebRequest spider for an address book grabber on the Yahoo website.

Please help me with this.
# Posted By shailesh | 5/23/07 9:20 AM
Shailesh, I'm not quite understanding what you're wanting to do. It seems like you stated you are wanting to dynamically pull back address data from yahoo. You will have to analyze the links required to traverse the address book information and then parse the results. The spider will definitely help you in pulling back the content at the very least. Start there and see if you can save the content to a file. Good luck.
# Posted By Jeff Bouley | 5/27/07 8:31 PM
Thanks for the reply.
I am trying to develop a contact list grabber like the one at http://www.webdataextractor.net. So far I have completed the Yahoo and MSN contact list grabbers, and I am now developing a Gmail contact list grabber.

Do you have any ideas about StreamReader and NetworkStream?
I have used them in the MSN contact list grabber, but it takes 3 minutes to grab the contacts.

Thanks
Shailesh
# Posted By shailesh | 6/6/07 9:05 AM
Hi Jeff,

Thank you for a great entry. I have been able to take your code and use it right away in one of my own applications.

Good Job!
# Posted By Morten | 7/25/07 4:36 PM
This appears to be pretty much what I needed, but just to confirm, What I've been looking for is code to submit information in a webform, and retrieve the resulting page. This code here would do that correct?

If so, is there any way to get an example C# solution that has everything this is missing, such as the missing "using" statements and such? I know I'm asking a lot :(

If not, Thanks for the code :)
# Posted By Alex | 12/2/07 6:34 PM
@Alex

I'm not sure I understand what you are asking for with regard to missing using statements; but yes, this code will submit information in a html form and pull back the resulting page.
# Posted By Jeff Bouley | 12/3/07 6:38 AM
What I mean by missing statements is simply that the code won't work on its own. For example, the "web.config" that you brought the values in from, and such. I meant examples of those. If that's not possible, I will just tinker around with it for a bit to try to figure it out :P

Thanks for the incredibly quick response, and for confirming that fact :)
Also, very nice blog here :)
# Posted By Alex | 12/3/07 2:44 PM
Thanks for the compliment Alex.

Actually in my example there are only 2 settings which you may not even need. You'll probably have to create your own. It was intranet so probably would be useless. I added the two example entries below to give you an idea.

<add key="URL" value="https://127.0.0.1/acm/acm.cgi"/>
<add key="LoginPassword" value="loginuser=theUser&amp;loginpassword=thePassword"/>
# Posted By Jeff Bouley | 12/4/07 9:11 PM
Thanks for your time Jeff, While this code is incredibly useful and I am glad I found it and took the time to learn it, it turns out it was not what I needed, though it is something I was going to need in one of my future projects. Its really useful code and it truly does its job well.

Thanks again for your time, patience, and assistance :)
# Posted By Alex | 12/5/07 3:46 PM
Great post!
You don't need to close the StreamReader since your code is inside a 'using' statement. 'using' will invoke sr.Dispose().
# Posted By Chris | 12/7/07 7:28 PM
@Chris

I hear you. But it doesn't hurt ;). Thanks for the kind words.
# Posted By Jeff Bouley | 12/7/07 11:18 PM
Would you help me to see why the code below does not work with the login of yahoo.com mail. Thanks. Please note the username and passwd of mine are okay, you can replace them with your yahoo id and test the code. It used to work before, but not any more.
Thanks.

---------------------------


using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Net;

namespace httpWebReqAndResp
{
class Program
{
static void Main(string[] args)
{
string strLoginName = "xxxxxxxxxx";
string strPassword = "xxxxxxxxxx";

try
{
string strURL = "http://login.yahoo.com/config/login?";
string strPostData = String.Format("login={0}&passwd={1}",
strLoginName.Trim(), strPassword.Trim());

// Setup the http request.
HttpWebRequest webReq = WebRequest.Create(strURL) as HttpWebRequest;
webReq.Method = "POST";
webReq.ContentLength = strPostData.Length;
webReq.ContentType = "application/x-www-form-urlencoded";
webReq.AllowAutoRedirect = true;
webReq.CookieContainer = new CookieContainer();

// Post to the login form.
StreamWriter streamRequestWriter = new StreamWriter(webReq.GetRequestStream());
streamRequestWriter.Write(strPostData);
streamRequestWriter.Close();

// Get the response.
HttpWebResponse webResp = (HttpWebResponse)webReq.GetResponse();

// Have some cookies.
CookieCollection cookieCollection = webResp.Cookies;

// Read the response

Stream datastream = webResp.GetResponseStream();
StreamReader reader = new StreamReader(datastream);
String strResponseFromServer = reader.ReadToEnd();

Console.WriteLine(strResponseFromServer);
Console.ReadLine();
}

catch (WebException e)
{
Console.WriteLine("\nMain 1 Exception raised!");
Console.WriteLine("\nMessage 1:{0}", e.Message);
Console.WriteLine("\nStatus 1:{0}", e.Status);
Console.WriteLine("Press any key to continue..........");
Console.ReadLine();
}
catch (Exception e)
{
Console.WriteLine("\nMain 2 Exception raised!");
Console.WriteLine("Source 2:{0} ", e.Source);
Console.WriteLine("Message 2:{0} ", e.Message);
Console.WriteLine("Press key to continue..........");
Console.ReadLine();
}
}
}
}
# Posted By Minh Man Le | 12/17/07 4:42 PM
I am getting the error "(407) Proxy Authentication Required". Can you please suggest a solution for it?
# Posted By Sudeep | 12/28/07 1:10 AM
Hey Jeff,

Thanks for the code, saves me from having to think and write. I know, i'm lazy! But that's why Google is there for, isn't it?

Good luck with your projects!
# Posted By gices | 2/12/08 6:40 AM
Hi,
I have a URL like https://username@test.com:password@test.com/filename.ext
I have to get the username, password, and filename from it and need to authenticate against the website user/pass. Once authenticated, I need to provide the file mentioned in the URL. I am unable to parse the URL.
Please help me out.
# Posted By Nick | 4/21/08 5:03 PM
@Nick, can't you just do a split? Capture the url in a variable and run this code to split the contents into an array:


char[] delimiterChars = { '@' };

string text = "https://username@test.com:password@test.com/filename.ext";
System.Console.WriteLine("Original text: '{0}'", text);

string[] urlParts = text.Split(delimiterChars);
System.Console.WriteLine("{0} parts in text:", urlParts.Length);

foreach (string s in urlParts)
{
      System.Console.WriteLine(s);
}
# Posted By Jeff Bouley | 4/22/08 7:45 AM
I am able to parse the URL. Basically I am working with the WebRequest class like
string url = "https://username:password@server/filename.xml.zip";
WebRequest request = WebRequest.Create(strURI);
response = request.GetResponse();

I am getting error on GetResponse. like
InnerException = {"No connection could be made because the target machine actively refused it"}

I am sure my username and password are good.
# Posted By Nick | 4/22/08 5:14 PM
thank you this is very useful for me thank u
# Posted By muthukumar | 6/17/08 4:45 AM
This is very useful for maintaining state in a web service.
It also helped me post more than one request while maintaining the same session state. Thank you.
# Posted By muthukumar | 6/17/08 4:47 AM
Hi,
The response from the login page of a site is <script>location.replace("homepage.aspx")</script>
When I read the response and write it in my page, the URL looks for the homepage.aspx page on my server and gives a resource not found error.

Any idea how to handle this?
# Posted By Satish | 6/30/08 2:24 PM
@Satish, sorry for the slow reply, very busy these days. You need to check for that script tag and then append the URL.

I recommend parsing out the aspx file name and, if no directory structure is present, prefixing it with the URL. I'm guessing you are simply being redirected via JavaScript after a successful login?

Hope this helps.
# Posted By Jeff Bouley | 7/15/08 2:09 PM
After successful login, the response being returned is
location.replace("homepage.aspx"). It also returned few cookies.
Like you said, I tried appending the URL within the JavaScript, but
the remote web server treated this as an entirely new request, redirecting back to the
login page.
I did send the cookies through the request.

Thank you
Satish
# Posted By Satish | 7/16/08 9:46 AM
Took me forever to figure this out.

Boolean Certificate_isValid(object sender, X509Certificate certificate, X509Chain chain, SslPolicyErrors sslPolicyErrors) {
X509Store store = new X509Store(StoreName.My, StoreLocation.LocalMachine);
store.Open(OpenFlags.OpenExistingOnly | OpenFlags.ReadOnly | OpenFlags.IncludeArchived | OpenFlags.MaxAllowed);
return store.Certificates.Contains(certificate);
}
...
ServicePointManager.ServerCertificateValidationCallback = Certificate_isValid;


-AH
# Posted By alex | 9/12/08 7:03 PM
I don't think you need to pass a blank cookiecontainer, then iterate the container and add it to myContainer - I believe you can set myContainer as the cookie container on the first call and it will be pulled in - doesn't make sense but it seems .net passes it byref instead of byval
# Posted By Andrew Traub | 9/16/08 10:31 PM
Excellent post it has been very helpful!

I have a question concerning what to do AFTER you login to a page.

For example, I want to login to www.foxsheets.com (a sports info site) and once logged in navigate to a Yankees matchup page (containing sports stats about the game) and get the content of the page. I would then have a local copy of the Yankees game data to read off-line.

I have been able to login, but once I am logged in if I try to navigate to another url in the domain it loses the fact that I am logged in and the website reports back "you are not logged in". Here is what I thought would work:

public static string LoginToFormUsingPostAndGetContentOfSomeUrl(string loginUrl, string strPost, string someUrl, string strGet)
{
StreamWriter myWriter = null;
CookieContainer myContainer = new CookieContainer();

HttpWebRequest objRequest = (HttpWebRequest)WebRequest.Create(loginUrl);
objRequest.Method = "POST";
objRequest.ContentLength = strPost.Length;
objRequest.ContentType = "application/x-www-form-urlencoded";
objRequest.CookieContainer = new CookieContainer();

try
{
myWriter = new StreamWriter(objRequest.GetRequestStream());
myWriter.Write(strPost);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
finally
{
myWriter.Close();
}

HttpWebResponse objResponse = (HttpWebResponse)objRequest.GetResponse();

foreach (Cookie cook in objResponse.Cookies)
{
myContainer.Add(cook);
}

// up to here everything worked fine, I can read the response stream and see that I am logged in.

objRequest = (HttpWebRequest)WebRequest.Create(someUrl);
objRequest.Method = "GET";
objRequest.ContentType = "text/xml; encoding='utf-8'";
// I thought that saving the cookie would preserve my login, I do this here:
objRequest.CookieContainer = myContainer;

try
{
myWriter = new StreamWriter(objRequest.GetRequestStream());
myWriter.Write(strPost);
}
catch (Exception e)
{
Console.WriteLine(e.Message);
}
finally
{
myWriter.Close();
}
objResponse = (HttpWebResponse)objRequest.GetResponse();

using (StreamReader sr =
new StreamReader(objResponse.GetResponseStream()))
{
// Here is where I get "not logged in" (I've been redirected to another url because it thinks I'm not logged in anymore)
String test = sr.ReadToEnd();

return test;
}
}

// and a unit test for this:

[Test]
public void LoginToFormUsingPostAndGetContentOfSomeUrl()
{

string loginUrl = "http://foxsheets.statfox.com/login/submit.asp";
string strPost = "UserName=MyUserName&Password=MyPassword&image.x=31&image.y=7";
string someUrl = "http://foxsheets.statfox.com/foxsheet.asp?s=mlb&am...
string strGet = "/foxsheet.asp?s=MLB&g=20080917NYYANKEES&r=at HTTP/1.1";

string foxsheetsContent = WebsiteContentDownloader.LoginToFormUsingPostAndGetContentOfSomeUrl(loginUrl, strPost, someUrl, strGet);

Regex r = new Regex("CHI WHITE SOX (83 - 66) at NY YANKEES (80 - 70)");

Match m = r.Match(foxsheetsContent);

Assert.IsTrue(m.Success);
}


Any help? Much thanks!
# Posted By Tom H | 9/17/08 3:37 AM
It works for me ok where there is no login. It's great.

But In event of a POST login it's not working for me..

This is the sourcecode of loginpage:

<form method="POST" action="/index.php" name="flogin">
      <input type="hidden" name="query" value="">
<!--   <form method="POST" action="main.php?idpage=index"> -->
   <tr>

    <td class="botonlogin">
         Usuario: <input class="botonlogin" type="text" name="tUsuario" size="8"> Clave: <input class="botonlogin" type="password" name="tClave" size="8"><input type="submit" value="Enviar" name="bEnviar" class="botonlogin">
            <input type="hidden" name="param"><a href="index.php?idpage=enviarpasswd">No me acuerdo de mi clave</a>
   </td>
   </tr>
</form>

(www.virtuamanager.com)

I use as strPost

"tUsuario=LOGIN&amp;tClave=PASSWORD"

and as url http://www.virtuamanager.com/index.php
and get no authentication cookie.
Any idea on this please? I'm lost.
# Posted By cad | 10/17/08 8:44 AM
I'm more interested in the entire paradigm of displaying output over-the-top to HDTV?
# Posted By Clinton Gallagher | 5/25/09 12:25 PM
Hello, I have some doubts about the code of spider that you have created.

1 - If I wanted to log in to www.deltron.com.pe, would I use the same code you have?

2 - If instead I wanted to get all the search combinations at http://www.gallito.com/autos, how much would the code change?

Thanks for your answers !!
# Posted By polonet | 3/3/10 10:29 PM
Thanks for the post. I'm trying to write something similar in Perl. It works fine with websites that don't require cookies. Unfortunately Perl doesn't have objects. Any ideas on how to do this with methods only?
# Posted By John | 7/29/10 9:39 AM
Hi John, wow! It has been 15 years since I wrote in perl... Good luck there. If you are having to use Perl due to legacy or current infrastructure, I feel for you. I recommend you attempt to sell mgmt. on the benefits of a more modern/popular language like c# or java.

Don't get me wrong, I loved Perl when I used it, but there are better mousetraps now...
# Posted By Jeff Bouley | 7/29/10 11:11 AM

Copyright Strikefish, Inc., 2005. All rights reserved.