Visual Basic / ASP - web agents and web automation
introduction
The internet is increasingly used to build application-like interfaces. It is a huge
source of information that anyone can use, provided they have a browser and can
occasionally fill out a form with the requested information or follow a simple link.
Interfaces on the internet have been designed with people in mind. Once you start thinking
about what a pity it is that only humans can use the information on the
internet, you'll understand why I wrote this component. It is a server-side ActiveX
component that can be used to 'automate' the web. I also translated the code to pure
VBScript for use in ASP. If you already know a bit about web automation and agent technology,
I have listed some alternatives to my component below, in case it is not exactly what
you need. If you have no idea what I'm on about, feel free to skip the alternatives and
read the explanation of web automation further down first.
alternatives
There are alternatives to this component, of course. I have put some time and effort into some of them myself. I'll name them, along with what I see as their disadvantages, so you can check them out yourself if you want to:
- 1. WEBL (say: webble)
- WEBL is a technology developed by Compaq, and it has a lot of advanced features such as FTP and
XML support. It is, however, no longer being developed. I downloaded the latest version (3.0.h)
and found a bug in it. I had to repair it manually and recompile the Java code to
get it to work. WEBL users reported this bug some time ago, but a fix never
came, because no one was developing WEBL anymore. It is a scripting language that you run from
a Java environment.
It is very powerful, though. If you need more power, check it out, but I must warn you that it has a steep learning curve and its error messages aren't very helpful. To run it you need either the Java SDK or the JRE.
- 2. Web Automation Toolkit
- W.A.T. is a tool that webMethods gave away for free some time ago. It was a very
powerful and user-friendly tool. It analyzed a web page, showed you the form fields, and let you
automate those and collect the results, all from a pleasant GUI.
The toolkit has since been taken down, and I mailed webMethods three times in vain to ask for a copy. They say they no longer have one. The technology has been integrated into their business-to-business software and is no longer available as a standalone product. So this was promising, but it is no longer available.
- 3. Rebol
- To be honest, I haven't put much effort into Rebol. It seems like a very powerful
language that can do a lot more than just web automation, but I needed a quick solution, and if
I can avoid extra bells and whistles, I will.
Rebol is a scripting language, not unlike Perl, that you can run from a web server or as a standalone program. It has a very small footprint, supports all major TCP/IP protocols (such as FTP, HTTP and POP3) and makes them scriptable. The author is one of the people who worked on the Amiga OS. I'm still a big fan of the Amiga, which is why I found Rebol interesting.
Check it out if you're interested in running a very powerful, yet very small scripting language.
Web Automation: an explanation
The best way I can explain the advantages of web automation is to give you an example. Suppose you have a website that sells posters through two or three affiliates. Let's say your affiliates are www.postersfast.com and www.posternow.com. Each time you want to update your site, you have to visit both posternow and postersfast and see whether they have any posters that match your criteria (in other words, any posters you want on your site). Once you have found them, you have to add them to your site, including thumbnails and information on where they are located, and make them visible on your own site.
This example isn't far from the problem I was facing when I wrote the tools to keep my own site, www.babezone.org, up to date. I wanted to make adding a poster to the site a lot easier and a lot less manual. The problem is that the web is designed to be used by humans. And although more and more sites make (parts of) their content available as XML data feeds (babezone.org also does this), most of the web is still only available to human eyes.
Web automation is done by web agents: small programs that interact with a website to collect information. That is why they are sometimes called 'information harvesters' or 'data-mining' programs. They mimic human behaviour in a web browser and analyse and filter the data that is returned.
For my own purposes I needed my web agents to accomplish the following tasks:
- access a webpage and fill out a form using both GET and POST
- analyse and filter returned data from a webpage
I started out writing the code in Visual Basic. The idea was to compile the code into a .DLL file I could use on both a server and a normal workstation. The .DLL contains the functions to fill out HTML forms using both POST and GET. It also has filter functions, so that you can extract the part of the HTML you want and use it. Because it is a .DLL, you can use it from any Windows code: ASP, C++, VB.
I decided not to publish the .DLL here, because it was unstable the last time I tried it and I no longer have the older version of Visual Studio installed in which I developed it. Also, the ASP code below is actually far more sophisticated than the VB code of the component. If you are really interested in it, mail me using the form on the sidebar of this site.
Because not all providers allow people to use their own .DLL files on a server, I also translated the code into pure VBScript for use in ASP. This resulted in a couple of VBScript functions, which I outline below.
ASP - filtering HTML streams
- ClearHTMLTags
- Strips all HTML tags from a piece of text. This is useful once you have found the element you need and want to remove any HTML formatting:
function ClearHTMLTags(strHTML)
    dim regEx, strTagLess
    '---------------------------------------
    strTagLess = strHTML
    set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    '---------------------------------------
    regEx.Pattern = "<[^>]*>" 'this pattern matches any html tag
    strTagLess = regEx.Replace(strTagLess, "") 'all html tags are stripped
    '---------------------------------------
    set regEx = nothing
    ClearHTMLTags = strTagLess
end function
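As a usage illustration, the snippet below runs a small, made-up fragment of table markup through ClearHTMLTags. The helper CollapseWhitespace is hypothetical (it is not part of the original code); it is only there because stripped markup tends to leave runs of spaces behind. Treat it as a sketch rather than a finished routine.

```vbscript
' Compact copy of ClearHTMLTags from above, so this snippet runs on its own.
Function ClearHTMLTags(strHTML)
    Dim regEx
    Set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    regEx.Pattern = "<[^>]*>"
    ClearHTMLTags = regEx.Replace(strHTML, "")
    Set regEx = Nothing
End Function

' Hypothetical helper (an assumption, not in the original code): collapses
' runs of whitespace into single spaces and trims both ends.
Function CollapseWhitespace(strText)
    Dim regEx
    Set regEx = New RegExp
    regEx.Global = True
    regEx.Pattern = "\s+"
    CollapseWhitespace = Trim(regEx.Replace(strText, " "))
    Set regEx = Nothing
End Function

Dim sample, plain
' Made-up sample markup
sample = "<td><font size=""2""><b>Poster:</b>  Sunset</font></td>"
plain = CollapseWhitespace(ClearHTMLTags(sample))
' plain now holds: Poster: Sunset
```

Note that HTML entities such as &amp;nbsp; are not tags, so they survive this step and would need separate handling.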
- GetTag
- Expects the following inputs:
strHTML - the input HTML string you want filtered
GT_TagToGet - the name of the HTML tag you want
GT_TagIndex - which occurrence you want (by number, starting at 1)
Example: GetTag(returncode,"font",1) returns the first opening FONT tag (including its attributes) found in returncode
function GetTag(strHTML,GT_TagToGet,GT_TagIndex)
    'Variables used in the function
    dim regEx, strResult, expressionmatch, expressionmatched, matchcount
    '---------------------------------------
    matchcount = 0
    strResult = strHTML
    set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    '---------------------------------------
    '\b keeps a search for "b" from also matching <body> or <br>
    regEx.Pattern = "<" & GT_TagToGet & "\b[^>]*>"
    Set expressionmatch = regEx.Execute(strResult)
    If expressionmatch.Count > 0 Then
        For Each expressionmatched in expressionmatch
            matchcount = matchcount + 1
            If matchcount = GT_TagIndex Then
                strResult = expressionmatched.Value
                Exit For
            End If
        Next
    Else
        strResult = regEx.Pattern & " was not found in the string: " & strHTML & "."
    End If
    set regEx = nothing
    GetTag = strResult
end function
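GetTag returns the opening tag itself. If you want the text between the opening and closing tag instead, something along the following lines might work. GetTagContents is a hypothetical name, not part of the original code, and the sketch assumes that tags of the same name are not nested inside each other.

```vbscript
' Hypothetical helper (an assumption, not in the original code): returns
' the text between the n-th <tag ...> and its closing </tag>.
' Does not cope with nested tags of the same name.
Function GetTagContents(strHTML, strTag, intIndex)
    Dim regEx, matches
    Set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    ' ([\s\S]*?) lazily captures everything, including line breaks,
    ' up to the first matching closing tag
    regEx.Pattern = "<" & strTag & "\b[^>]*>([\s\S]*?)</" & strTag & "\s*>"
    Set matches = regEx.Execute(strHTML)
    If intIndex >= 1 And intIndex <= matches.Count Then
        ' The Matches collection is zero-based
        GetTagContents = matches(intIndex - 1).SubMatches(0)
    Else
        GetTagContents = ""
    End If
    Set regEx = Nothing
End Function

Dim html, secondFont
' Made-up sample markup
html = "<p><font color=""red"">first</font> and <FONT>second</FONT></p>"
secondFont = GetTagContents(html, "font", 2)
' secondFont now holds: second
```

Returning an empty string when the index is out of range is a design choice of this sketch; it keeps the result easy to test with Len().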
- Tag
- Explanation is included in the function
'*****************************************************************
'* Function Tag - finds the start positions of all tags of a type
'* [T_TagToGet] in [T_DocPart] and puts them into an array. Index 0
'* of the array contains the number of tags that were found.
'* c 2000 - Erik Oosterwaal
'*****************************************************************
Function Tag(byval T_DocPart, byval T_TagToGet)
    Dim found
    Dim counter
    Dim AllTags()
    'size the array up front, so AllTags(0) exists even when no tag is found
    ReDim AllTags(0)
    found = 1
    counter = 1
    T_TagToGet = "<" & T_TagToGet
    while Instr(found,T_DocPart,T_TagToGet) <> 0
        found = Instr(found,T_DocPart,T_TagToGet)
        ReDim preserve AllTags(counter)
        AllTags(counter)=found
        'response.write("number:" & counter & "position:" & found)
        found = found + 1
        counter = counter + 1
    wend
    AllTags(0)=counter-1
    Tag = AllTags
End Function
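The array that Tag returns can be walked with a simple loop. The snippet below is a sketch of that: it collects every opening A tag from a made-up string by combining each position with InStr and Mid. The trailing space in "a " is an assumption of this example, used so that a tag such as ABBR is not picked up as well.

```vbscript
' Variant of the Tag function above, with the array sized up front so
' that input without matches still returns a count of 0.
Function Tag(ByVal T_DocPart, ByVal T_TagToGet)
    Dim found, counter
    Dim AllTags()
    ReDim AllTags(0)
    found = 1
    counter = 0
    T_TagToGet = "<" & T_TagToGet
    Do
        found = InStr(found, T_DocPart, T_TagToGet)
        If found = 0 Then Exit Do
        counter = counter + 1
        ReDim Preserve AllTags(counter)
        AllTags(counter) = found
        found = found + 1
    Loop
    AllTags(0) = counter
    Tag = AllTags
End Function

Dim doc, positions, i, closePos, tags
' Made-up sample markup
doc = "<a href=""one.htm"">1</a> <b>x</b> <a href=""two.htm"">2</a>"
positions = Tag(doc, "a ")
tags = ""
For i = 1 To positions(0)
    ' Find the ">" that closes this opening tag and cut the tag out
    closePos = InStr(positions(i), doc, ">")
    tags = tags & Mid(doc, positions(i), closePos - positions(i) + 1) & vbCrLf
Next
' tags now holds the two opening <a ...> tags, one per line
```

Keep in mind that InStr compares case-sensitively by default, so this finds lowercase tags only unless you pass vbTextCompare as a fourth argument.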
- GetAttribute
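- Retrieves the value of an attribute from a given piece of HTML. The value must be enclosed in double quotes. If the attribute occurs more than once in the input, the value of the last occurrence is returned.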
function GetAttribute(strHTML,GT_AttToGet)
    'Variables used in the function
    dim regEx, strResult, expressionmatch, expressionmatched
    '---------------------------------------
    strResult = strHTML
    set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    '---------------------------------------
    regEx.Pattern = GT_AttToGet & "\s*=\s*""([^""]+)"""
    Set expressionmatch = regEx.Execute(strResult)
    If expressionmatch.Count > 0 Then
        For Each expressionmatched in expressionmatch
            strResult = expressionmatched.Value
        Next
        regEx.Pattern = """[^""]*"""
        Set expressionmatch = regEx.Execute(strResult)
        For Each expressionmatched in expressionmatch
            strResult = expressionmatched.Value
        Next
    Else
        strResult = GT_AttToGet & " was not found in the string: " & strHTML & "."
    End If
    set regEx = nothing
    GetAttribute = Replace(strResult,"""","")
end function
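When a page contains many links, you may want every value of an attribute rather than a single one. The following is a rough sketch of that idea using the SubMatches collection of the RegExp object. GetAttributeValues is a hypothetical name, not part of the original code, and like GetAttribute it only recognises double-quoted values.

```vbscript
' Hypothetical helper (an assumption, not in the original code): returns
' an array with the values of every occurrence of an attribute.
' Only double-quoted values are recognised.
Function GetAttributeValues(strHTML, strAtt)
    Dim regEx, matches, values(), i
    Set regEx = New RegExp
    regEx.IgnoreCase = True
    regEx.Global = True
    ' The parentheses capture the text between the quotes
    regEx.Pattern = "\b" & strAtt & "\s*=\s*""([^""]*)"""
    Set matches = regEx.Execute(strHTML)
    If matches.Count = 0 Then
        GetAttributeValues = Array()   ' empty array, UBound is -1
    Else
        ReDim values(matches.Count - 1)
        For i = 0 To matches.Count - 1
            values(i) = matches(i).SubMatches(0)
        Next
        GetAttributeValues = values
    End If
    Set regEx = Nothing
End Function

Dim page, links
' Made-up sample markup
page = "<a href=""one.htm"">1</a><A HREF = ""two.htm"">2</A>"
links = GetAttributeValues(page, "href")
' links(0) holds one.htm and links(1) holds two.htm
```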
ASP - POST and GET to other sites and retrieve returned HTML
The pieces of code above let you filter any HTML you want.
Use them at your own discretion, but please name me as the original author.
To be able to use them, you will have to be able to retrieve HTML from a different site. There are some commercial and some freeware HTTP components that can do this for you, but the whole point of these functions is that we don't need extra components!
Luckily, this is where Microsoft comes in. If you are hosted on an IIS server and your provider has kept it up to date, the server has a component installed called MSXML. MSXML comes in several versions, and all of them offer XMLHTTP, though some versions do it better than others.
The newer versions have a special server-side XMLHTTP object (ServerXMLHTTP) that seems to work better than the one in the older versions. Personally, I have never encountered any complicated problems using MSXML 2.x and the standard XMLHTTP object. You'll have to find out which version(s) of MSXML your provider has and create the corresponding object. Below I have written a small piece of code that should get you started:
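If you don't know which MSXML version is installed, one possible approach is to try a list of ProgIDs from newest to oldest and keep the first one that can be created. The sketch below does that. CreateHttpObject is a hypothetical helper name, and the order of the list is my assumption about what you would prefer. It uses plain CreateObject so that it also runs outside ASP; in an ASP page you could use Server.CreateObject instead.

```vbscript
' Hypothetical helper (an assumption, not in the original code): returns
' the first HTTP object that can be created, or Nothing if none can.
Function CreateHttpObject()
    Dim progIds, i, obj
    progIds = Array("MSXML2.ServerXMLHTTP.6.0", "MSXML2.ServerXMLHTTP.3.0", _
                    "MSXML2.ServerXMLHTTP", "Microsoft.XMLHTTP")
    Set CreateHttpObject = Nothing
    On Error Resume Next            ' a missing ProgID raises an error
    For i = 0 To UBound(progIds)
        Err.Clear
        Set obj = CreateObject(progIds(i))
        If Err.Number = 0 Then
            Set CreateHttpObject = obj
            Exit For
        End If
    Next
    On Error GoTo 0                 ' restore normal error handling
End Function

Dim http
Set http = CreateHttpObject()
If http Is Nothing Then
    ' No usable MSXML version was found on this machine
End If
```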
- Example: retrieving HTML from another site
- Explanation is in the code.
Dim xml, returncode
' Create an XMLHTTP object:
Set xml = Server.CreateObject("Microsoft.XMLHTTP")
' Or, with version 3.0 of MSXML, use the server-side object:
' Set xml = Server.CreateObject("MSXML2.ServerXMLHTTP")
' Open the connection to the remote server (False = wait for the response):
xml.Open "POST", "http://www.detelefoongids.nl/wit/scripts/dtgi.dll/white", False
' When POSTing form data to a dll, you'll often have to add the following line:
' xml.setRequestHeader "Content-Type","application/x-www-form-urlencoded"
' Send the remote page what it expects; you'll have to deduce this from
' the form that normally submits to it.
' In this example I have re-created a version of the original form,
' and it submits to this code.
' This way I can just pass the Request.QueryString on to the remote page.
' Send actually performs the request:
xml.Send CStr(Request.QueryString)
' responseText holds the data that came back:
returncode = xml.responseText
Set xml = Nothing
' Now filter the returncode and do with the data what you like
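If you are not simply passing a query string along, you have to build the POST body yourself as name=value pairs joined with &amp;, with special characters percent-encoded. The sketch below shows one way this could be done. UrlEncodeAscii, BuildFormBody and PostForm are hypothetical helper names, the field names are made up, and the encoder assumes plain ASCII input. Inside an ASP page, the built-in Server.URLEncode is the more obvious choice for the encoding step.

```vbscript
' Hypothetical helper: percent-encodes a string for a form body.
' Assumes plain ASCII input; other characters need extra handling.
Function UrlEncodeAscii(strText)
    Dim i, ch, code, result
    result = ""
    For i = 1 To Len(strText)
        ch = Mid(strText, i, 1)
        code = AscW(ch)
        If (code >= 48 And code <= 57) Or (code >= 65 And code <= 90) Or _
           (code >= 97 And code <= 122) Or InStr("-_.", ch) > 0 Then
            result = result & ch                       ' safe character
        ElseIf code = 32 Then
            result = result & "+"                      ' space becomes +
        Else
            result = result & "%" & Right("0" & Hex(code), 2)
        End If
    Next
    UrlEncodeAscii = result
End Function

' Hypothetical helper: joins parallel arrays of field names and values
' into the name1=value1&name2=value2 form.
Function BuildFormBody(arrNames, arrValues)
    Dim i, body
    body = ""
    For i = 0 To UBound(arrNames)
        If i > 0 Then body = body & "&"
        body = body & UrlEncodeAscii(arrNames(i)) & "=" & UrlEncodeAscii(arrValues(i))
    Next
    BuildFormBody = body
End Function

' Sketch of a POST using such a body. Defined but not called here,
' because it needs network access and a real target URL.
Function PostForm(strUrl, strBody)
    Dim http
    Set http = CreateObject("MSXML2.ServerXMLHTTP")
    http.Open "POST", strUrl, False
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.Send strBody
    PostForm = http.responseText
    Set http = Nothing
End Function

Dim formBody
' Made-up field names and values
formBody = BuildFormBody(Array("name", "city"), Array("Jan de Vries", "Den Haag & omstreken"))
' formBody now holds: name=Jan+de+Vries&city=Den+Haag+%26+omstreken
```

The result of PostForm could then be handed to the filter functions above, for example ClearHTMLTags.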