Web data mining using Tidy and XPath

Mining with a new drill

This is the follow-up to web agents and web automation in ASP. While rewriting babezone.org I came across my original data mining concepts. The method I use there is crude, but effective. Of course, I wanted refined and effective. So this is part two of "datamining in ASP"... Hold on to your chairs, kids.

After writing my first ASP functions for retrieving and filtering webpages I was pretty happy. It worked, albeit a little quirkily at times. I found that the filtering functions sometimes weren't sophisticated enough to extract the exact information I needed. So I went looking on the internet for similar solutions.
That's when I came across this excellent paper by Jussi Myllymaki and Jared Jackson, both researchers at IBM. They use a similar technique, but with a method that is much more reliable (read the paper for more details). In short, they retrieve the HTML page, convert it to XHTML with Tidy, and then pick out the data they need with XML tools such as XPath.

They use WHAT??

Tidy (also known as HTML Tidy) is a piece of software that can take HTML and clean it up for you. It can make sure all tags are neatly closed, make sure they are nested properly, and lay out the code for you. Best of all, Tidy is now an open source project, so projects based on it are popping up everywhere. As I mentioned before, Tidy is also capable of converting HTML into XHTML. XHTML is an XML variant of HTML. The great advantage is that once Tidy has converted the document, we can process it with more or less standard XML methods. I imagined this would be an excellent improvement to my own filtering functions, so I began carefully figuring things out.

Converting Jussi and Jared's concept to ASP/VBScript

The method described in the IBM paper uses Java. Of course, I'm using ASP on my website, so I needed a way to use Tidy from a scripting environment. Since Tidy is open source, there are a lot of variants floating around, and after a while I found what I thought would solve my problems: a COM/ATL wrapper for Tidy. First, let me explain the difference between a DLL and a COM DLL (which I found out the hard way):
A DLL contains functions and classes that you can call from code written in a programming language. A COM DLL is a DLL that you can register in Windows. Once registered, a COM DLL exposes some of its functions for use from, say, ASP =).
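To make the difference concrete, here is a minimal sketch of how you could check from ASP whether a COM component is usable on a server. The helper name ComponentAvailable is hypothetical (it is not part of ASP or of Tidy); the ProgID "TidyCOM.TidyObject" is the one used further down in this article, and regsvr32 is the standard Windows tool for registering a COM DLL.

```vbscript
' Hypothetical helper: returns True when an object can be created for the
' given ProgID, which normally means its COM DLL has been registered
' (for example with: regsvr32 TidyCOM.dll).
Function ComponentAvailable(progId)
    Dim obj
    On Error Resume Next              ' creation fails with a runtime error
    Set obj = Server.CreateObject(progId)
    ComponentAvailable = (Err.Number = 0)
    Err.Clear
    On Error GoTo 0
    Set obj = Nothing
End Function

If ComponentAvailable("TidyCOM.TidyObject") Then
    Response.Write "TidyCOM can be created on this server"
Else
    Response.Write "TidyCOM cannot be created; is the DLL registered?"
End If
```

Note that creation can also fail for other reasons, such as permissions, so treat a False result as a hint rather than proof that the DLL is unregistered.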
The documentation of the COM/ATL wrapper focused heavily on .NET. I struggled for a while, but I was unable to use it properly in ASP. The problem was that I could not find a way to set options with the "SetOptBool" method. I used the code below:

Dim Tidy, ok
Set Tidy = Server.CreateObject("Tidy.Document")
ok = Tidy.SetOptBool ("TidyForceOutput", 1)

I tried all sorts of variations, like:

Tidy.SetOptBool 1,TidyForceOutput
Tidy.SetOptBool 1,"TidyForceOutput"
Tidy.SetOptBool ("TidyForceOutput", 1)
Tidy.SetOptBool ("TidyForceOutput", True)
Tidy.SetOptBool "TidyForceOutput", 1

None of them worked. This was a pretty big setback, because I needed to configure Tidy to output XML. After looking around a bit more, I found an alternative component that contained Tidy. It is a bit older, but for my purposes it would do. The component I found was TidyCOM.
Now TidyCOM fit the bill perfectly. It was configurable (at last!), and although it lacks a lot of the functionality of the newest versions of Tidy, it did what it should: convert HTML to X(ht)ML.

Update:

In the meantime, I received a mail from Claude Henchoz, who has actually found a workaround for this problem!
Here's what he mailed me:

I wanted to do the exact same thing and could also not figure out how to use it. But I've found a workable workaround:

Use an external tidy-configfile.

I have this file:

tidyconf.txt
---------------------
output-xhtml: yes
numeric-entities: yes
---------------------

and then I include that in the VBScript portion like this:

iStatus = oTidy.LoadConfig("tidyconf.txt")

You might want to add this to your page, because nothing like this is documented anywhere.

Cheers, Claude

I haven't tried it myself, but for anyone out there struggling: it can't hurt to give it a go. Thanks, Claude!
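Put together, Claude's workaround might look something like the sketch below. Only the "Tidy.Document" ProgID and the LoadConfig call come from the text above; resolving the file with Server.MapPath and treating a non-zero status as a failure are my own assumptions, and like the tip itself this is untested.

```vbscript
' Sketch only: "Tidy.Document" and LoadConfig are taken from the article,
' everything else is an assumption.
Dim oTidy, iStatus, sConfigPath

' Assumption: the component wants a full file system path, so the
' virtual path is resolved first. tidyconf.txt sits next to this page.
sConfigPath = Server.MapPath("tidyconf.txt")

Set oTidy = Server.CreateObject("Tidy.Document")
iStatus = oTidy.LoadConfig(sConfigPath)

' Assumption: 0 means the config file was loaded without problems.
If iStatus <> 0 Then
    Response.Write "LoadConfig returned " & iStatus & " for " & _
                   Server.HTMLEncode(sConfigPath)
End If
```

The point of the config file is that options such as output-xhtml are set by name inside the file, so SetOptBool is never called from script.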

Tidying things up

After registering TidyCOM on my development server and fiddling for a while, I got it to work. The function below takes a piece of HTML (for example, the output of the code under "ASP - POST and GET to other sites and retrieve returned HTML" in the earlier article, "web agents and web automation in ASP") and converts it to XHTML.

Function XHTML(code)
    Dim TidyObj
    Set TidyObj = Server.CreateObject("TidyCOM.TidyObject")

    Dim tmp
    TidyObj.Options.Doctype = "loose"
    TidyObj.Options.DropFontTags = True
    TidyObj.Options.OutputXml = True ' set the output type to XML
    TidyObj.Options.QuoteAmpersand = True
    TidyObj.Options.CharEncoding = 0
    TidyObj.Options.Indent = 2 ' AutoIndent
    TidyObj.Options.TabSize = 8
    tmp = TidyObj.TidyMemToMem(code)
    Set TidyObj = Nothing
    XHTML = tmp
End Function

As you can see, TidyCOM is quite easy to use. I ran into two problems you'll have to keep in mind:

I solved the first problem by stripping all ampersands from the HTML with the standard VBScript "Replace()" function before handing it to TidyCOM.
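As a minimal sketch, the ampersand filter can be wrapped around the XHTML() function from above. The name SafeXHTML is hypothetical; the only calls used are VBScript's built-in Replace() and the XHTML() function shown earlier.

```vbscript
' Hypothetical wrapper: strips ampersands, then converts with XHTML().
Function SafeXHTML(code)
    Dim cleaned
    ' Remove every "&" so that TidyCOM never sees one. This is crude:
    ' an entity such as "&nbsp;" is left behind as the text "nbsp;".
    cleaned = Replace(code, "&", "")
    SafeXHTML = XHTML(cleaned)
End Function
```

The trade-off is that character entities in the page are damaged, which is fine when you only want product names or image paths, but worth remembering if the text you extract contains special characters.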

Now just follow the yellow brick XPath

Well, now that we have our document as XML, we have a lot of options for finding whatever information we need. If you want to, you can load the document as a DOM and traverse the DOM tree looking for the element you need, but I tried to stay close to what the IBM paper suggested. So XPath it was. I wrote a simple function that takes a piece of XML and an XPath expression as arguments and returns the value of the node the path points to.

Function GetTag(sXML, NodePath)
    Dim oXMLDoc, tXML
    Dim oXMLNode
    Set oXMLDoc = Server.CreateObject("MSXML2.DOMDocument.4.0")
    ' Strip any remaining ampersands so they cannot break the XML parser
    tXML = Replace(sXML, "&", "")
    oXMLDoc.LoadXML tXML
    If oXMLDoc.ParseError.errorCode <> 0 Then
        GetTag = "PARSE ERROR LINE " & oXMLDoc.ParseError.line & _
                 " REASON: " & oXMLDoc.ParseError.reason
    Else
        Set oXMLNode = oXMLDoc.selectSingleNode(NodePath)
        If oXMLNode Is Nothing Then
            GetTag = "" ' no node matched the XPath
        Else
            GetTag = oXMLNode.Text
        End If
    End If
    Set oXMLNode = Nothing
    Set oXMLDoc = Nothing
End Function

Please note that I use MSXML 4 here; this is necessary for using XPath functions. MSXML4 can be found here. If the document doesn't parse for some reason, the function returns the line number and the reason. If it does parse, the function returns the value of the XPath-path (can I say that?).
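GetTag only returns the first match. When a page contains several matching elements, a variation along the following lines could collect them all. This is a sketch, not part of the original code: the name GetTags is hypothetical, while selectNodes, length and item are standard MSXML DOM members.

```vbscript
' Hypothetical companion to GetTag: returns an array holding the text of
' every node that matches NodePath (an empty array on no match or on a
' parse error).
Function GetTags(sXML, NodePath)
    Dim oXMLDoc, oNodes, aResult, i
    Set oXMLDoc = Server.CreateObject("MSXML2.DOMDocument.4.0")
    aResult = Array()                         ' start with an empty array
    ' LoadXML returns True when the document parsed successfully
    If oXMLDoc.LoadXML(Replace(sXML, "&", "")) Then
        Set oNodes = oXMLDoc.selectNodes(NodePath)
        If oNodes.length > 0 Then
            ReDim aResult(oNodes.length - 1)
            For i = 0 To oNodes.length - 1
                aResult(i) = oNodes.item(i).Text
            Next
        End If
    End If
    Set oNodes = Nothing
    Set oXMLDoc = Nothing
    GetTags = aResult
End Function

' Usage: list the src attribute of every image inside a table cell.
' XMLcode is assumed to hold the XHTML produced by the XHTML() function.
Dim aImages, n
aImages = GetTags(XMLcode, "//td/img/@src")
For n = 0 To UBound(aImages)
    Response.Write Server.HTMLEncode(aImages(n)) & "<br>"
Next
```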

Combining the two functions

Now I had my three main functionalities covered: retrieving a page, converting it to XHTML, and querying the result with XPath.

Using these three it was relatively easy to write some code to process this into what I needed. Without getting too much into details, here's what I was able to construct using these technologies:

As you can see, this can be put to very cool use. Please feel free to experiment a little; for your convenience I have put the XPath queries for allposters.com below. You can try them out on any product page, for example
this page
Here are the XPath queries:

name = GetTag(XMLcode,"//span")
image = GetTag(XMLcode,"//td/img/@src")
description = GetTag(XMLcode,"//td[@style='line-height: 18px;']/b")
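For completeness, here is a rough sketch of how the pieces could be chained. The GetPage function is a hypothetical stand-in for the retrieval code from the earlier article (which is not repeated here); it assumes the MSXML2.ServerXMLHTTP.4.0 component that ships with MSXML 4. The URL is a placeholder, and XHTML() and GetTag() are the functions shown above.

```vbscript
' Hypothetical fetch helper, standing in for the code from the earlier
' article. Returns the page body, or "" when the request did not succeed.
Function GetPage(sUrl)
    Dim oHttp
    Set oHttp = Server.CreateObject("MSXML2.ServerXMLHTTP.4.0")
    oHttp.open "GET", sUrl, False        ' False = synchronous request
    oHttp.send
    If oHttp.status = 200 Then
        GetPage = oHttp.responseText
    Else
        GetPage = ""
    End If
    Set oHttp = Nothing
End Function

Dim sUrl, HTMLcode, XMLcode, name, image, description
sUrl = "http://www.example.com/product.html"   ' placeholder URL

HTMLcode = GetPage(sUrl)
If HTMLcode <> "" Then
    ' Step 1: strip ampersands, step 2: convert to XHTML with TidyCOM
    XMLcode = XHTML(Replace(HTMLcode, "&", ""))

    ' Step 3: pull out the interesting bits with XPath
    name = GetTag(XMLcode, "//span")
    image = GetTag(XMLcode, "//td/img/@src")
    description = GetTag(XMLcode, "//td[@style='line-height: 18px;']/b")

    Response.Write Server.HTMLEncode(name) & "<br>"
    Response.Write Server.HTMLEncode(image) & "<br>"
    Response.Write Server.HTMLEncode(description)
End If
```

Keep in mind that queries like these are tied to the layout of one particular site, so they will need adjusting whenever that site changes its HTML.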

Advantages

Well, as mentioned before, the greatest advantage is being able to use standard XML tools and methods for processing your data. XML pretty much allows you to find any detail you want, attributes or node values, then combine them, filter them and/or pass them on to other processors. Pretty cool stuff.

Disadvantages

Of course, there are always disadvantages too. The data mining method described above DOES require you to register some components on the server you're working on, namely TidyCOM and MSXML4. Also, TidyCOM broke on the ampersand that was inside the script tags; maybe when I use the Tidy COM/ATL wrapper I can permanently fix this.