Web data mining using Tidy and XPath
Mining with a new drill
This is the follow-up to web agents and web automation in ASP. While rewriting babezone.org I came across my original data-mining code. The method I use there is crude, but effective. Of course, I wanted refined and effective. So this is part two of "datamining in ASP"... Hold on to your chairs, kids.
After writing my first ASP functions for retrieving and filtering webpages I was pretty happy. They worked, albeit sometimes a little quirkily. I found that the filtering functions weren't always sophisticated enough to pick out the exact information I needed, so I went looking on the internet for similar solutions.
That's when I came across this excellent paper by Jussi Myllymaki and Jared Jackson, both researchers at IBM. Their technique is similar to mine, but their method is much more reliable (read the paper for more details).
In short, they:
- retrieve a webpage
- use Tidy to convert the page to solid XHTML
- establish a tag in the document from where they start mining
- get the information they want, relative to this base tag
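To make the last two steps concrete, here is a minimal sketch. It is written in Python rather than ASP purely for illustration, and it is not the code from the IBM paper. It assumes the page has already been retrieved and tidied into well-formed XHTML; the sample markup, the "Price" label and the function name value_next_to are all made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-tidied XHTML standing in for a retrieved page.
SAMPLE_XHTML = """<html><body>
<table>
  <tr><td>Title</td><td>Blue Poster</td></tr>
  <tr><td>Price</td><td>9.99</td></tr>
</table>
</body></html>"""


def value_next_to(xhtml, label):
    """Find the cell whose text equals `label` (the base tag), then
    return the text of the cell that follows it in the same row."""
    root = ET.fromstring(xhtml)
    for row in root.iter("tr"):
        cells = row.findall("td")
        # The last cell has no neighbour to its right, so skip it.
        for i, cell in enumerate(cells[:-1]):
            if (cell.text or "").strip() == label:
                # Step relative to the base tag: the next sibling cell.
                return (cells[i + 1].text or "").strip()
    return None  # the base tag was not found


print(value_next_to(SAMPLE_XHTML, "Price"))
```

The point of anchoring on a label like this is that the lookup keeps working when rows are added or reordered, as long as the label itself stays put.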
They use WHAT??
Tidy (also known as HTML Tidy) is a piece of software that takes HTML and cleans it up for you. It makes sure all tags are neatly closed and properly nested, and it lays out the code for you. Best of all, Tidy is now an open source project, so projects based on it are popping up everywhere. Tidy is also capable of converting HTML into XHTML, which is an XML variant of HTML. The great advantage of this is that, once Tidy has converted a document, we can process it with more or less standard XML methods. I imagined this would be an excellent improvement to my own filtering functions, so I began carefully figuring things out.
Converting Jussi and Jared's concept to ASP/VBscript
The method described in the IBM paper uses Java. Of course, I'm using ASP on my website, so I needed a way to use Tidy from a scripting environment. Because Tidy is open source, there are a lot of variants floating around, and after a while I found what I thought would solve my problems: a COM/ATL wrapper for Tidy. First, let me explain the difference between a DLL and a COM DLL (which I found out the hard way):
A DLL contains functions and classes that you can use from a programming language by calling them from your code. A COM DLL is a DLL that you can register in Windows. Once registered, a COM DLL exposes some of its functions for use from, say, ASP =).
The documentation of the COM/ATL wrapper was focused heavily on implementation in .NET. I struggled for a while, but I was unable to use it properly in ASP. The problem was that I found it impossible to set the options using the method "SetOptBool". I used the code below:
Dim Tidy,ok
Set Tidy = Server.CreateObject("Tidy.Document")
ok = Tidy.SetOptBool ("TidyForceOutput", 1)
I tried all sorts of variations, such as:
Tidy.SetOptBool 1,TidyForceOutput
Tidy.SetOptBool 1,"TidyForceOutput"
Tidy.SetOptBool ("TidyForceOutput", 1)
Tidy.SetOptBool ("TidyForceOutput", True)
Tidy.SetOptBool "TidyForceOutput", 1
None of them worked. This was a pretty big setback, because I needed to configure Tidy to output XML code. After looking a bit more, I found an alternative component that contained Tidy. It is a bit older, but for my purposes it would do. The component I found was TidyCOM.
Now TidyCOM fit the bill perfectly. It was configurable (at last!), and although it lacks a lot of the functionality of the newest versions of Tidy, it did what it should: convert HTML to X(HT)ML.
Update:
In the meantime, I received a mail from Claude Henchoz, who has actually found a workaround for this problem!
Here's what he mailed me:
I wanted to do the exact same thing and could also not figure out how to use it. But I've found a workable workaround:
Use an external tidy-configfile.
I have this file:
tidyconf.txt
---------------------
output-xhtml: yes
numeric-entities: yes
---------------------
and then I include that in the VBScript portion like this:
iStatus = oTidy.LoadConfig("tidyconf.txt")
You might want to add this to your page, because nothing like this is documented anywhere.
Cheers, Claude
I haven't tried it myself, but for anyone out there struggling: it can't hurt to give it a go. Thanks, Claude!
Tidying things up
After registering TidyCOM on my development server and fiddling for a while, I got it to work.
The code below takes a piece of HTML (for example, the output of the code you'll find under "ASP - POST and GET to other sites and retrieve returned HTML" in the earlier article, web agents and web automation in ASP) and converts it to XHTML.
Function XHTML(code)
    ' Convert a string of HTML into XML using TidyCOM
    Dim TidyObj
    Set TidyObj = Server.CreateObject("TidyCOM.TidyObject")

    Dim tmp
    TidyObj.Options.Doctype = "loose"
    TidyObj.Options.DropFontTags = True
    TidyObj.Options.OutputXml = True ' set the output type to XML
    TidyObj.Options.QuoteAmpersand = True
    TidyObj.Options.CharEncoding = 0 ' needed for pages with characters beyond ASCII (127)
    TidyObj.Options.Indent = 2 ' AutoIndent
    TidyObj.Options.TabSize = 8
    tmp = TidyObj.TidyMemToMem(code)
    Set TidyObj = Nothing
    XHTML = tmp
End Function
As you can see, TidyCOM is quite easy to use. I ran into two problems that you'll have to keep in mind:
- TidyCOM breaks on ampersands (&) if they are inside JavaScript script tags (mostly in URLs)
- TidyCOM needs CharEncoding set to 0 if the page contains characters beyond ASCII (above 127)
I solved the first problem by filtering out all ampersands with the standard VBScript "Replace()" function before handing the page to TidyCOM.
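Removing every ampersand also damages query strings and entities that were fine to begin with. A gentler alternative, which is a different approach from the Replace() call described above and has not been tested against TidyCOM, is to escape only the bare ampersands. Here is a rough sketch in Python (not ASP); the function name and the pattern are illustrative and not part of TidyCOM.

```python
import re

# Matches an ampersand that does NOT already start an entity
# such as &amp; or &#38; or &#x26;
BARE_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)


def escape_bare_ampersands(html):
    """Replace stray '&' with '&amp;' and leave existing entities alone."""
    return BARE_AMP.sub("&amp;", html)


print(escape_bare_ampersands('<a href="page.asp?a=1&b=2">Tom &amp; Jerry</a>'))
```

The same idea should carry over to VBScript through its RegExp object, but that port is left as an exercise and is an assumption on my part.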
Now just follow the yellow brick Xpath
Well, now that we have our document as XML, we have a lot of options for finding whatever information we need. If you want to, you can load the document as a DOM and traverse the DOM tree looking for the element you need, but I tried to stay close to what the IBM paper suggested. So XPath it was. I wrote a simple function that takes a piece of XML and an XPath expression as arguments and returns the value of the element that the path specifies.
Function GetTag(sXML, NodePath)
    ' Return the text of the first node that matches the XPath in NodePath
    Dim oXMLDoc, tXML
    Dim oXMLNode
    Set oXMLDoc = Server.CreateObject("MSXML2.DOMDocument.4.0")
    ' Strip ampersands so a stray & cannot break the parser
    tXML = Replace(sXML, "&", "")
    oXMLDoc.loadXML tXML
    If oXMLDoc.parseError.errorCode <> 0 Then
        GetTag = "PARSE ERROR LINE " & oXMLDoc.parseError.line & _
                 " REASON: " & oXMLDoc.parseError.reason
    Else
        Set oXMLNode = oXMLDoc.selectSingleNode(NodePath)
        If oXMLNode Is Nothing Then
            ' No node matched the XPath
            GetTag = ""
        Else
            GetTag = oXMLNode.Text
        End If
    End If
    Set oXMLNode = Nothing
    Set oXMLDoc = Nothing
End Function
Please note that I use MSXML 4 here, where selectSingleNode understands XPath out of the box (MSXML 3 defaults to the older XSL pattern syntax unless you set its SelectionLanguage property to "XPath"). MSXML4 can be found here. If the document doesn't parse for some reason, the function returns the line number and the reason. If it does parse, the function returns the value of the XPath path (can I say that?).
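For readers without MSXML at hand, the same behaviour can be approximated with Python's standard library. This is only a sketch under a few assumptions: ElementTree supports just a subset of XPath, so the helper below handles a trailing /@attribute step itself and only accepts paths that start with //. The name get_tag and the sample page are made up for the example.

```python
import xml.etree.ElementTree as ET


def get_tag(xml_text, node_path):
    """Return the text (or attribute value) of the first node matching
    node_path, or a message describing what went wrong."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, _column = err.position
        return "PARSE ERROR LINE %d REASON: %s" % (line, err)

    # ElementTree cannot select attributes, so split off a trailing /@name.
    path, sep, attribute = node_path.rpartition("/@")
    if not sep:
        path, attribute = node_path, None

    # ElementTree wants relative paths: turn //span into .//span.
    # Paths starting with a single / are not handled by this sketch.
    if path.startswith("//"):
        path = "." + path

    node = root.find(path)
    if node is None:
        return "NO MATCH FOR " + node_path
    if attribute is not None:
        return node.get(attribute, "")
    # Like the .Text property in MSXML: all text inside the node.
    return "".join(node.itertext())


# Hypothetical page fragment, already converted to well-formed XML.
page = ('<html><body><table><tr>'
        '<td><img src="poster.jpg" /></td>'
        '<td style="line-height: 18px;"><b>A made-up description</b></td>'
        '</tr></table><span>Sample Poster</span></body></html>')

print(get_tag(page, "//span"))
print(get_tag(page, "//td/img/@src"))
print(get_tag(page, "//td[@style='line-height: 18px;']/b"))
```

Unlike a bare selectSingleNode call followed by .Text, this version reports a path that matches nothing instead of failing on a missing node.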
Combining the two functions
Now I had my three main pieces of functionality covered:
- HTTP POST and GET from ASP (see the previous page)
- Tidy in ASP to process the HTML and convert it to XML
- MSXML4 for processing the XML with XPath
Using these three, it was relatively easy to write code that turns a page into the data I needed. Without getting too much into details, here's the workflow I was able to build with these technologies:
- I open my maintenance tools for babezone and, in a different window, allposters.com
- I search allposters.com and drag and drop a poster thumbnail from the results into my maintenance tools
- My tools detect the drag and drop and retrieve the URL it points to (the thumbnail I drag is actually a link)
- The HTML is retrieved by MSXML and processed into XHTML by Tidy
- For every bit of information I need from the allposters page I wrote an XPath query
- The values I get back from XPath are put into a web form, so I can double-check them
- All I have to do is press 'submit' and a new poster is added to babezone
As you can see, this can be put to very cool use. Please feel free to experiment a little; for your convenience I have put the XPath queries for allposters.com below. You can try them out on any product page, for example this page.
Here are the XPath queries:
name = GetTag(XMLcode,"//span")
image = GetTag(XMLcode,"//td/img/@src")
description = GetTag(XMLcode,"//td[@style='line-height: 18px;']/b")
Advantages
Well, as mentioned before, the greatest advantage is being able to use standard XML tools and methods for processing your data. XPath pretty much allows you to find any detail you want, whether attributes or node values, and then combine them, filter them and/or pass them on to other processors. Pretty cool stuff.
Disadvantages
Of course, there are always disadvantages too. The data-mining method described above DOES require you to register some components on the server you're working on, namely TidyCOM and MSXML4. Also, TidyCOM broke on the ampersand inside the script tags; maybe when I use the Tidy COM/ATL wrapper I can fix this permanently.