XHTML Validation Script using Ruby

In one of the projects I'm working on we produce a number of XHTML documents and we want these documents to be valid XHTML 1.0 Strict. As an example of how automation will set you free, I promptly decided that there was no way I would be submitting several dozen documents to the W3C XHTML validator by hand.

Instead of looking for a web service or something else that could provide that validation, I thought it would be more interesting and educational for me to try to automate the usage of the W3C validator using some Ruby goodness. I know, there are probably quicker ways of doing that, but I want to get better at Ruby, so sue me.

I'll do my best to explain how the script works; I don't think it turned out too complicated. If anyone has tips for improving it, I'll be glad to hear them and learn even more.

The W3C Markup Validation Service page, as of this writing, offers the option of uploading a file and having it validated. It's a simple HTTP form POST to the URL http://validator.w3.org/check . The only not-so-trivial task is how to post a file form field. This is when I thought I would probably need this type of code again in the future, so let's just write it in a separate file to reuse later. After some research and trials I ended up with the following helper file, called form_post.rb (I'll dissect it piece by piece afterwards).

require 'rubygems'
require 'mime/types'
require 'net/http'
require 'cgi'
 
class FormField
  attr_accessor :name, :value
  def initialize( name, value )
    @name = name
    @value = value
  end
 
  def to_form_data
    field = CGI::escape(@name)
    "Content-Disposition: form-data; name=\"#{field}\"" + 
      "\r\n\r\n#{@value}\r\n"
  end
end
 
class FileField
  attr_accessor :name, :path, :content
  def initialize( name, path, content )
    @name = name
    @path = path
    @content = content
  end
 
  def to_form_data
    "Content-Disposition: form-data; " + 
    "name=\"#{CGI::escape(@name)}\"; " + 
    "filename=\"#{@path}\"\r\n" +
    "Content-Transfer-Encoding: binary\r\n" +
    "Content-Type: #{MIME::Types.type_for(@path).first}" + 
    "\r\n\r\n#{@content}\r\n"
  end
end

class MultipartPost
  SEPARATOR = 'willvalidate-aaaaaabbbb0000'
  REQ_HEADER = {
      "Content-type" => "multipart/form-data; boundary=#{SEPARATOR}"
   }
 
  def self.build_form_data ( form_fields )
    fields = []
    form_fields.each do |key, value|
      if value.instance_of?(File)
        fields << FileField.new(key.to_s, value.path, value.read)
      else
        fields << FormField.new(key.to_s, value)
      end
    end
    fields.collect {|f| "--#{SEPARATOR}\r\n#{f.to_form_data}" }.join("") + 
         "--#{SEPARATOR}--"
  end
end

Right at the top, we see:

require 'rubygems'
require 'mime/types'
require 'net/http'
require 'cgi'

This is roughly equivalent to assembly references you have in your Visual Studio projects. We are just saying that we will need each of the listed libraries. Just like the .Net Framework, Ruby comes with a wealth of core and utility classes, organized in libraries. The rest of the code in this file will use classes and modules defined in these libraries.

Then comes the FormField class, which represents one simple form field, a name/value pair basically.

class FormField
  attr_accessor :name, :value
  def initialize( name, value )
    @name = name
    @value = value
  end
 
  def to_form_data
    field = CGI::escape(@name)
    "Content-Disposition: form-data; name=\"#{field}\"\r\n\r\n#{@value}\r\n"
  end
end

I won't explain the details of the class declaration syntax because I think Joe Ocampo already did a good job at that (link). Our FormField class has two properties FormField#name and FormField#value (see how we refer to the instance properties and methods in Ruby? We use the Class#method notation.), which represent a form field with name and its value, but only for simple input fields, not a file field yet.

The FormField#to_form_data method (again, note the Ruby convention to have methods in lower case, words separated by underscores) converts the name/value pair into the appropriate HTTP form data POST format. CGI::escape is simply a class method (static method in C# terms) that escapes any special characters in the field name.

After that we just return a string with the expected form data layout. In Ruby, the return statement is optional: if none is used, the return value of a method is the last evaluated expression, which is the string in our case. But wait, there's something interesting in this string. Do you see #{field} and #{@value}? These will be automatically replaced by the values of field and @value, respectively. You can use any expression that is in scope, and the substitution is done via a process called String Interpolation. This works with double-quoted strings but not with single-quoted ones (other delimiters can also be used in Ruby to denote string literals).
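
As a small illustration of both points, here is a sketch using a made-up helper called form_line (it is not part of form_post.rb). It relies on the implicit return value and compares a double-quoted string with a single-quoted one.

```ruby
# Hypothetical helper, used only to illustrate the two ideas above.
def form_line(name, value)
  # No explicit return: the last evaluated expression (this string) is returned.
  "name=\"#{name}\" value=#{value}"
end

interpolated = form_line("charset", 1 + 1)  # any expression can be passed or interpolated
literal      = 'name=#{name}'               # single quotes: no interpolation happens

puts interpolated   # name="charset" value=2
puts literal        # name=#{name}
```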

OK, now on to the next class, FileField.

class FileField
  attr_accessor :name, :path, :content
  def initialize( name, path, content )
    @name = name
    @path = path
    @content = content
  end
 
  def to_form_data
    "Content-Disposition: form-data; " + 
    "name=\"#{CGI::escape(@name)}\"; " + 
    "filename=\"#{@path}\"\r\n" +
    "Content-Transfer-Encoding: binary\r\n" +
    "Content-Type: #{MIME::Types.type_for(@path).first}" + 
    "\r\n\r\n#{@content}\r\n"
  end
end

After seeing the FormField class, the FileField class becomes easier to understand. It represents one file that we want to include in the form posting as a file input field. It has the field name, the file path, and the file contents. The FileField#to_form_data also converts the file information to the appropriate posting format.

This leads us to the last class in this file.

class MultipartPost
  SEPARATOR = 'willvalidate-aaaaaabbbb0000'
  REQ_HEADER = {
      "Content-type" => "multipart/form-data; boundary=#{SEPARATOR}"
  }
 
  def self.build_form_data ( form_fields )
    fields = []
    form_fields.each do |key, value|
      if value.instance_of?(File)
        fields << FileField.new(key.to_s, value.path, value.read)
      else
        fields << FormField.new(key.to_s, value)
      end
    end
    fields.collect {|f| "--#{SEPARATOR}\r\n#{f.to_form_data}" }.join("") + 
      "--#{SEPARATOR}--"
  end
end

The MultipartPost class starts by defining two constants: SEPARATOR, a String, and REQ_HEADER, a Hash holding the request header. The simple fact that these identifiers start with an upper case character makes them constants. From outside the class's code these constants are accessed by prefixing them with the class name, as in MultipartPost::SEPARATOR. Then comes the interesting part, the MultipartPost.build_form_data method (by the way, that is the notation for class methods). In this method, we are passed a Hash (like a Hashtable in .Net) containing all the fields that we want to post. We start the method by declaring a variable called fields as an empty Array.
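
To see constants and class methods in isolation, here is a minimal sketch. The Envelope class and everything in it are invented for illustration; they only mirror the shape of MultipartPost.

```ruby
# Hypothetical class, mirroring the shape of MultipartPost.
class Envelope
  BOUNDARY = 'example-boundary'                   # a String constant
  HEADERS  = { 'X-Example' => "b=#{BOUNDARY}" }   # a Hash constant

  # "def self.name" defines a class method (a static method in C# terms).
  def self.wrap(text)
    "--#{BOUNDARY}\r\n#{text}\r\n--#{BOUNDARY}--"
  end
end

puts Envelope::BOUNDARY             # constants are reached with ::
puts Envelope::HEADERS.class        # Hash
puts Envelope.wrap('hello').inspect # class methods are called on the class itself
```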

Almost anything that can be enumerated over in Ruby will have an each method, which is a common way of performing for loops and also one of the most fundamental Ruby idioms. When we iterate over a Hash with Hash#each, we provide a block of code, isolated in the snippet below.

    do |key, value|
      if value.instance_of?(File)
        fields << FileField.new(key.to_s, value.path, value.read)
      else
        fields << FormField.new(key.to_s, value)
      end
    end

Now that C# 2.0 has anonymous methods and C# 3.0 has lambdas, explaining the above code is probably easier than it was in the .Net 1.x days. This code is more or less the equivalent of the following C# syntax.

(key, value) => {
    FileInfo fi = value as FileInfo;
    if(fi != null)
        arrayList.Add(new FileField(key.ToString(), fi.FullName, File.ReadAllText(fi.FullName)));
    else
        arrayList.Add(new FormField(key.ToString(), value));
}

What this code is doing is checking each key/value pair in the Hash. If the value is an instance of the File class, a new instance of FileField is added to the fields array; otherwise a new FormField is added.

At this point the fields array will be a mixed bag of FileField and FormField objects, which would be very undesirable in C#, since these classes are not related to each other. Not so much in Ruby. The last line in this method goes to town with that array.

fields.collect {|f| "--#{SEPARATOR}\r\n#{f.to_form_data}" }.join("") + 
     "--#{SEPARATOR}--"

The Array#collect method (same as Array#map) will convert each item in the array to a pre-formatted string, producing another array with all these strings. Then the Array#join method is called to concatenate all these strings. The end result is a long string with all the form fields formatted appropriately. One thing to note in the array conversion is that the to_form_data method is called for both FormField and FileField items even though the method is not defined in a common base class. That's the power of Duck Typing.
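
To make the duck typing concrete, and to show what the assembled text looks like, here is a self-contained sketch. TextPart, UploadPart and the boundary value are made up for this example; they are simplified stand-ins, not the classes from form_post.rb.

```ruby
# Two unrelated, hypothetical classes that both respond to to_form_data.
class TextPart
  def initialize(name, value)
    @name, @value = name, value
  end

  def to_form_data
    "Content-Disposition: form-data; name=\"#{@name}\"\r\n\r\n#{@value}\r\n"
  end
end

class UploadPart
  def initialize(name, filename, content)
    @name, @filename, @content = name, filename, content
  end

  def to_form_data
    "Content-Disposition: form-data; name=\"#{@name}\"; " +
      "filename=\"#{@filename}\"\r\n" +
      "Content-Type: text/html\r\n\r\n#{@content}\r\n"
  end
end

boundary = 'example-boundary'   # made-up value
parts = [TextPart.new('group', '0'),
         UploadPart.new('uploaded_file', 'page.html', '<html></html>')]

# collect calls to_form_data on each item without caring about its class.
body = parts.collect { |p| "--#{boundary}\r\n#{p.to_form_data}" }.join("") +
       "--#{boundary}--"

puts body
```

Printing body shows each field introduced by a line with the boundary, followed by its headers, a blank line, and the value. The final boundary carries two extra dashes at the end.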

We still need to write the code to post a form to the W3C validator, but fear not, that will be very simple when we use the above classes. Here's code that will do the trick. Let's save it to a file called w3c_validation.rb.

require 'net/http'
require_relative 'form_post'

def post_file_to_w3c_validator(file_path, doc_type)
  query = MultipartPost.build_form_data(
        :uploaded_file  => File.new(file_path, 'r'),
        :charset        => '(detect automatically)',
        :doctype        => doc_type,   
        :group          => '0'
        )

  Net::HTTP.start('validator.w3.org') do |http|
    http.post2("/check", query, MultipartPost::REQ_HEADER)
  end
end

def valid_response?(w3c_response)
  html = w3c_response.read_body
  html.include? "[Valid]"
end

def w3c_valid?(file_path)
  resp = post_file_to_w3c_validator(file_path, 'XHTML 1.0 Strict')
  valid_response?(resp)
end

file = "c:/temp/invalid.html"
#file = "c:/temp/valid.html"
result = w3c_valid?(file)
puts "File is valid? #{result}"

Again, at the top of the file we declare which libraries we will need. Note that on line 2 we are asking for form_post, which happens to be the file we just examined. As long as you save both files in the same directory, this file will be able to reference the other one. That line makes all three classes we created available in this file.

The post_file_to_w3c_validator method uses MultipartPost.build_form_data to prepare the contents to post to http://validator.w3.org/check.

def post_file_to_w3c_validator(file_path, doc_type)
  query = MultipartPost.build_form_data(
        :uploaded_file  => File.new(file_path, 'r'),
        :charset        => '(detect automatically)',
        :doctype        => doc_type,   
        :group          => '0'
        )

  Net::HTTP.start('validator.w3.org') do |http|
    http.post2("/check", query, MultipartPost::REQ_HEADER)
  end
end

Remember that we said MultipartPost.build_form_data takes a Hash as its parameter? Well, this is how we create a Hash.

my_hash = { :key1 => value1, :key2 => value2 , ..... }

When the only (or the last non-block) parameter of the method is a Hash we can omit the curlies { } and make the hash look like a list of key/value pairs. Note how we are passing a File object in the :uploaded_file key. You may be asking "what's up with those colon-prefixed identifiers?" Well, they are Symbols, Ruby's way of creating interned strings. Think of them as string constants or "the name of something". They are used a lot as hash keys.
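
Here is a quick sketch of those two points. The describe method is a made-up helper that just reports the Hash it receives.

```ruby
# Hypothetical method that simply reports what it received.
def describe(fields)
  fields.collect { |key, value| "#{key.inspect} => #{value.inspect}" }.join(", ")
end

explicit = describe({ :charset => 'utf-8', :group => '0' })
implicit = describe(:charset => 'utf-8', :group => '0')   # curlies omitted

puts explicit
puts explicit == implicit        # both calls pass the same Hash
puts :charset.equal?(:charset)   # a given Symbol is always the same object
puts :charset.to_s               # Symbols convert to Strings with to_s
```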

Next up, the actual POST operation.

  Net::HTTP.start('validator.w3.org') do |http|
    http.post2("/check", query, MultipartPost::REQ_HEADER)
  end

I won't explain this code in much detail, but it should be easy to see that it posts the query data to the W3C validator URL. Net::HTTP.start returns the value of its block, and since this is the last statement in the method, the return value of the method will be the result of the call to Net::HTTP#post2, which happens to be a Net::HTTPResponse object.
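
If you want to inspect what such a POST carries without touching the network, one option is to build the request object by hand. This is only a sketch: it uses Net::HTTP::Post directly rather than post2, and the boundary and body are made-up placeholders.

```ruby
require 'net/http'

boundary = 'example-boundary'   # made-up value
headers  = { 'Content-Type' => "multipart/form-data; boundary=#{boundary}" }

# Build the request object only; nothing is sent over the network here.
request = Net::HTTP::Post.new('/check', headers)
request.body = "--#{boundary}--"

puts request.method            # POST
puts request.path              # /check
puts request['Content-Type']   # the header we passed in
puts request.body
```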

def valid_response?(w3c_response)
  html = w3c_response.read_body
  html.include? "[Valid]"
end

The valid_response? method, as indicated by the trailing question mark (yes, Ruby method names can end with a question mark), returns true or false. It takes one of those Net::HTTPResponse objects as a parameter and does a very low-tech analysis of the returned response text. It just checks if the text contains [Valid]. Hey, look, another method that ends in a question mark: String#include?.
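
Because valid_response? only calls read_body on its argument, it can be tried out offline with a stand-in object. In the sketch below, FakeResponse and the two sample page titles are invented for the test; the method itself is repeated so the snippet runs on its own.

```ruby
# FakeResponse is a stand-in for Net::HTTPResponse; only read_body is mimicked.
FakeResponse = Struct.new(:read_body)

def valid_response?(w3c_response)
  html = w3c_response.read_body
  html.include? "[Valid]"
end

# Made-up response bodies, just to exercise both outcomes.
ok  = FakeResponse.new("<title>[Valid] Markup Validation of page.html</title>")
bad = FakeResponse.new("<title>[Invalid] Markup Validation of page.html</title>")

puts valid_response?(ok)    # true
puts valid_response?(bad)   # false
```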

It's funny how the last method I will explain is the first one to be called. It's the w3c_valid? method. It takes in a file path, posts that file to the validator, and tests the response it gets back.

def w3c_valid?(file_path)
  resp = post_file_to_w3c_validator(file_path, 'XHTML 1.0 Strict')
  valid_response?(resp)
end

But all these are just a bunch of methods; we still need to invoke them. That's what the last few lines do.

file = "c:/temp/invalid.html"
#file = "c:/temp/valid.html"
result = w3c_valid?(file)
puts "File is valid? #{result}"

The call to puts prints the given string on the screen (standard output, to be more precise). The code is trivial: it passes a file path to the w3c_valid? method, then prints the result.

To run this code, first go to line 27 and change the file variable to point to some XHTML file you have (valid or invalid; note the forward slashes), then just execute:

ruby w3c_validation.rb

I will leave as an exercise to the reader the enhancement of the script to find all XHTML files in a given directory, process one by one, and log the result to the screen or to a text file.
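
As a starting point for that exercise, here is one possible sketch, not a finished solution. The validate_directory method and its parameters are hypothetical, and the block passed to it is a stand-in check so the snippet runs offline; in the real script the block would call w3c_valid?(path) instead.

```ruby
# Runs the given block for every file in dir matching pattern and returns
# a Hash of path => result. The method name and parameters are hypothetical.
def validate_directory(dir, pattern = '*.html')
  results = {}
  Dir.glob(File.join(dir, pattern)).sort.each do |path|
    results[path] = yield(path)
  end
  results
end

# Stand-in check (file is not empty) so the sketch needs no network access.
report = validate_directory('.', '*.html') { |path| File.size(path) > 0 }

report.each { |path, ok| puts "#{path}: #{ok ? 'valid' : 'INVALID'}" }
```

Taking the check as a block keeps the directory walking separate from the validation itself, so the same method can log to the screen or feed a text file.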

Note: You may get an error when you try to run the script because you don't have the mime/types library (a Ruby Gem). It's easy to install; just run this from the command line:
gem install mime-types

Note 2: I forgot to mention that the true/false validation output is only the first step. My idea is to parse the entire response and collect the errors and warnings.

Posted 04-02-2008 9:31 PM by sergiopereira

Comments

RookieProgrammer wrote re: XHTML Validation Script using Ruby
on 06-09-2010 5:11 PM

I'm trying to do this in C#.  But I can't figure out the actual text strings needed in the form post.  Can you display the actual text of the POST ?
