Some “web developers” say that they code their websites to be W3C-valid and that they dislike WYSIWYG editors. Even though this blog is certainly not W3C-valid, I’m one of those guys. The question, as stated in the title: why valid HTML? Why should web developers bother? How can it be used, and who benefits from it?
Let’s start at the beginning: the W3C (World Wide Web Consortium; W3 = WWW) is a consortium that sets the standards for various technologies on the web, for example HTML. HTML (HyperText Markup Language) is the language nearly every website is written in. Usually people want to work with standards and not against them. Anyway, here are some links: W3C, HTML at the W3C and XHTML/XHTML2 at the W3C.
1. The Spanish guy talking English – or: “Page optimized for browser xyz”
Imagine that a website is a speaker at some event. He’s speaking Spanish. Even if everything he says is correct (every native Spanish speaker understands him), nobody else will understand him. Back to the topic: now imagine someone optimizes your website for one particular browser. That’s like making this Spanish guy speak broken English: he will use the wrong words, and half of the English speakers won’t understand him correctly.
2. Parsing HTML #1
Have you ever tried to parse some HTML? It’s possible with a few techniques, for example with XPath and PHP, or with Ruby. Why would someone want to parse HTML? Well, I recently wanted to compare the domain prices of different companies and calculate an average. The problem was that half of the companies didn’t provide their price lists in TXT/CSV format (of course), so I had to work with their HTML code. Using XPath and PHP, that’s very easy. A little script could look like this:
<?php
// Collapse all whitespace (line breaks and tabs included) into single spaces.
function cleanString($string)
{
    $newstring = trim($string);
    $newstring = preg_replace('/\s+/', " ", $newstring);
    return $newstring;
}

$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// @ suppresses the warnings libxml emits for invalid markup
@$doc->loadHTMLFile('domains.html');
$xpath = new DOMXPath($doc);
// We start from the root element
$query = '//html/body/div/div/div/div/table/tbody/tr/td';
$entries = $xpath->query($query);
// Initialize the result so that .= does not hit an undefined variable
$string = '';
foreach ($entries as $entry) {
    $string .= cleanString($entry->textContent)."<br />";
}
$string = preg_replace('/^(<br \/>){9}$/im', '', $string);
echo $string;
$array = explode("<br />", $string);
var_dump($array);
?>
The problem with this script is that it doesn’t work properly when the source code of the HTML file is invalid. That can result in a wrong charset, missing information and similar problems. (By the way, the question here wasn’t whether the companies want their data to be fetchable.)
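Once the cell texts are extracted, the averaging step mentioned above still has to happen. Here is a minimal sketch of it in Ruby, assuming the prices show up as cells like "9.90 EUR". The cell values and the helper names clean_string, parse_price and average_price are made up for this example and are not part of the original script.

```ruby
# Collapse every run of whitespace into a single space and trim the ends.
def clean_string(text)
  text.gsub(/\s+/, " ").strip
end

# Return the leading decimal number of a cell as a Float, or nil if the
# cell does not start with one. Both "." and "," count as decimal separators.
def parse_price(text)
  match = clean_string(text).match(/\A(\d+)[.,](\d{1,2})\b/)
  match ? Float("#{match[1]}.#{match[2]}") : nil
end

# Average of all cells that look like a price; nil if there are none.
def average_price(cells)
  prices = cells.map { |cell| parse_price(cell) }.compact
  return nil if prices.empty?
  prices.sum / prices.size
end

# Hypothetical cell texts, made up for this example.
cells = ["  .com \n", "9.90 EUR", ".net", "\t11.50 EUR", ".org", "10,00 EUR"]
puts format("%.2f", average_price(cells))
```

Cells that don't start with a number (the domain names here) are simply skipped, which is exactly the kind of guess that goes wrong when invalid markup shifts text into the wrong cell.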
3. Parsing HTML #2
Another case where you might want to parse HTML is templating. Imagine 40 folders, each containing 2 to 10 files of 10 to 500 lines of code, and you need to replace every text in these HTML files with a variable, for example {$foobar}. A script could do this job automatically, for example with Ruby (thanks to Sven for helping):
#!/usr/bin/ruby
require 'rubygems'
require 'hpricot'

# Build the variable prefix from the parent directory and the file name.
filepath = ARGV[0]
parts = filepath.split("/")
filename = parts[parts.size - 1].split(".")[0]
last_dir = parts[parts.size - 2]
@var_prefix = last_dir + "_" + filename.gsub("_", "-")

file_contents = ""
File.open(filepath, "r") { |f|
  file_contents = f.read
}
doc = Hpricot(file_contents)

@texts = Array.new
@global_counter = 0

# Returns the next placeholder, e.g. {folder_filename_var00}
def generate_next_variable_name()
  number = @global_counter
  @global_counter = @global_counter + 1
  return "{"+@var_prefix+"_var"+sprintf("%02d", number)+"}"
end

# Walk the tree and swap every non-empty text node for a placeholder.
def traverse(root)
  root.each_child do |elem|
    if elem.text?
      text_of_elem = elem.to_s
      if not text_of_elem.strip.gsub("\n", "").gsub("\r", "").gsub("\t", "").empty? and
         text_of_elem.length > 1 and
         not text_of_elem[/<input/] and
         not text_of_elem[/<option/] and
         not text_of_elem[/<a/] and
         not text_of_elem[/[\[\]\{\}<>]./]
        #print "found: "+elem.to_s+"\n"
        var_name = generate_next_variable_name()
        @texts << { :text => elem.to_s, :var_name => var_name }
        elem.swap(var_name)
      end
    end
    if elem.elem?
      traverse(elem)
    end
  end
end

traverse(doc)
new_html = doc.to_s # contains the new HTML document with the variable replacements
#print new_html
@texts.each do |text|
  print '"'+text[:var_name]+'";"'+text[:text]+"\"\n"
end
#File.open(filepath, 'w') { |f| f.write(new_html) }
This script would print "{folder_filename_varXX}";"the text", but it’s not working. We had to skip a, input and option tags because with invalid HTML markup the parser doesn’t handle them correctly (we got unclosed a tags, for example). With mixed charsets you get even more problems, like spaces which can’t be removed.
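To see why an unclosed a tag is such a pain for a tree walker, here is a rough sketch of a tag balance check in plain Ruby. It is not what Hpricot does internally and it is far from a real validator (it ignores attributes containing ">", scripts, optional closing tags and so on). The method name unclosed_tags and the sample markup are made up for this example.

```ruby
# Void elements never take a closing tag in HTML.
VOID_TAGS = %w[area base br col embed hr img input link meta source track wbr].freeze

# Returns the names of tags that were left open, plus stray closing tags
# (prefixed with "/"), in the order they are detected.
def unclosed_tags(html)
  stack = []
  problems = []
  html.scan(/<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(\/?)>/) do |closing, name, self_closing|
    name = name.downcase
    next if VOID_TAGS.include?(name) || self_closing == "/"
    if closing.empty?
      stack.push(name)                # opening tag
    elsif stack.last == name
      stack.pop                       # properly nested closing tag
    else
      idx = stack.rindex(name)
      if idx
        # Everything opened after the matching tag was never closed.
        problems.concat(stack.slice!((idx + 1)..-1))
        stack.pop
      else
        problems << "/#{name}"        # closing tag without an opening tag
      end
    end
  end
  problems + stack
end

p unclosed_tags('<p>See <a href="/prices">our prices</a></p>')
p unclosed_tags('<p>See <a href="/prices">our prices</p>')
```

The first call reports nothing, the second one reports the a tag. A parser that meets the second document has to guess where the link ends, and every parser guesses differently, which is why the text nodes end up in unexpected places.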
4. Different Browsers, Different Approaches & Search Engine Optimization
But what does all this “special” coding stuff have to do with valid HTML? Every browser displays the page nicely… wrong! Instead of making some screenshots of my own, I’ll just point you to this page (scroll down): browser-comparison. Even if two or three of the tested browsers work fine, this does not automatically mean the page works in _every_ browser (and there are quite a lot). Now think about search engines: how important is it for you that they can index your website? A website usually has 5-50 subpages, and a search engine has to parse the markup of every one of them, so if you’re really into search engine optimization, valid HTML will be important for you (it’s also the cheapest way to improve your website).
5. Who benefits?
Last question: who benefits from a valid website? Everyone. Everyone who uses a browser, everyone who needs to work with those HTML files, and everyone who uses special software to access the web (blind people, for example). Is your page valid? Check it here: validator.
By the way, for some of you this snippet might be interesting:
<?php
ob_start();
// now some other php stuff or include some html stuff, or write some html stuff
?>
<html>
<head>
<title>bla</title>
</head>
<body>
<p>Hello World</p>
</body>
</html>
<?php
// the HTML above is not displayed; instead it ends up in $html
$html = ob_get_contents();
ob_end_clean();
$tidy = new tidy;
$tidy->parseString($html, array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 68,
    'bare' => true,
    'clean' => true,
    'drop-proprietary-attributes' => true,
    'logical-emphasis' => true,
    'word-2000' => true,
    'break-before-br' => true,
    'force-output' => true), 'UTF8');
$tidy->cleanRepair(); // apply the cleanup before printing
echo $tidy; // output cleaned html
?>
Take a look at HTML Tidy; just google for it.
