A More Semantic Web with Schema.org, The Open Graph Protocol and HTML5

One of the most important things for any modern business is its internet presence. If you’re not on the internet, or not active and visible on it, you might as well not exist to a large group of people. Search Engine Optimisation (SEO) is the process of improving one’s website so that it appears higher up the Google Search rankings, where more people are likely to find it.

At the same time, one of the most interesting aspects of modern software and services is their openness. Everyone from local councils to The Association of Train Operating Companies is opening up their data to the world, hoping that someone innovative, or with a different set of skills and resources, can make something they either couldn’t imagine themselves or didn’t have the time and money to build, for mutual benefit.

One possible enhancement to both SEO and openness for an organisation is to make its website semantic. The definition of semantics, according to The Oxford Dictionary, is:

The branch of linguistics and logic concerned with meaning. The two main areas are logical semantics, concerned with matters such as sense and reference and presupposition and implication, and lexical semantics, concerned with the analysis of word meanings and relations between them.

The main takeaway point is that things, in this case HTML markup for websites, have meaning. We need to make sure that the meanings we are making visible to the world actually mean what we want them to mean. A nice side-effect of this is that web pages become a lot easier to parse or screen-scrape and extract information from.

HTML5

Prior to HTML5, the best way to give meaning to an element was through its id or class attribute. So if you were to mark up a simple website with a header and a list of news stories, you might come up with something like this:

<div id="header">
	<h1>News Website</h1>
	<img src="logo.png" alt="logo"/>
</div>
<div id="newslist">
	<div class="story">
		<h2>News Title</h2>
		<p>Here is some exciting news!</p>
	</div>
	<div class="story">
		<h2>Another bit of news</h2>
		<p>A shame, as no news is good news!</p>
	</div>
</div>

Whilst this is relatively clean code, it does come with some issues. How is a screen reader or search engine spider meant to know the meaning of a “story” element, for example? It seems simple when viewed by a human being, but id and class names are free-form: there are countless names an author could choose to mean “story”, and none of them is standardized.

HTML5 provides some new semantic tags which allow us to bake meaning into the elements themselves. The example below simplifies and improves the previous code using them.

<header>
	<h1>News Website</h1>
	<img src="logo.png" alt="logo"/>
</header>
<main>
	<article>
		<h2>News Title</h2>
		<p>Here is some exciting news!</p>
	</article>
	<article>
		<h2>Another bit of news</h2>
		<p>A shame, as no news is good news!</p>
	</article>
</main>

This implementation allows a browser, spider or screen reader to accurately understand what each element is for, as the tag names have been standardized by the W3C. In case you’re wondering, the `<article>` tag is one of the things browsers like IE and Safari look for when deciding whether to offer a Reading View.

Wherever possible you should aim to use the semantic tags over generic tags such as `<div>`. It makes code easier to read in addition to being more semantically correct. A full list of the HTML5 semantic tags and their meanings can be found on DiveIntoHTML5.
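
As a rough illustration of some of the other semantic elements, the news page above could be extended along these lines. The navigation links, the date and the footer text are placeholders invented for this sketch rather than part of the original site.

```html
<header>
	<h1>News Website</h1>
	<!-- nav marks a block of major navigation links (these hrefs are made up) -->
	<nav>
		<a href="/">Home</a>
		<a href="/archive">Archive</a>
	</nav>
</header>
<main>
	<article>
		<h2>News Title</h2>
		<!-- time pairs a human-readable date with a machine-readable one (hypothetical date) -->
		<p>Published <time datetime="2015-01-01">1 January 2015</time></p>
		<p>Here is some exciting news!</p>
	</article>
	<!-- aside holds content only loosely related to the main content -->
	<aside>
		<h2>Related links</h2>
	</aside>
</main>
<!-- footer holds information about its section, such as authorship or copyright -->
<footer>
	<p>Placeholder footer text</p>
</footer>
```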

The Open Graph Protocol

Whilst I had been using HTML5 semantic elements for some time, I wanted to do more as part of the CS Blogs project both in terms of SEO and improving user experience through semantics.

I started with the Open Graph protocol. It was developed by Facebook to allow websites to integrate better with Facebook, both in the app and on the web. Other social media services also take advantage of Open Graph, including Pinterest, Twitter and even Google+.

The Open Graph protocol is implemented as a series of `<meta>` tags that you place in the head of your HTML pages. Each page can describe itself as identifying a Person, Movie, Song or other graph object using code such as that shown below for a blogger on csblogs.com:

<meta property="og:title" content="The Computer Science Blogs profile of Daniel Brown" />
<meta property="og:site_name" content="Computer Science Blogs"/>
<meta property="og:type" content="profile"/>
<meta property="og:locale" content="en_GB"/>
<meta property="og:image" content="https://avatars.githubusercontent.com/u/342035" />
<meta property="profile:first_name" content="Daniel"/>
<meta property="profile:last_name" content="Brown"/>
<meta property="profile:username" content="dannybrown"/>

As you can see, most Open Graph properties start with an `og:` prefix, except those particular to the type of content you are making available, which are prefixed with the type name instead (`profile:` in this case). The documentation for the available tags can be found on the Open Graph website.
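
To illustrate the type-specific prefix with a different type, a blog post page might be described roughly as follows. The `article` type and its `article:` properties come from the Open Graph documentation, which also lists `og:title`, `og:type`, `og:image` and `og:url` as the basic properties expected on every page. Every value here, including the example.com URLs and the date, is a placeholder made up for this sketch.

```html
<!-- Basic metadata, all using the og: prefix (placeholder values) -->
<meta property="og:title" content="A Blog Post"/>
<meta property="og:type" content="article"/>
<meta property="og:url" content="http://example.com/a-blog-post"/>
<meta property="og:image" content="http://example.com/image.png"/>
<!-- Properties specific to the article type use the article: prefix -->
<meta property="article:published_time" content="2015-01-01T12:00:00Z"/>
<meta property="article:tag" content="semantic web"/>
```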

This code will then be used by Facebook when someone links to that particular web page in their messages, or on their newsfeed. Here’s an example:

Open Graph element displayed on Facebook newsfeed

Whilst Open Graph is great for this purpose, it does have some limitations. Each page can only be of one type, and you cannot add semantics for more than one element. This is a problem for pages such as csblogs.com/bloggers, which represents multiple people.
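
One way to work within this limitation, sketched here as an assumption rather than as what CS Blogs actually does, is to give a listing page the generic `website` type and describe the page as a whole instead of any one person on it. The title, the URL scheme and the image path are placeholders.

```html
<!-- A page listing many bloggers cannot be a single "profile", so describe the page itself -->
<meta property="og:title" content="Bloggers on Computer Science Blogs"/>
<meta property="og:site_name" content="Computer Science Blogs"/>
<meta property="og:type" content="website"/>
<!-- Hypothetical URL scheme and image path, for illustration only -->
<meta property="og:url" content="http://csblogs.com/bloggers"/>
<meta property="og:image" content="http://example.com/placeholder-logo.png"/>
```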

Despite its limitations, it’s still worth implementing Open Graph on pages for which it makes sense, especially if those pages are likely to be shared on social media.

Facebook, as usual, have some great development tools for Open Graph, including the Open Graph Debugger, which allows you to see how Facebook interprets your page. Because Open Graph is a standard, it’ll also help you debug any issues with Pinterest, Twitter etc.

Schema.org

Schema.org is a standard developed in a weird moment of collaboration between the three search engine giants: Google, Microsoft and Yahoo. It allows you to specify the meaning of certain elements of content. You can technically do this using three different syntaxes (microdata, RDFa and JSON-LD); in this blog post I will focus on microdata, partly because it’s the easiest to understand, fits inline with your pages and is defined as part of the HTML standard, but also because, at the time of writing, it’s the only format fully supported by the Google search engine.

To begin with, here is the HTML5 structure of a blog post before it has been marked up with schema.org microdata. It should be pretty simple to understand if you’ve checked out the HTML5 semantic elements mentioned previously.

<article>
    <header>
        <h2><a href="http://dannybrown.net">A Blog Post</a></h2>
    </header>
    <img src="http://dannybrown.net/image.png" alt="Featured Image"/>
    <p>This is an excerpt... <a class="read-more" href="http://dannybrown.net">Read more →</a></p>
    <footer>
        <div class="article-info">
            <a class="avatar" href="/bloggers/dannybrown">
                <img class="avatar" src="http://dannybrown.net/danny.png" alt="Avatar"/>
            </a>
            <a class="article-author" href="/bloggers/dannybrown">Daniel Brown</a>
            <p class="article-date">1 day ago</p>
        </div>
    </footer>
</article>

In order to mark up our HTML with Schema.org we need to do a few things:

  1. Determine which Schema.org schema best suits the element we are describing.
  2. Determine the scope of that element.
  3. Add the microdata attributes to our HTML.

For our blog post example above the most relevant schema is BlogPosting. You can see all of the different types in a hierarchy at schema.org. The scope of the BlogPosting is the entire block contained within the `<article>` tags.

The scope of an item is declared on the opening tag of the enclosing element using the `itemscope` attribute. Read it as “every bit of microdata within this element is about one item”. When we define the `itemscope` we also need to give it its type, which is done with the `itemtype` attribute. The value of `itemtype` is the URL of the schema.org schema, in our case `http://schema.org/BlogPosting`.
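
As a minimal sketch of just this step, the opening tag of the blog post would carry the two attributes like so, with the contents left out for brevity:

```html
<!-- itemscope starts a new item; itemtype says which schema describes it -->
<article itemscope itemtype="http://schema.org/BlogPosting">
    <!-- any microdata properties declared in here belong to this one BlogPosting -->
</article>
```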

The values of the fields that make up our schema, for example the “headline” of a blog post, are either other schemas or the values of elements. Here’s a fully schema’d up blog post:

<article itemscope itemtype="http://schema.org/BlogPosting">
    <header>
        <h2 itemprop="headline"><a href="http://dannybrown.net">A semantic blog post</a></h2>
    </header>
    <img itemprop="image" src="http://dannybrown.net/image.png" alt="Featured Image"/>
    <p itemprop="articleBody">This is an excerpt... <a itemprop="url" class="read-more" href="http://dannybrown.net">Read more →</a></p>
    <footer>
        <div class="article-info">
            <div itemscope itemprop="author" itemtype="http://schema.org/Person">
                <a class="avatar" href="/bloggers/dannybrown">
                    <img class="avatar" itemprop="image" src="http://dannybrown.net/danny.png" alt="Avatar"/>
                </a>
                <a class="article-author" itemprop="sameAs" href="/bloggers/dannybrown"><span itemprop="givenName">Daniel</span> <span itemprop="familyName">Brown</span></a>
            </div>
            <p class="article-date" itemprop="datePublished">1 day ago</p>
        </div>
    </footer>
</article>

Here we can see that just by assigning an `itemprop` attribute to a tag, its content becomes the value of the named field: the text for most elements, or the `href` or `src` for links and images. We can also see that a Person schema can be nested inside our BlogPosting schema to give us a rich author ‘object’.
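
Where a property’s value comes from depends on the element it sits on. The fragment below is a small sketch of the common cases under the microdata rules. The `<time>` element and its date are hypothetical additions, included because a relative phrase such as “1 day ago” is not a machine-readable date on its own.

```html
<div itemscope itemtype="http://schema.org/BlogPosting">
    <!-- value is the element's text: "A semantic blog post" -->
    <h2 itemprop="headline">A semantic blog post</h2>
    <!-- value is the href attribute, not the link text -->
    <a itemprop="url" href="http://dannybrown.net">Read more</a>
    <!-- value is the src attribute -->
    <img itemprop="image" src="http://dannybrown.net/image.png" alt="Featured Image"/>
    <!-- value is the datetime attribute (made-up date), while readers still see "1 day ago" -->
    <time itemprop="datePublished" datetime="2015-01-01">1 day ago</time>
</div>
```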

One other thing worth noting here is that I elected to add `<span>` elements (which don’t change the visual layout of the HTML page) around the first and last names of the author so as to be able to correctly mark them up with `givenName` and `familyName` itemprops.

Any elements which you mark up with schema.org should be visible to the end user. Writing schema elements into your page and then hiding them via CSS or JavaScript can result in your search rankings being reduced, and could impair applications which rely on schema properties (for example a screen reader that used schema.org properties, although to my knowledge none do yet).

Google provides a debugger for Schema.org called the Structured Data Testing Tool, which was of great use whilst I was adding support to CS Blogs. The output for the home page of csblogs.com is shown below:

Google Structured Data Testing Tool Output

As you can see, using Schema.org means that the Google search engine can actually understand what is on the page, and therefore its semantic meaning. csblogs.com is therefore more likely to rank higher for searches that include the word “blog”, or for the names of the authors mentioned, for example.

Wrapping Up

Hopefully this blog post will have made you think about what you can do to make your websites more semantic — and therefore better for search engines, accessibility and in terms of openness. You can use all three of the technologies above at the same time, and I would implore you to do so. In return you’ll benefit from better Search Engine rankings, your users will benefit from better Social Media integration and screen reading for those with disabilities, and search engines can point people to web pages with a better understanding of what that page represents rather than just scanning for keywords.
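
To show how the three fit together, here is a stripped-down sketch of a single page using all of them. It is an illustration only: the example.com URLs and the text content are placeholders, and a real page would carry more properties than this.

```html
<!DOCTYPE html>
<html lang="en-GB">
<head>
    <meta charset="utf-8"/>
    <title>A semantic blog post</title>
    <!-- Open Graph: describes the page as a whole for social media (placeholder values) -->
    <meta property="og:title" content="A semantic blog post"/>
    <meta property="og:type" content="article"/>
    <meta property="og:url" content="http://example.com/a-semantic-blog-post"/>
    <meta property="og:image" content="http://example.com/image.png"/>
</head>
<body>
    <!-- HTML5 semantic elements give the page its structure -->
    <main>
        <!-- Schema.org microdata describes the individual item on the page -->
        <article itemscope itemtype="http://schema.org/BlogPosting">
            <header>
                <h1 itemprop="headline">A semantic blog post</h1>
            </header>
            <p itemprop="articleBody">Here is some exciting news!</p>
        </article>
    </main>
</body>
</html>
```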

Danny
