One thing we found missing in the Go ecosystem is a first-rate HTML/XML parsing library. The built-in libraries work well for strictly written, on-spec documents, but they fall flat when you try to parse realistic HTML pages. Other languages are spoiled by amazing libraries like Nokogiri and PyXml that help with these issues.

The solution? Use Go's brilliant cgo wrapper magic to build our own version of those libraries and bring it to the Go world. The result is the cleverly named Gokogiri (Get it?). Gokogiri is a lightweight wrapper around the awesome libxml2 library that makes working with libxml2 super easy.

Since we needed this in a high-performance system, speed and a small memory footprint were crucial factors in its design. We've been using it in production for about 6 months now and it's working amazingly well! Its memory usage is tight and it has been extremely stable.

To get Gokogiri, run the following on the command line:

$> go get github.com/moovweb/gokogiri

And here is a sample bit of code. Gokogiri can do a lot more, but we don't want to bore you with details:

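package main

import (
	"fmt"
	"log"

	"github.com/moovweb/gokogiri"
)

func main() {
	// Parse even this bare fragment (no doctype, <html>, or <body>) and make it valid
	html := "<h2>I am so malformatted</h2>"
	doc, err := gokogiri.ParseHtml([]byte(html))
	if err != nil {
		log.Fatal(err)
	}
	// The document is backed by libxml memory, so free it when done
	defer doc.Free()

	// Root is <html>, its first child is <body>, and that node's first child is our <h2>
	header := doc.Root().FirstChild().FirstChild()
	header.SetName("h1")

	fmt.Println(doc.String())
}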

If we execute this, the HTML parser fixes up the document and makes it valid. So, the output looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"  
"http://www.w3.org/TR/REC-html40/loose.dtd">  
<html><body><h1>I am so malformatted</h1></body></html>

The parser added the missing doctype, html, and body elements, the h2 changed to an h1, and this is now a valid document!

We also support XPath and CSS document searching and all sorts of other goodies. Don't let this simple example hide the fact that Gokogiri is an incredibly powerful library!

Check out the GitHub repo for the project at github.com/moovweb/gokogiri. If you have any suggestions, comments, or questions, please don't hesitate to file an issue on GitHub or contact us at platform-dev@dev-moovweb.pantheonsite.io.

Happy parsing!

  • Zhigang