Saturday, February 24, 2018

Using jsoup with Kotlin to parse HTML

Time and again, I have needed to parse HTML using Java, and I have hated it, partly because other languages have much better tools for the job.

Recently, I had one such need. I needed to:
  • Fetch HTML response from a URL
  • Parse it and scrape information from it
  • Dump it somewhere
  • Do all of the above using Kotlin
Yuck! Parse HTML in today's world? Unfortunately, there was no known public API that would return JSON or XML or something else. The information was only available as HTML, and the only way to get it was to parse and scrape the page.

With that in mind, I went looking for libraries that parse HTML from Java or Kotlin, and I stumbled upon jsoup.

It's a nice, lightweight library for parsing real-world HTML. The jsoup API is more or less similar to the jQuery API, which makes it a pleasure to use. Without wasting much time, let's jump right into the code.

How Do They Do It!

Let's say we have a simple requirement: parse the Google search result page and list all the result titles and URLs.

NOTE: I know that Google does expose a search API that returns JSON, but for the sake of this example just assume it doesn't.

jsoup can be included in many ways. Here's how we could include it via a Gradle file.
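A minimal sketch of such a dependency declaration, assuming the Gradle Kotlin DSL (a build.gradle.kts file) with a JVM or Kotlin plugin already applied. The version number is an assumption; substitute whichever jsoup release you actually want.

```kotlin
// build.gradle.kts (assumed file name; a Groovy build.gradle works equally well)
repositories {
    // jsoup is published to Maven Central.
    mavenCentral()
}

dependencies {
    // Version is an assumption for illustration; pick the release you need.
    implementation("org.jsoup:jsoup:1.11.2")
}
```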

Next up, let's write a simple test that does all of the things mentioned above. Here are the relevant parts of the code.
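A rough sketch of what that code might look like, not necessarily the exact code from the sample project. The names SearchResult, extractResults, fetch, and RESULT_LINK_SELECTOR are hypothetical, and the CSS selector is an assumption: Google changes its markup often, so adjust it to whatever the fetched page actually contains. To keep the sketch runnable without network access, main() parses a small inline HTML string instead of a live page.

```kotlin
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical holder for one scraped result; not taken from the original post.
data class SearchResult(val index: Int, val title: String, val url: String)

// Assumed selector: result links sitting directly inside <h3 class="r">.
// Treat it as a placeholder and adapt it to the real page structure.
const val RESULT_LINK_SELECTOR = "h3.r > a"

// Collect the link text and href of every element matching the selector,
// numbering the results from 1.
fun extractResults(doc: Document, selector: String = RESULT_LINK_SELECTOR): List<SearchResult> =
    doc.select(selector).mapIndexed { i, link ->
        SearchResult(index = i + 1, title = link.text(), url = link.attr("href"))
    }

// Fetch a live page over HTTP. Some sites reject the default Java user agent,
// so a browser-like one is set here.
fun fetch(url: String): Document =
    Jsoup.connect(url).userAgent("Mozilla/5.0").get()

fun main() {
    // Offline stand-in for a fetched page.
    val html = """
        <html><body>
          <h3 class="r"><a href="https://example.com/one">First result</a></h3>
          <h3 class="r"><a href="https://example.com/two">Second result</a></h3>
        </body></html>
    """.trimIndent()

    val results = extractResults(Jsoup.parse(html))
    results.forEach { println("${it.index}. ${it.title} -> ${it.url}") }

    // For a live page, swap in something like:
    // extractResults(fetch("https://www.google.com/search?q=jsoup"))
}
```

Keeping the parsing step separate from the fetching step means the selector logic can be exercised against saved or hand-written HTML, which is handy when the live page is slow, rate-limited, or changes shape.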

This prints the search index, title, and URL of each entry on the search result page. Here's the sample output:

That's about it!

PS: Here's the link to the sample code used in this post.
Have some Fun!