Read HTML with Java – Then 7 Fun Things to do to It

There are several ways to get the HTML content of a URL from Java, and even more if you use open source Java libraries. Last week Lars Vogel shared how to get the HTML using nothing but the SDK.

SDK

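// Requires java.io.BufferedReader, java.io.InputStream, java.io.InputStreamReader, java.net.URL
final URL url = new URL("http://blog.codehangover.com");
// Open the byte stream from the URL first, then wrap it in a character reader
final InputStream inputStream = url.openStream();
final BufferedReader reader
             = new BufferedReader(new InputStreamReader(inputStream));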

String line;

while ((line = reader.readLine()) != null) {
   System.out.println(line);
}

reader.close();
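
If you want the page as a single String instead of printing it line by line, something like the following might do. This is only a minimal sketch: the class name PageReader and its two methods are made up for illustration, and decoding as UTF-8 is an assumption, since the real page may declare a different charset.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the SDK
final class PageReader {

    /** Drains a Reader into a String, ending every line with '\n'. */
    static String readAll(Reader source) {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    /** Fetches a URL as text. UTF-8 is an assumption, not something the server promised. */
    static String fetch(String address) {
        try {
            URL url = new URL(address);
            return readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling PageReader.fetch("http://blog.codehangover.com") should then hand back the whole page, ready for the parsers further down.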

Apache Commons HttpClient

You can also use Apache Commons HttpClient, a library that is slightly easier to use.

HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://blog.codehangover.com");

try {
 client.executeMethod(method);

 byte[] responseBody = method.getResponseBody();

 System.out.println(new String(responseBody));

} catch (Exception e) {

 e.printStackTrace();

} finally {

 method.releaseConnection();
}

But the real fun comes after you get the HTML. Now you get to work with it.

7 Things to do with HTML Source in Java

1. Extract the Text from the Markup

I worked on a web application that was similar to a feed reader. One of the features we supported was searching the pages. To do this I needed to extract the text from the markup, and I used the CyberNeko HTML Parser for the task.

private String getHtmlFilteredString(Reader reader)
{

  // create element remover filter
  ElementRemover remover;
  remover = new ElementRemover();
  remover.removeElement("script");
  remover.removeElement("link");
  remover.removeElement("style");
  remover.removeElement("CDATA");
  remover.removeElement("<!--");
  remover.removeElement("meta");

  OutputStream stream = new ByteArrayOutputStream();

  try
  {
    String encoding = "ISO-8859-1";
    XMLDocumentFilter writer = new Writer(stream, encoding);

    XMLDocumentFilter[] filters = {remover, writer};

    XMLInputSource source = new XMLInputSource(null, null, null, reader, null);

    XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

    parser.parse(source);

  } catch (Exception e) {

    e.printStackTrace();
  }

  String content = stream.toString().trim();

  return content;
}

2. Extract Links

To find all of the links in an HTML fragment, maybe for your own spider or to extract email addresses, you can use HtmlParser.


Collection<String> links = new ArrayList<String>();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));

  for (int i = 0; i < list.size (); i++){
    LinkTag extracted = (LinkTag)list.elementAt(i);
    String extractedLink = extracted.getLink();
    links.add(extractedLink);
  }

} catch (Exception e) {

  e.printStackTrace();
}
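
Links pulled out of a fragment are often relative, so a spider usually needs them resolved against the page they came from. Here is a rough sketch of that step using only java.net.URI. The class LinkResolver and the method resolveAll are hypothetical names, and whether your parser has already resolved the links depends on how you fed it the HTML, so treat this as an optional extra pass.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper class, not part of HtmlParser
final class LinkResolver {

    /** Resolves each link against the page URL; links that are not valid URIs are skipped. */
    static List<String> resolveAll(String pageUrl, Collection<String> links) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<>();
        for (String link : links) {
            try {
                // Absolute links come back unchanged, relative ones are merged with the base
                absolute.add(base.resolve(link.trim()).toString());
            } catch (IllegalArgumentException e) {
                // The link is not a syntactically valid URI (for example it contains a space)
            }
        }
        return absolute;
    }
}
```

For example, resolving "../c.html" against "http://example.com/a/b.html" yields "http://example.com/c.html".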

3. Change Links

Using the previous code, instead of calling getLink, you can call setLink to change the href. For example, this might be used by any type of analytics software that needs to track hits.
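
To make the analytics idea a bit more concrete, here is a small sketch of how the new href could be built before handing it to setLink. The redirect endpoint http://example.com/track?url= and the names TrackedLink and toTrackedLink are invented for illustration; a real tracker would have its own URL scheme.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper class for illustration only
final class TrackedLink {

    // Made-up redirect endpoint; replace with whatever your tracker expects
    static final String TRACKER = "http://example.com/track?url=";

    /** Wraps the original href in a redirect URL so the hit can be counted. */
    static String toTrackedLink(String originalHref) {
        try {
            // The original URL must be encoded to survive as a query parameter
            return TRACKER + URLEncoder.encode(originalHref, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }
}
```

Inside the loop from the previous section, the call would look something like extracted.setLink(TrackedLink.toTrackedLink(extracted.getLink())).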

4. Collect Email Addresses

Using the previous code, before adding the link to the Collection, check whether it uses the mailto protocol by calling the boolean method isMailLink().

  for (int i = 0; i < list.size (); i++){

    LinkTag extracted = (LinkTag)list.elementAt(i);

    if (extracted.isMailLink())
    {
      String extractedLink = extracted.getLink();
      links.add(extractedLink);
    }
  }
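
A mailto link can carry more than the address, such as a ?subject= part. The sketch below trims a link down to the bare address. I am not certain whether getLink() returns the href with or without the mailto: prefix, so the helper accepts both; MailtoSketch and addressOf are hypothetical names.

```java
// Hypothetical helper class, not part of HtmlParser
final class MailtoSketch {

    private static final String PREFIX = "mailto:";

    /** Strips an optional "mailto:" prefix and any "?subject=..." style query. */
    static String addressOf(String link) {
        String address = link.trim();
        // Case-insensitive prefix check, so "MAILTO:" is handled too
        if (address.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            address = address.substring(PREFIX.length());
        }
        int query = address.indexOf('?');
        if (query >= 0) {
            address = address.substring(0, query);
        }
        return address;
    }
}
```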

5. Collect Images

Still using HtmlParser, you can extract images by filtering on the ImageTag.


Collection<String> imageUrls = new ArrayList<String>();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));

  for (int i = 0; i < list.size (); i++){
    ImageTag extracted = (ImageTag)list.elementAt(i);
    String extractedImageSrc = extracted.getImageUrl();
    imageUrls.add(extractedImageSrc);
  }

} catch (Exception e) {

  e.printStackTrace();
}

6. Add Syntax Highlighting

If you’re going to present the source on screen, you should add syntax highlighting. If you’re displaying it on a web page, I would highly recommend something like SyntaxHighlighter, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called Syntax, which is now several years old.

try {

     ToHTML toHTML = new ToHTML();

     toHTML.setInput(new FileReader("Source.java"));
     toHTML.setOutput(new FileWriter("Source.java.html"));
     toHTML.setMimeType("text/x-java");
     toHTML.setFileExt("java");

     toHTML.writeFullHTML();

 } catch (Exception e){

    e.printStackTrace();
 }
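
One thing to keep in mind if you put fetched HTML source on a web page yourself: the markup has to be escaped first, otherwise the browser will render it instead of showing it. Here is a minimal sketch of that step, assuming you only need the four characters that matter inside element content and double-quoted attributes. HtmlEscaper is a made-up name, and a highlighter may well do this for you.

```java
// Hypothetical helper class for illustration only
final class HtmlEscaper {

    /** Replaces &, <, > and " with their HTML entities so the source displays as text. */
    static String escape(String source) {
        StringBuilder out = new StringBuilder(source.length() + 16);
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```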

7. Diff Two Sources

Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.

Madlep wrote his own method to handle the diff and output the results over on Stack Overflow.

Given some simple input, his code gives simple output.


// Each line ends with \n so that a line-based diff sees separate lines
String one = "" +
    "<ul>\n" +
    "  <li>item 1</li>\n" +
    "  <li>item 2</li>\n" +
    "</ul>";

String two = "" +
    "<p>This is text</p>\n" +
    "<ul>\n" +
    "  <li>item 1</li>\n" +
    "  <li>item 2</li>\n" +
    "  <li>item 3</li>\n" +
    "</ul>";

System.out.println(diffSideBySide(one, two));

Outputs

                     >  <p>This is text</p>
<ul>                    <ul>
 <li>item 1</li>          <li>item 1</li>
 <li>item 2</li>          <li>item 2</li>
                     >    <li>item 3</li>
</ul>                   </ul>
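
If you are curious how a line diff like this can work, here is a small sketch. To be clear, this is not Madlep's code and it does not produce his side-by-side layout. It is a plain longest-common-subsequence diff that I am assuming is good enough for small inputs, with made-up names (LineDiffSketch, diff) and a simpler output: each line is prefixed with "  " if unchanged, "< " if it only appears in the first text, and "> " if it only appears in the second.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; a simplified stand-in, not the Stack Overflow method
final class LineDiffSketch {

    /** Returns one entry per line, prefixed with "  ", "< " or "> ". */
    static List<String> diff(String first, String second) {
        String[] x = first.split("\n");
        String[] y = second.split("\n");

        // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..]
        int[][] lcs = new int[x.length + 1][y.length + 1];
        for (int i = x.length - 1; i >= 0; i--) {
            for (int j = y.length - 1; j >= 0; j--) {
                lcs[i][j] = x[i].equals(y[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both texts, following the table to keep as many common lines as possible
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < x.length && j < y.length) {
            if (x[i].equals(y[j])) {
                out.add("  " + x[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("< " + x[i++]);   // line only in the first text
            } else {
                out.add("> " + y[j++]);   // line only in the second text
            }
        }
        while (i < x.length) out.add("< " + x[i++]);
        while (j < y.length) out.add("> " + y[j++]);
        return out;
    }
}
```

The table takes memory proportional to the product of the two line counts, so this is fine for a page or two of HTML but not something I would point at very large documents.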

3 More Things to do with the HTML

This list sure feels incomplete at only 7 things instead of a nice even 10. What else could you do with the HTML? I’ll pick the top three from the comments and finish this blog post with them.
