Read HTML with Java – Then 7 Fun Things to do to It

Written by MikeNereson

August 3rd, 2009 at 10:57 pm

Posted in java


With 18 comments

There are several ways to get the HTML content of a URL from Java, and even more if you bring in open source Java libraries. Last week Lars Vogel shared how to get the HTML using nothing but the SDK.

SDK

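// Requires java.net.URL and java.io.{InputStream, InputStreamReader, BufferedReader}
final URL url = new URL("http://blog.codehangover.com");
// Open the connection first, then wrap the byte stream in a character reader
final InputStream inputStream = url.openStream();
final BufferedReader reader
             = new BufferedReader(new InputStreamReader(inputStream));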

String line;

while ((line = reader.readLine()) != null) {
   System.out.println(line);
}

reader.close();
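
The loop above prints each line as it arrives. If you would rather have the whole page as a single String, which is what the later examples work on, a small helper along these lines may do. Treat it as a sketch: HtmlFetcher and fetch are names made up for illustration, and the code assumes the page is UTF-8, which is not true of every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original snippet.
final class HtmlFetcher {

    // Reads everything the URL returns into one String.
    // Assumes UTF-8; a real client should honor the charset the server reports.
    static String fetch(URL url) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the stream even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetch(URI.create("http://blog.codehangover.com").toURL()));
    }
}
```

Unlike the snippet above, the try-with-resources form needs no explicit reader.close() call.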

Apache Commons HttpClient

You can also use the Apache Commons HttpClient, a library that is slightly easier to use.

HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://blog.codehangover.com");

try {
 client.executeMethod(method);

 byte[] responseBody = method.getResponseBody();

 System.out.println(new String(responseBody));

} catch (Exception e) {

 e.printStackTrace();

} finally {

 method.releaseConnection();
}





But the real fun comes after you get the HTML. Now you get to work with it.

7 Things to do with HTML Source in Java

1. Extract the Text from the Markup

I worked on a web application that was similar to a feed reader. One of the features that we supported was searching the pages. To do this I needed to extract the text from the markup. I used the CyberNeko HTML Parser for this task.

private String getHtmlFilteredString(Reader reader)
{

  // create element remover filter
  ElementRemover remover;
  remover = new ElementRemover();
  remover.removeElement("script");
  remover.removeElement("link");
  remover.removeElement("style");
  remover.removeElement("CDATA");
  remover.removeElement("<!--");
  remover.removeElement("meta");

  OutputStream stream = new ByteArrayOutputStream();

  try
  {
    String encoding = "ISO-8859-1";
    XMLDocumentFilter writer = new Writer(stream, encoding);

    XMLDocumentFilter[] filters = {remover, writer};

    XMLInputSource source = new XMLInputSource(null, null, null, reader, null);

    XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

    parser.parse(source);

  } catch (Exception e) {

    e.printStackTrace();
  }

  String content = stream.toString().trim();

  return content;
}

2. Extract Links

To find all of the links in an HTML fragment, perhaps for your own spider or to extract email addresses, you can use HtmlParser.


Collection<String> links = new ArrayList<String>();

try {

  // url is the page address and htmlBody is the HTML source, both Strings
  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));

  for (int i = 0; i < list.size (); i++){
    LinkTag extracted = (LinkTag)list.elementAt(i);
    String extractedLink = extracted.getLink();
    links.add(extractedLink);
  }

} catch (Exception e) {

  e.printStackTrace();
}
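
The snippet builds uriLink but never uses it. One plausible use, offered here as an assumption rather than a statement of intent, is turning relative hrefs into absolute URLs before storing them. The sketch below does that with java.net.URI alone; LinkResolver and toAbsolute are hypothetical names.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: resolves extracted hrefs against the page address.
final class LinkResolver {

    // Hrefs that are not valid URIs are skipped rather than aborting the run.
    static List<String> toAbsolute(String pageUrl, Collection<String> hrefs) {
        URI base = URI.create(pageUrl);
        List<String> absolute = new ArrayList<String>();
        for (String href : hrefs) {
            try {
                absolute.add(base.resolve(href.trim()).toString());
            } catch (IllegalArgumentException e) {
                // malformed href, for example one containing a space
            }
        }
        return absolute;
    }

    public static void main(String[] args) {
        System.out.println(toAbsolute(
                "http://blog.codehangover.com/java/index.html",
                Arrays.asList("/about", "post.html", "http://example.com/x")));
    }
}
```

Links that are already absolute pass through unchanged, so the helper can be applied to every extracted link without checking first.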

3. Change Links

Using the previous code, instead of calling getLink, you can call setLink to change the href. For example, this might be used by any type of analytics software that needs to track hits.
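
As a sketch of what might be passed to setLink for hit tracking: wrap the original address in a redirect URL that carries the real target as an encoded query parameter. The tracker address and the parameter name u are invented for illustration, as is the TrackingLinks class.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a tracking redirect for a link.
final class TrackingLinks {

    // Made-up redirect endpoint; substitute your own.
    private static final String TRACKER = "http://tracker.example.com/r?u=";

    static String wrap(String originalLink) {
        try {
            // Encode the target so its own ? and & do not break the query string
            return TRACKER + URLEncoder.encode(originalLink, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(wrap("http://blog.codehangover.com/?p=1"));
    }
}
```

Inside the loop you would then call extracted.setLink(TrackingLinks.wrap(extracted.getLink())).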

4. Collect Email Addresses

Using the previous code, before adding the link to the Collection, check whether it uses the mailto protocol by calling the boolean method isMailLink().

  for (int i = 0; i < list.size (); i++){

    LinkTag extracted = (LinkTag)list.elementAt(i);

    if (extracted.isMailLink())
    {
      String extractedLink = extracted.getLink();
      links.add(extractedLink);
    }
  }
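
Collected addresses often contain duplicates or carry extras such as ?subject=Hello. A small cleanup step that does not depend on any library might look like the following. MailLinks is a hypothetical name, and the code accepts links with or without the mailto: prefix so it makes no assumption about the exact form getLink() returns.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: tidies up collected mail links.
final class MailLinks {

    // Strips an optional "mailto:" prefix and any "?subject=..." suffix.
    // Lowercases the result, treating addresses as case-insensitive.
    static String toAddress(String link) {
        String address = link.trim();
        if (address.regionMatches(true, 0, "mailto:", 0, 7)) {
            address = address.substring(7);
        }
        int query = address.indexOf('?');
        if (query >= 0) {
            address = address.substring(0, query);
        }
        return address.trim().toLowerCase(Locale.ROOT);
    }

    // Keeps first-seen order and drops duplicates and empty results.
    static Set<String> uniqueAddresses(Collection<String> links) {
        Set<String> addresses = new LinkedHashSet<String>();
        for (String link : links) {
            String address = toAddress(link);
            if (!address.isEmpty()) {
                addresses.add(address);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        System.out.println(uniqueAddresses(Arrays.asList(
                "mailto:Mike@Example.com?subject=Hi", "mike@example.com")));
    }
}
```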

5. Collect Images

Still using HtmlParser, you can extract images by filtering on the ImageTag.


Collection<String> imageUrls = new ArrayList<String>();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));

  for (int i = 0; i < list.size (); i++){
    ImageTag extracted = (ImageTag)list.elementAt(i);
    String extractedImageSrc = extracted.getImageURL();
    imageUrls.add(extractedImageSrc);
  }

} catch (Exception e) {

  e.printStackTrace();
}

6. Add Syntax Highlighting

If you’re going to present the source on screen, you have to add syntax highlighting. If you’re displaying it on a web page, I would highly recommend using something like SyntaxHighlighter, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called Syntax, which is now several years old.

try {

     ToHTML toHTML = new ToHTML();

     toHTML.setInput(new FileReader("Source.java"));
     toHTML.setOutput(new FileWriter("Source.java.html"));
     toHTML.setMimeType("text/x-java");
     toHTML.setFileExt("java");

     toHTML.writeFullHTML();

 } catch (Exception e){

    e.printStackTrace();
 }

7. Diff Two Sources

Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.

Over on Stack Overflow, Madlep wrote his own method to compute the diff and print the results.

Given some simple input, his code gives simple output.


// Newlines separate the lines that appear in the side-by-side output
String one = "" +
    "<ul>\n" +
    "  <li>item 1</li>\n" +
    "  <li>item 2</li>\n" +
    "</ul>";

String two = "" +
    "<p>This is text</p>\n" +
    "<ul>\n" +
    "  <li>item 1</li>\n" +
    "  <li>item 2</li>\n" +
    "  <li>item 3</li>\n" +
    "</ul>";

System.out.println(diffSideBySide(one, two));

Outputs

                     >  <p>This is text</p>
<ul>                    <ul>
 <li>item 1</li>          <li>item 1</li>
 <li>item 2</li>          <li>item 2</li>
                     >    <li>item 3</li>
</ul>                   </ul>
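
Madlep's code is not reproduced here. To show the idea, the following is an independent minimal sketch, not his implementation, of a line-based side-by-side diff built on a longest-common-subsequence table. The class name, the 24-character left column and the use of < and > as markers are my own choices.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a side-by-side line diff; not Madlep's method.
final class SideBySideDiff {

    static List<String> diff(String left, String right) {
        String[] a = left.split("\n");
        String[] b = right.split("\n");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both inputs, following the table toward the longest match
        List<String> rows = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < a.length || j < b.length) {
            if (i < a.length && j < b.length && a[i].equals(b[j])) {
                rows.add(row(a[i++], ' ', b[j++]));   // unchanged line
            } else if (j < b.length
                    && (i == a.length || lcs[i][j + 1] >= lcs[i + 1][j])) {
                rows.add(row("", '>', b[j++]));       // only on the right
            } else {
                rows.add(row(a[i++], '<', ""));       // only on the left
            }
        }
        return rows;
    }

    // Left text padded to 24 columns, then the marker, then the right text.
    // Longer left lines are not truncated, so they push the marker right.
    private static String row(String left, char marker, String right) {
        return String.format("%-24s%c %s", left, marker, right);
    }

    public static void main(String[] args) {
        String one = "<ul>\n  <li>item 1</li>\n  <li>item 2</li>\n</ul>";
        String two = "<p>This is text</p>\n<ul>\n  <li>item 1</li>\n"
                + "  <li>item 2</li>\n  <li>item 3</li>\n</ul>";
        for (String row : diff(one, two)) {
            System.out.println(row);
        }
    }
}
```

The table costs memory proportional to the product of the two line counts, which is fine for a single page but worth keeping in mind for very large documents.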

3 More Things to do with the HTML

This list sure feels incomplete at only 7 of a nice even 10 things. What else could you do with the HTML? I’ll pick the top three from the comments and finish this blog post with them.







Comments

18 Responses to “Read HTML with Java – Then 7 Fun Things to do to It”

  1. MikeNereson on August 21st, 2009 4:43 pm

    Another fun thing to do is to print the HTML to PDF. I used iText for this on a project a couple of years ago http://www.lowagie.com/itext

  2. Adil on November 2nd, 2009 3:57 am

    Hi
    I am having a problem extracting the tag that is used for CSS (stylesheet). I want to extract the tag and, after extracting it, I need its src attribute so I can have its source. Kindly help me out. Thanks in advance.

    Regards,
    Adil Badshah

  3. Sathya Narayanan K on January 8th, 2010 4:06 am

    Hi,

    Thanks for posting this blog on reading a html file.
    I have few clarification reg the code. please can you tell me how to declare the “htmlBody” variable , as it is not declared in this blog.

    Thanks in Advance,
    Sathya narayanan K.

  4. S. Metzger on March 31st, 2010 1:50 pm

    As a Java newcomer, i found this a very interesting read! Thanks for writing this article.

  5. MikeNereson on March 31st, 2010 3:02 pm

    @S. Metzger – : ) Thanks for reading and thanks for your comments.

  6. Yaniv on July 13th, 2010 5:13 am

    Thank you for this great article it’s exactly what I needed. I have one question though: after changing the links href, how can I get the new html (the whole page html) from the parser?

    Thanks.

  7. MikeNereson on July 13th, 2010 8:14 am

    @Yaniv – HtmlParser’s way of working with the DOM is through the NodeList. There may not be any effective way of getting the full HTML from the Parser. The Parser documentation states

    The Parser provides access to the contents of the page, via a NodeIterator, a NodeList or a NodeVisitor.

    Typical usage of the parser is:

    Parser parser = new Parser ("http://whatever");
    NodeList list = parser.parse (null);
    // do something with your list of nodes.

    So I don’t think there is any public method to access the HTML. You might have to write something yourself to access the Parser’s underlying model. Good luck.

  8. Jack on November 4th, 2010 5:31 pm

    Hi! Thank you very much for this post! you’r great :D
    i use java for 3 years, i used it for medical scope – dicom – but now i need something that analyze html page and get some values from it , like other page html to get infos ecc.. . I don’t know other potentiality of this technology :D
    thank you !!! byee

  9. billy on December 10th, 2010 6:43 am

    thank you !

    took a couple years of java in high school and have picked it up again not too long ago and this was just what i was looking for. i have learned a lot from this post. i tried to find links in html by my own means and after i had exhausted myself i got a very buggy program together. having it done for me and given to me in a way that i can learn how it works makes me <3 everything open source.

    anyways thank you again.
    -billy

  10. MikeNereson on December 10th, 2010 10:40 am

    @billy – Thanks for reading and thank you for your comments.

  11. Tessie on July 10th, 2011 8:17 am

    Wow, this is in every respect what I needed to know.

  12. Bhargav on July 28th, 2011 9:04 am

    Thanks very much.
    I would like to know How can we get HTML object properties from its source code..please help me

  13. AUNGTHU on August 22nd, 2011 6:54 am

    Hi, I want to say thank you for your post. That help me a lot. I am from Burma. Now, I want to develop J2ME Web browser and that browser can act like normal web browser(like Mozilla Firefox, Internet Explorer and Netscape ) in Mobile phone. Can you tell me that project is possible? I need sdk for that project. But I don’t know what sdk is best for J2ME web browser. I think you have great experience in Java and Web Technology. Please reply me. Thank you..

  14. Jeffry on August 22nd, 2011 10:44 pm

    Another thing is to create an application to find out how many particular documents (e.g. .pdf, .doc, etc) are there on a web site.

    The code will recursively scan each paths and find out the number of particular documents and sum it up to an integer counter.

  15. Divya on October 13th, 2011 1:51 pm

    Hi i need to know how to bypass a proxy server and connect with internet so that i can read the source code of a web page only if its URL is given as input..pls help me thank u

  16. Matt on December 22nd, 2011 3:52 am

    Hey I’ve been doing a lot of searching on the internet trying to find out how to make java programs that log into webpages for you, for instance you have your computer set up a connection with hotmail.com and then you give it the required information and it logs in for you so a little write up about that would be super interesting or if someone could email me links or anything else that would be helpful!
    Thanks!

  17. Asya_K on September 16th, 2013 1:47 am

    this works!!!
    now i want to extract date of an article from Html page.. plz help..

