Read HTML with Java – Then 7 Fun Things to do to It

Written by MikeNereson

August 3rd, 2009 at 10:57 pm

Posted in java

With 16 comments

There are several ways to get the HTML content of a URL from Java, and even more if you use open source Java libraries. Last week Lars Vogel shared how to get the HTML using nothing but the SDK.

SDK

final URL url = new URL("http://blog.codehangover.com");
// open the connection and wrap the raw byte stream in a buffered character reader
final InputStream inputStream = url.openStream();
final BufferedReader reader
             = new BufferedReader(new InputStreamReader(inputStream));

String line;

while ((line = reader.readLine()) != null) {
   System.out.println(line);
}

reader.close();
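For readers on Java 7 or later, the same idea can be wrapped in a small helper. This is only a sketch: the class name PageFetcher and the choice of UTF-8 are my assumptions, not part of the original snippet, and a real page may declare a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Reads everything behind the URL into one String, decoding with the given charset. */
    public static String fetch(URL url, Charset charset) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the underlying stream) even on failure
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine strips the line terminator, so add one back
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        // UTF-8 is an assumption; check the page's declared encoding in real use
        System.out.println(fetch(new URL(args[0]), StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which is what a bare InputStreamReader falls back to.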

Apache Commons HttpClient

You can also use Apache Commons HttpClient, a library that is slightly easier to use.

HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://blog.codehangover.com");

try {
 client.executeMethod(method);

 byte[] responseBody = method.getResponseBody();

 System.out.println(new String(responseBody));

} catch (Exception e) {

 e.printStackTrace();

} finally {

 method.releaseConnection();
}

But the real fun comes after you get the HTML. Now you get to work with it.

7 Things to do with HTML Source in Java

1. Extract the Text from the Markup

I worked on a web application that was similar to a feed reader. One of the features that we supported was searching the pages. To do this I needed to extract the text from the markup. I use the CyberNeko HTML Parser for this task.

private String getHtmlFilteredString(Reader reader)
{

  // create element remover filter
  ElementRemover remover;
  remover = new ElementRemover();
  remover.removeElement("script");
  remover.removeElement("link");
  remover.removeElement("style");
  remover.removeElement("CDATA");
  remover.removeElement("<!--");
  remover.removeElement("meta");

  OutputStream stream = new ByteArrayOutputStream();

  try
  {
    String encoding = "ISO-8859-1";
    XMLDocumentFilter writer = new Writer(stream, encoding);

    XMLDocumentFilter[] filters = {remover, writer};

    XMLInputSource source = new XMLInputSource(null, null, null, reader, null);

    XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

    parser.parse(source);

  } catch (Exception e) {

    e.printStackTrace();
  }

  String content = stream.toString().trim();

  return content;
}

2. Extract Links

To find all of the links in an HTML fragment, maybe for your own spider, or to extract email addresses, you can use HtmlParser.


Collection<String> links = new ArrayList<String>();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));

  for (int i = 0; i < list.size (); i++){
    LinkTag extracted = (LinkTag)list.elementAt(i);
    String extractedLink = extracted.getLink();
    links.add(extractedLink);
  }

} catch (Exception e) {

  e.printStackTrace();
}

3. Change Links

Using the previous code, instead of calling getLink, you can call setLink to change the href. For example, this might be used by any type of analytics software that needs to track hits.
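The rewriting idea itself can be sketched with nothing but the JDK. To be clear, this is a stand-in that uses a regular expression rather than HtmlParser's setLink. The class name LinkRewriter and the tracker prefix http://example.com/track?to= are hypothetical, and the pattern only handles double-quoted href attributes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkRewriter {

    // Matches href="..." with double quotes only; a simplification, not a full HTML parser
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    /** Routes every double-quoted href through the given tracking prefix. */
    public static String rewrite(String html, String trackerPrefix)
            throws UnsupportedEncodingException {
        Matcher matcher = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            // URL-encode the original target so it survives as a query parameter
            String target = URLEncoder.encode(matcher.group(1), "UTF-8");
            String replacement = "href=\"" + trackerPrefix + target + "\"";
            // quoteReplacement keeps '$' and '\' from being read as group references
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String html = "<a href=\"http://blog.codehangover.com/about\">About</a>";
        // hypothetical tracker endpoint, used only for illustration
        System.out.println(rewrite(html, "http://example.com/track?to="));
    }
}
```

For anything beyond a quick experiment, a real parser such as HtmlParser is the safer choice, because a regex will miss single-quoted or unquoted attributes.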

4. Collect Email Addresses

Using the previous code, before adding the link to the Collection, check whether it uses the mailto protocol by calling the boolean method isMailLink().

  for (int i = 0; i < list.size (); i++){

    LinkTag extracted = (LinkTag)list.elementAt(i);

    if (extracted.isMailLink())
    {
      String extractedLink = extracted.getLink();
      links.add(extractedLink);
    }
  }
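If all you have is a list of raw href values, the same filtering can be approximated without HtmlParser. The sketch below assumes hrefs of the form mailto:someone@example.com?subject=Hi. The class name MailtoFilter and the sample addresses are hypothetical, and percent-encoded addresses are not decoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MailtoFilter {

    /** Returns the addresses found in mailto: hrefs, skipping every other link. */
    public static List<String> addresses(List<String> hrefs) {
        List<String> result = new ArrayList<String>();
        for (String href : hrefs) {
            String trimmed = href.trim();
            // regionMatches with ignoreCase=true gives a case-insensitive prefix check
            if (!trimmed.regionMatches(true, 0, "mailto:", 0, 7)) {
                continue;
            }
            String rest = trimmed.substring(7);
            // Drop optional headers such as ?subject=...
            int query = rest.indexOf('?');
            String recipients = query >= 0 ? rest.substring(0, query) : rest;
            // A mailto href may list several comma-separated recipients
            for (String address : recipients.split(",")) {
                if (!address.trim().isEmpty()) {
                    result.add(address.trim());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://blog.codehangover.com",
                "mailto:a@example.com?subject=Hello",
                "MAILTO:b@example.com,c@example.com");
        System.out.println(addresses(hrefs));
    }
}
```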

5. Collect Images

Still using HtmlParser, you can extract images by filtering on the ImageTag.


Collection<String> imageUrls = new ArrayList<String>();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));

  for (int i = 0; i < list.size (); i++){
    ImageTag extracted = (ImageTag)list.elementAt(i);
    String extractedImageSrc = extracted.getImageUrl();
    imageUrls.add(extractedImageSrc);
  }

} catch (Exception e) {

  e.printStackTrace();
}

6. Add Syntax Highlighting

If you’re going to present the source on screen, you should add syntax highlighting. If you’re displaying it on a web page, I would highly recommend using something like SyntaxHighlighter, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called Syntax, which is now several years old.

try {

     ToHTML toHTML = new ToHTML();

     toHTML.setInput(new FileReader("Source.java"));
     toHTML.setOutput(new FileWriter("Source.java.html"));
     toHTML.setMimeType("text/x-java");
     toHTML.setFileExt("java");

     toHTML.writeFullHTML();

 } catch (Exception e){

    e.printStackTrace();
 }

7. Diff Two Sources

Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.

Madlep wrote his own method to handle the diff and output the results over on Stack Overflow.

Given some simple input, his code gives simple output.


String one = "" +
    "<ul>" +
    "  <li>item 1</li>" +
    "  <li>item 2</li>" +
    "</ul>";

String two = "" +
    "<p>This is text</p>" +
    "<ul>" +
    "  <li>item 1</li>" +
    "  <li>item 2</li>" +
    "  <li>item 3</li>" +
    "</ul>";

System.out.println(diffSideBySide(one, two));

Outputs

                     >  <p>This is text</p>
<ul>                    <ul>
 <li>item 1</li>          <li>item 1</li>
 <li>item 2</li>          <li>item 2</li>
                     >    <li>item 3</li>
</ul>                   </ul>
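If you just want to see how a line-based diff works, here is a minimal sketch built on a longest-common-subsequence table. It is not Madlep's diffSideBySide: the class name LineDiff is hypothetical, the output uses "+ " and "- " prefixes instead of two columns, and the inputs are assumed to separate lines with \n.

```java
import java.util.ArrayList;
import java.util.List;

public class LineDiff {

    /** Returns one entry per line: "  " unchanged, "- " only in before, "+ " only in after. */
    public static List<String> diff(String before, String after) {
        String[] a = before.split("\n");
        String[] b = after.split("\n");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both inputs, following the table to decide keep / remove / add
        List<String> out = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                out.add("  " + a[i]);
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + a[i++]);
            } else {
                out.add("+ " + b[j++]);
            }
        }
        // Whatever is left on either side is a pure removal or addition
        while (i < a.length) out.add("- " + a[i++]);
        while (j < b.length) out.add("+ " + b[j++]);
        return out;
    }

    public static void main(String[] args) {
        String one = "<ul>\n  <li>item 1</li>\n  <li>item 2</li>\n</ul>";
        String two = "<p>This is text</p>\n<ul>\n  <li>item 1</li>\n"
                + "  <li>item 2</li>\n  <li>item 3</li>\n</ul>";
        for (String line : diff(one, two)) {
            System.out.println(line);
        }
    }
}
```

Run on these two inputs, the sketch marks the paragraph line and the item 3 line with "+ " and leaves the other four lines unchanged. The table costs memory proportional to the product of the two line counts, so this is fine for pages but not for very large files.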

3 More Things to do with the HTML

This list sure feels incomplete at only 7 of a nice even 10 things. What else could you do with the HTML? I’ll pick the top three from the comments and finish this blog post with them.

Getting friendly with Spring, JUnit and EasyMock.

Written by DanEngland

August 1st, 2009 at 8:31 pm

Posted in java

With no comments

Here are some steps that can get you using Spring, JUnit and EasyMock all together in some Test Driven Development hotness.

Start by adding the following lines to the top of your unit test. Specifying autowire by name ensures you get the injection you want and stops Spring from complaining that more than one mock object of the same type exists in your mock-applicationContext.xml. When you specify the Spring JUnit runner, you must provide one or more context configurations with @ContextConfiguration.

MyClassUnitTest.java

...
@Configurable(autowire = Autowire.BY_NAME)
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"classpath:com/some/domain/someproject/resources/mock-applicationContext.xml"})
public class MyClassUnitTest {
...
     @Autowired private Collaborator mockCollaborator;

     @Before
     public void setup() throws Exception {
     ...
     }

     @After
     public void teardown() throws Exception {
     ...
     }

     @Test
     public void testMyClass() throws Exception {

          //Set your mock behavior here
          ....

          EasyMock.replay(mockCollaborator);

          MyClass myClass = new MyClass(mockCollaborator);
          myClass.run();
          EasyMock.verify(mockCollaborator);

          // Other JUnit assertions
          ...
     }
 }

Then add your mocks to your mock-applicationContext.xml (an applicationContext in your test resources that exists only to provide your unit tests with Spring-injected dependencies):

mock-applicationContext.xml

...
<bean id="mockCollaborator" name="mockCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
     <constructor-arg value="com.some.domain.someproject.Collaborator"/>
</bean>

<bean id="mockOtherCollaborator" name="mockOtherCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
     <constructor-arg value="com.some.domain.someproject.OtherCollaborator"/>
</bean>
...

Ensure your Maven POM has the following test-scoped dependency:

pom.xml

...
<dependency>
     <groupId>org.springframework</groupId>
     <artifactId>spring-test</artifactId>
     <version>2.5.6</version>
     <scope>test</scope>
</dependency>
...

Then use your mocks as normal. Make sure you call EasyMock.reset(mock) in your @After teardown() method so that each of your mocks is reset for each test. You can also tell Spring to reset the context after a test, if needed, with @DirtiesContext.

For official documentation and tutorials check out these links.

Spring

http://www.springsource.org/documentation

JUnit

http://www.junit.org

EasyMock

http://easymock.org/Documentation.html

New and Unknown Java Libraries

Written by MikeNereson

July 30th, 2009 at 10:38 am

Posted in Software Tools

With one comment

I like finding new, useful Java libraries. I usually find them from posts like this one. Other times I find them because I have a problem that needs a solution. Here are some of my favorite unknown Java libraries that I have found over the past year. Today I use every one of these in my projects. Interestingly, 4 of 6 are hosted on http://code.google.com.

Google-API-Translate-Java

Provides a simple, unofficial, Java client API for using Google Translate.  I use this to translate caption files for videos into several other languages. It has lots of options and has never failed me.

XmlTool

XMLTool is a very simple Java library for common operations on an XML document. It offers an easy-to-use class built on the Fluent Interface pattern to simplify XML manipulation.

XStream

XStream is a simple library to serialize objects to XML and back again. Also useful for creating JSON responses.

Architecture Rules

Architecture Rules uses an XML configuration file and optional programmatic configuration to assert your code’s architecture via unit tests or Ant tasks. The test can assert that specific packages do not depend on others, and it can check for and report on cyclic dependencies among your project’s packages and classes. Get cyclic dependency detection with the Maven 2 plugin and zero configuration.

CyberNeko HTML Parser

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. This can be used to extract the textual content from an HTML fragment.

Charts4j

charts4j is a free, lightweight charts & graphs Java API. It enables developers to programmatically create the charts available in the Google Chart API through a straightforward and intuitive Java API.

What do you use?

Do you have any new and unknown Java tools that you use and would recommend we check out?
