Read HTML with Java – Then 7 Fun Things to do to It
There are several ways to get the HTML content of a URL from Java, and even more if you use open source Java libraries. Last week Lars Vogel shared how to get the HTML using nothing but the SDK.
SDK
final URL url = new URL("http://blog.codehangover.com");
final InputStream inputStream = url.openStream();
final BufferedReader reader
= new BufferedReader(new InputStreamReader(inputStream));
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
reader.close();
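The SDK snippet above decodes with the platform default charset and leaves the reader open if readLine() throws. Here is a minimal sketch of the same loop with an explicit charset and try-with-resources, assuming Java 7 or later. HtmlReader and readAll are hypothetical names of my own, and the in-memory stream in main only stands in for url.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration; not part of the SDK. */
final class HtmlReader {

    /** Reads the whole stream as text, decoding it with the given charset. */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder html = new StringBuilder();
        // try-with-resources closes the reader (and the wrapped stream) even on failure
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumption: in real use you would pass new URL("...").openStream() here.
        InputStream in = new ByteArrayInputStream(
                "<html>\n<body>Hi</body>\n</html>".getBytes(StandardCharsets.UTF_8));
        System.out.print(readAll(in, StandardCharsets.UTF_8));
    }
}
```

Passing the charset explicitly matters when the page is not in your platform's default encoding. Picking the right one (from the Content-Type header, for example) is left out of this sketch.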
Apache Commons HttpClient
You can also use Apache Commons HttpClient, a library that is slightly easier to use.
HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://blog.codehangover.com");
try {
client.executeMethod(method);
byte[] responseBody = method.getResponseBody();
System.out.println(new String(responseBody));
} catch (Exception e) {
e.printStackTrace();
} finally {
method.releaseConnection();
}
But the real fun comes after you get the HTML. Now you get to work with it.
7 Things to do with HTML Source in Java
1. Extract the Text from the Markup
I worked on a web application that was similar to a feed reader. One of the features that we supported was searching the pages. To do this I needed to extract the text from the markup. I used the CyberNeko HTML Parser for this task.
private String getHtmlFilteredString(Reader reader)
{
// create element remover filter
ElementRemover remover;
remover = new ElementRemover();
remover.removeElement("script");
remover.removeElement("link");
remover.removeElement("style");
remover.removeElement("CDATA");
remover.removeElement("<!--");
remover.removeElement("meta");
OutputStream stream = new ByteArrayOutputStream();
try
{
String encoding = "ISO-8859-1";
XMLDocumentFilter writer = new Writer(stream, encoding);
XMLDocumentFilter[] filters = {remover, writer};
XMLInputSource source = new XMLInputSource(null, null, null, reader, null);
XMLParserConfiguration parser = new HTMLConfiguration();
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
parser.parse(source);
} catch (Exception e) {
e.printStackTrace();
}
String content = stream.toString().trim();
return content;
}
2. Extract Links
To find all of the links in an HTML fragment, maybe for your own spider or to extract email addresses, you can use HtmlParser:
Collection<String> links = new ArrayList<String>();
try {
URI uriLink = new URI(url);
Parser parser = new Parser();
parser.setInputHTML(htmlBody);
NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));
for (int i = 0; i < list.size (); i++){
LinkTag extracted = (LinkTag)list.elementAt(i);
String extractedLink = extracted.getLink();
links.add(extractedLink);
}
} catch (Exception e) {
e.printStackTrace();
}
3. Change Links
Using the previous code, instead of calling getLink, you can call setLink to change the href. For example, this might be used by any type of analytics software that needs to track hits.
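To make the tracking idea concrete, here is a rough sketch of rewriting every href so it passes through a redirect URL first. This is not the HtmlParser approach described above: it swaps in a plain JDK regular expression so the example runs without extra libraries, and a regex is not a real HTML parser. LinkRewriter, rewrite and the tracker URL are all names I made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; a regex stand-in for calling setLink on each LinkTag. */
final class LinkRewriter {

    // Only handles double-quoted href attributes, and matches them on any tag.
    private static final Pattern HREF =
            Pattern.compile("href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Prefixes every href with trackerPrefix and URL-encodes the original target. */
    static String rewrite(String html, String trackerPrefix) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String tracked = "href=\"" + trackerPrefix + encode(m.group(1)) + "\"";
            // quoteReplacement keeps '$' and '\' in the URL from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(tracked));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a?b=1\">x</a>";
        System.out.println(rewrite(html, "http://track.example/r?u="));
    }
}
```

The original target is URL-encoded so that it survives as a single query parameter on the (hypothetical) redirect endpoint.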
4. Collect Email Addresses
Using the previous code, before adding the link to the Collection, check whether it uses the mailto protocol by calling the boolean method isMailLink().
for (int i = 0; i < list.size (); i++){
LinkTag extracted = (LinkTag)list.elementAt(i);
if (extracted.isMailLink())
{
String extractedLink = extracted.getLink();
links.add(extractedLink);
}
}
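If all you have is a list of raw href strings rather than LinkTag objects, the same check can be sketched with java.net.URI from the JDK. This is an illustration under that assumption, not what HtmlParser does internally. MailtoFilter and emailAddresses are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that picks the email addresses out of raw href values. */
final class MailtoFilter {

    static List<String> emailAddresses(List<String> hrefs) {
        List<String> emails = new ArrayList<String>();
        for (String href : hrefs) {
            try {
                URI uri = new URI(href.trim());
                if ("mailto".equalsIgnoreCase(uri.getScheme())) {
                    // For mailto URIs the scheme-specific part is everything after "mailto:"
                    String address = uri.getSchemeSpecificPart();
                    int query = address.indexOf('?');
                    if (query >= 0) {
                        address = address.substring(0, query); // drop ?subject=... and friends
                    }
                    emails.add(address);
                }
            } catch (URISyntaxException e) {
                // Not a parseable URI, so it cannot be a mail link; skip it.
            }
        }
        return emails;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://example.com/",
                "mailto:bob@example.com",
                "MAILTO:amy@example.com?subject=Hi",
                "bad link");
        System.out.println(emailAddresses(hrefs));
    }
}
```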
5. Collect Images
Still using HtmlParser, you can extract images by filtering on the ImageTag.
Collection<String> imageUrls = new ArrayList<String>();
try {
URI uriLink = new URI(url);
Parser parser = new Parser();
parser.setInputHTML(htmlBody);
NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));
for (int i = 0; i < list.size (); i++){
ImageTag extracted = (ImageTag)list.elementAt(i);
String extractedImageSrc = extracted.getImageUrl();
imageUrls.add(extractedImageSrc);
}
} catch (Exception e) {
e.printStackTrace();
}
6. Add Syntax Highlighting
If you’re going to present the source on screen you have to add syntax highlighting. If you’re displaying it on a web page, I would highly recommend using something like SyntaxHighlighter, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called Syntax, which is now several years old.
try {
ToHTML toHTML = new ToHTML();
toHTML.setInput(new FileReader("Source.java"));
toHTML.setOutput(new FileWriter("Source.java.html"));
toHTML.setMimeType("text/x-java");
toHTML.setFileExt("java");
toHTML.writeFullHTML();
} catch (Exception e){
e.printStackTrace();
}
7. Diff Two Sources
Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.
Madlep wrote his own method to handle the diff and output the results, and posted it on Stack Overflow.
Given some simple input, his code gives simple output.
String one = "" +
"<ul>" +
" <li>item 1</li>" +
" <li>item 2</li>" +
"</ul>";
String two = "" +
"<p>This is text</p>" +
"<ul>" +
" <li>item 1</li>" +
" <li>item 2</li>" +
" <li>item 3</li>" +
"</ul>";
System.out.println(diffSideBySide(one, two));
Outputs:
                   > <p>This is text</p>
<ul>                 <ul>
 <li>item 1</li>      <li>item 1</li>
 <li>item 2</li>      <li>item 2</li>
                   >  <li>item 3</li>
</ul>                </ul>
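If you are curious how a diff like this can work, here is a minimal sketch of a line-by-line diff based on the longest common subsequence. To be clear, this is not Madlep's code and it does not produce the side-by-side layout. It is my own simplified illustration, it expects the input already split into lines, and LineDiff and diff are hypothetical names. Lines only in the first input are prefixed with "<", lines only in the second with ">".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: a simple LCS-based line diff. */
final class LineDiff {

    static List<String> diff(List<String> a, List<String> b) {
        int n = a.size();
        int m = b.size();
        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = a.get(i).equals(b.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        // Walk both lists, following the table to decide what to emit next
        List<String> out = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (a.get(i).equals(b.get(j))) {
                out.add("  " + a.get(i));   // unchanged line
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("< " + a.get(i++)); // only in the first input
            } else {
                out.add("> " + b.get(j++)); // only in the second input
            }
        }
        while (i < n) out.add("< " + a.get(i++));
        while (j < m) out.add("> " + b.get(j++));
        return out;
    }

    public static void main(String[] args) {
        List<String> one = Arrays.asList("<ul>", "<li>item 1</li>", "<li>item 2</li>", "</ul>");
        List<String> two = Arrays.asList("<p>This is text</p>", "<ul>", "<li>item 1</li>",
                "<li>item 2</li>", "<li>item 3</li>", "</ul>");
        for (String line : diff(one, two)) {
            System.out.println(line);
        }
    }
}
```

The table costs O(n*m) memory, which is fine for a couple of web pages but would need a smarter algorithm for very large inputs.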
3 More Things to do with the HTML
This list sure feels incomplete at only 7 of a nice even 10 things. What else could you do with the HTML? I’ll pick the top three from the comments and finish this blog post with them.
Getting friendly with Spring, JUnit and EasyMock.
Here are some steps that can get you using Spring, JUnit and EasyMock all together in some Test Driven Development hotness.
Start by adding the following lines to the top of your unit test. Specifying autowire by name ensures you get the injection you want and stops the Spring errors that occur when your mock-applicationContext.xml contains more than one mock object of the same type. When you specify the Spring JUnit runner, you must provide one or more context configurations with @ContextConfiguration.
MyClassUnitTest.java
...
@Configurable(autowire = Autowire.BY_NAME)
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"classpath:com/some/domain/someproject/resources/mock-applicationContext.xml"})
public class MyClassUnitTest {
...
@Autowired private Collaborator mockCollaborator;
@Before
public void setup() throws Exception {
...
}
@After
public void teardown() throws Exception {
...
}
@Test
public void testMyClass() throws Exception {
//Set your mock behavior here
....
EasyMock.replay(mockCollaborator);
MyClass myClass = new MyClass(mockCollaborator);
myClass.run();
EasyMock.verify(mockCollaborator);
// Other JUnit assertions
...
}
}
Then add your mocks to your mock-applicationContext.xml (an applicationContext in your test resources just for providing your unit tests with spring injection dependencies):
mock-applicationContext.xml
...
<bean id="mockCollaborator" name="mockCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
<constructor-arg value="com.some.domain.someproject.Collaborator"/>
</bean>
<bean id="mockOtherCollaborator" name="mockOtherCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
<constructor-arg value="com.some.domain.someproject.OtherCollaborator"/>
</bean>
...
Ensure your maven pom has the following test-scoped dependency:
pom.xml
...
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>2.5.6</version>
<scope>test</scope>
</dependency>
...
Then use your mocks as normal. Make sure you call EasyMock.reset(mock) in your @After teardown() method so that each of your mocks is reset for each test; the mock beans live in the Spring context, which is shared between tests, so recorded behavior would otherwise carry over. You can also tell Spring to reset the context after a test if needed with @DirtiesContext.
For official documentation and tutorials check out these links.
Spring
http://www.springsource.org/documentation
JUnit
EasyMock
http://easymock.org/Documentation.html
New and Unknown Java Libraries
I like finding new, useful Java libraries. I usually find them from posts like this one. Other times I find them because I have a problem that needs a solution. Here are some of my favorite unknown Java libraries that I have found over the past year. Today I use every one of these in my projects. Interestingly, 4 of 6 are hosted on http://code.google.com.
Google-API-Translate-Java
Provides a simple, unofficial, Java client API for using Google Translate. I use this to translate caption files for videos into several other languages. It has lots of options and has never failed me.
XmlTool
XMLTool is a very simple Java library for all sorts of common operations on an XML document. It offers a single, easy to use class that follows the Fluent Interface pattern to make XML manipulation simpler.
XStream
XStream is a simple library to serialize objects to XML and back again. Also useful for creating JSON responses.
Architecture Rules
Architecture Rules leverages an xml configuration file and optional programmatic configuration to assert your code’s architecture via unit tests or ant tasks. This test is able to assert that specific packages do not depend on others and is able to check for and report on cyclic dependencies among your project’s packages and classes. Get cyclic dependency detection with the Maven 2 plugin and zero configuration.
CyberNeko HTML Parser
NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. This can be used to extract the textual content from an HTML fragment.
Charts4j
charts4j is a free, lightweight charts & graphs Java API. It enables developers to programmatically create the charts available in the Google Chart API through a straightforward and intuitive Java API.
What do you use?
Do you have any new and unknown Java tools that you use and would recommend we check out?