Read HTML with Java – Then 7 Fun Things to do to It
There are several ways to get the HTML content of a URL from Java. There are even more ways to get the HTML using open source Java libraries. Last week Lars Vogel shared how to get the HTML using nothing but the SDK.
SDK
final URL url = new URL("http://blog.codehangover.com");
final InputStream inputStream = url.openStream();
final BufferedReader reader
= new BufferedReader(new InputStreamReader(inputStream));
String line;
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
reader.close();
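The loop above decodes the page with the platform default charset and leaves the reader open if readLine() throws. Here is a minimal sketch of one way around both, assuming Java 8 or later and assuming the page is UTF-8. The class name HtmlReader and the method readAll are my own, not part of the SDK, and the in-memory stream only stands in for url.openStream() so the sketch runs offline.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlReader {

    // Reads the whole stream into a String, decoding with the given charset.
    // try-with-resources closes the reader even if readLine() throws.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for new URL("...").openStream(); UTF-8 is an assumption.
        byte[] page = "<html>\n<body>caf\u00e9</body>\n</html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), StandardCharsets.UTF_8));
    }
}
```

In real use you would pass url.openStream() as the first argument and, ideally, the charset named in the response's Content-Type header.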
Apache Commons HttpClient
You can also use Apache Commons HttpClient, a library that is slightly easier to use.
HttpClient client = new HttpClient();
HttpMethod method = new GetMethod("http://blog.codehangover.com");
try {
client.executeMethod(method);
byte[] responseBody = method.getResponseBody();
System.out.println(new String(responseBody));
} catch (Exception e) {
e.printStackTrace();
} finally {
method.releaseConnection();
}
But the real fun comes after you get the HTML. Now you get to work with it.
7 Things to do with HTML Source in Java
1. Extract the Text from the Markup
I worked on a web application that was similar to a feed reader. One of the features that we supported was searching the pages. To do this I needed to extract the text from the markup. I use the CyberNeko HTML Parser for this task.
private String getHtmlFilteredString(Reader reader)
{
// create element remover filter
ElementRemover remover;
remover = new ElementRemover();
remover.removeElement("script");
remover.removeElement("link");
remover.removeElement("style");
remover.removeElement("CDATA");
remover.removeElement("<!--");
remover.removeElement("meta");
OutputStream stream = new ByteArrayOutputStream();
try
{
String encoding = "ISO-8859-1";
XMLDocumentFilter writer = new Writer(stream, encoding);
XMLDocumentFilter[] filters = {remover, writer};
XMLInputSource source = new XMLInputSource(null, null, null, reader, null);
XMLParserConfiguration parser = new HTMLConfiguration();
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);
parser.parse(source);
} catch (Exception e) {
e.printStackTrace();
}
String content = stream.toString().trim();
return content;
}
2. Extract Links
To find all of the links in an HTML fragment, maybe for your own spider, or to extract email addresses, you can use HtmlParser.
Collection<String> links = new ArrayList<String>();
try {
URI uriLink = new URI(url);
Parser parser = new Parser();
parser.setInputHTML(htmlBody);
NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));
for (int i = 0; i < list.size (); i++){
LinkTag extracted = (LinkTag)list.elementAt(i);
String extractedLink = extracted.getLink();
links.add(extractedLink);
}
} catch (Exception e) {
e.printStackTrace();
}
3. Change Links
Using the previous code, instead of calling getLink, you can call setLink to change the href. For example, this might be used by any type of analytics software that needs to track hits.
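As a rough illustration of the analytics case, here is a JDK-only sketch that builds the kind of redirect URL you might hand to setLink. The tracking endpoint http://example.com/track and the names TrackingLinks and toTrackedLink are hypothetical, made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class TrackingLinks {

    // Hypothetical redirect endpoint, used only for illustration.
    static final String TRACKER = "http://example.com/track?url=";

    // Wraps the original link so a click passes through the tracker first.
    public static String toTrackedLink(String originalLink) {
        try {
            // Encode the original URL so it survives as a single query parameter.
            return TRACKER + URLEncoder.encode(originalLink, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toTrackedLink("http://blog.codehangover.com/?p=1"));
    }
}
```

Inside the loop from the previous section, that would look something like extracted.setLink(TrackingLinks.toTrackedLink(extracted.getLink())).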
4. Collect Email Addresses
Using the previous code, before adding the link to the Collection, check whether it uses the mailto protocol by calling the boolean method isMailLink().
for (int i = 0; i < list.size (); i++){
LinkTag extracted = (LinkTag)list.elementAt(i);
if (extracted.isMailLink())
{
String extractedLink = extracted.getLink();
links.add(extractedLink);
}
}
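If you only have raw href strings and no LinkTag objects, the same idea can be sketched with nothing but the JDK. This is not how HtmlParser does it internally; it is just a minimal illustration, and the names MailLinks and extractAddresses are my own. It assumes anything after a "?" (such as a subject) is not part of the address.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MailLinks {

    private static final String PREFIX = "mailto:";

    // Keeps only mailto: links and returns the bare addresses.
    public static List<String> extractAddresses(List<String> hrefs) {
        List<String> addresses = new ArrayList<String>();
        for (String href : hrefs) {
            String trimmed = href.trim();
            // The scheme is case-insensitive, so compare in lower case.
            if (!trimmed.toLowerCase(Locale.ROOT).startsWith(PREFIX)) {
                continue;
            }
            String address = trimmed.substring(PREFIX.length());
            // Drop an optional query part such as ?subject=Hello.
            int query = address.indexOf('?');
            if (query >= 0) {
                address = address.substring(0, query);
            }
            if (!address.isEmpty()) {
                addresses.add(address);
            }
        }
        return addresses;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "http://blog.codehangover.com",
                "mailto:someone@example.com?subject=Hello");
        System.out.println(extractAddresses(hrefs));
    }
}
```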
5. Collect Images
Still using HtmlParser, you can extract images by filtering on the ImageTag.
Collection<String> imageUrls = new ArrayList<String>();
try {
URI uriLink = new URI(url);
Parser parser = new Parser();
parser.setInputHTML(htmlBody);
NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));
for (int i = 0; i < list.size (); i++){
ImageTag extracted = (ImageTag)list.elementAt(i);
String extractedImageSrc = extracted.getImageUrl();
imageUrls.add(extractedImageSrc);
}
} catch (Exception e) {
e.printStackTrace();
}
6. Add Syntax Highlighting
If you’re going to present the source on screen, you have to add syntax highlighting. If you’re displaying it on a web page, I would highly recommend using something like SyntaxHighlighter, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called Syntax, which is now several years old.
try {
ToHTML toHTML = new ToHTML();
toHTML.setInput(new FileReader("Source.java"));
toHTML.setOutput(new FileWriter("Source.java.html"));
toHTML.setMimeType("text/x-java");
toHTML.setFileExt("java");
toHTML.writeFullHTML();
} catch (Exception e){
e.printStackTrace();
}
7. Diff Two Sources
Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.
Madlep wrote his own method to handle the diff and output the results over on Stack Overflow.
Given some simple input, his code gives simple output.
String one = "" +
"<ul>" +
" <li>item 1</li>" +
" <li>item 2</li>" +
"</ul>";
String two = "" +
"<p>This is text</p>" +
"<ul>" +
" <li>item 1</li>" +
" <li>item 2</li>" +
" <li>item 3</li>" +
"</ul>";
System.out.println(diffSideBySide(one, two));
Outputs
> <p>This is text</p>
<ul> <ul>
<li>item 1</li> <li>item 1</li>
<li>item 2</li> <li>item 2</li>
> <li>item 3</li>
</ul> </ul>
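Madlep's code isn't reproduced here, so to give a feel for how such a method can work, here is my own minimal sketch of a line-based side-by-side diff built on a longest-common-subsequence table. It is not his implementation. The class name SideBySideDiff, the 20-character column width, and the "<" marker for left-only lines are all assumptions of this sketch, and it takes the input already split into lines.

```java
import java.util.ArrayList;
import java.util.List;

public class SideBySideDiff {

    // Returns one row per line: unchanged lines appear in both columns,
    // "<" marks a line only in the left input, ">" a line only in the right.
    public static List<String> diff(String[] left, String[] right) {
        int n = left.length;
        int m = right.length;
        // lcs[i][j] = length of the longest common subsequence
        // of left[i..] and right[j..]
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = left[i].equals(right[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }
        List<String> rows = new ArrayList<String>();
        int i = 0;
        int j = 0;
        while (i < n || j < m) {
            if (i < n && j < m && left[i].equals(right[j])) {
                rows.add(row(left[i++], " ", right[j++]));   // same on both sides
            } else if (j < m && (i == n || lcs[i][j + 1] >= lcs[i + 1][j])) {
                rows.add(row("", ">", right[j++]));          // added on the right
            } else {
                rows.add(row(left[i++], "<", ""));           // removed from the left
            }
        }
        return rows;
    }

    // Pads the left column to a fixed width so the right column lines up.
    private static String row(String left, String marker, String right) {
        return String.format("%-20s %s %s", left, marker, right);
    }

    public static void main(String[] args) {
        String[] one = {"<ul>", "  <li>item 1</li>", "  <li>item 2</li>", "</ul>"};
        String[] two = {"<p>This is text</p>", "<ul>", "  <li>item 1</li>",
                "  <li>item 2</li>", "  <li>item 3</li>", "</ul>"};
        for (String row : diff(one, two)) {
            System.out.println(row);
        }
    }
}
```

The table costs memory proportional to the product of the two line counts, which is fine for a single page but worth keeping in mind for very large documents.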
3 More Things to do with the HTML
This list sure feels incomplete at only 7 of a nice even 10 things. What else could you do with the HTML? I'll pick the top three from the comments and finish this blog post with them.
Getting friendly with Spring, JUnit and EasyMock.
Here are some steps that can get you using Spring, JUnit and EasyMock all together in some Test Driven Development hotness.
Start by adding the following lines to the top of your unit test. Specifying autowire by name ensures you get the injection you want and stops Spring from complaining that there is more than one mock object of the same type in your mock-applicationContext.xml. When you specify the Spring JUnit runner, you must provide one or more context configurations with @ContextConfiguration.
MyClassUnitTest.java
...
@Configurable(autowire = Autowire.BY_NAME)
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(locations = {"classpath:com/some/domain/someproject/resources/mock-applicationContext.xml"})
public class MyClassUnitTest {
...
@Autowired private Collaborator mockCollaborator;
@Before
public void setup() throws Exception {
...
}
@After
public void teardown() throws Exception {
...
}
@Test
public void testMyClass() throws Exception {
//Set your mock behavior here
....
EasyMock.replay(mockCollaborator);
MyClass myClass = new MyClass(mockCollaborator);
myClass.run();
EasyMock.verify(mockCollaborator);
// Other JUnit assertions
...
}
}
Then add your mocks to your mock-applicationContext.xml (an applicationContext in your test resources that exists only to provide your unit tests with Spring-injected dependencies):
mock-applicationContext.xml
...
<bean id="mockCollaborator" name="mockCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
<constructor-arg value="com.some.domain.someproject.Collaborator"/>
</bean>
<bean id="mockOtherCollaborator" name="mockOtherCollaborator" class="org.easymock.EasyMock" factory-method="createStrictMock">
<constructor-arg value="com.some.domain.someproject.OtherCollaborator"/>
</bean>
...
Ensure your Maven POM has the following test-scoped dependency:
pom.xml
...
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>2.5.6</version>
<scope>test</scope>
</dependency>
...
Then use your mocks as normal. Make sure you call EasyMock.reset(mock) in your @After teardown() method so that each of your mocks is reset before the next test. You can also tell Spring to reset the context after a test, if needed, with @DirtiesContext.
For official documentation and tutorials check out these links.
Spring
http://www.springsource.org/documentation
JUnit
EasyMock
http://easymock.org/Documentation.html
Load Multiple Contexts into Spring
I have already argued that many application contexts are better than a single application context. But how do you load more than one context?
There are a couple of ways to do this.
web.xml contextConfigLocation
Your first option is to load them all into your web application context via the contextConfigLocation context parameter. You’re already going to have your primary applicationContext here, assuming you’re writing a web application. All you need to do is separate each additional context file from the previous one with white space.
<context-param>
<param-name>
contextConfigLocation
</param-name>
<param-value>
applicationContext1.xml
applicationContext2.xml
</param-value>
</context-param>
<listener>
<listener-class>
org.springframework.web.context.ContextLoaderListener
</listener-class>
</listener>
The above uses line breaks. Alternatively, you could just put in a space.
<context-param>
<param-name>
contextConfigLocation
</param-name>
<param-value>
applicationContext1.xml applicationContext2.xml
</param-value>
</context-param>
<listener>
<listener-class>
org.springframework.web.context.ContextLoaderListener
</listener-class>
</listener>
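Both forms end up as the same list of files because Spring tokenizes the parameter value on commas, semicolons, and white space (the delimiter set documented as ConfigurableWebApplicationContext.CONFIG_LOCATION_DELIMITERS). Here is a JDK-only sketch that mimics that splitting so you can see the effect. It is an imitation, not Spring's own code, and the names ConfigLocations and split are my own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ConfigLocations {

    // Mirrors the delimiter set Spring documents for contextConfigLocation:
    // comma, semicolon, space, tab, and newline.
    static final String DELIMITERS = ",; \t\n";

    // Splits a contextConfigLocation value into individual file locations.
    public static List<String> split(String paramValue) {
        List<String> locations = new ArrayList<String>();
        StringTokenizer tokens = new StringTokenizer(paramValue, DELIMITERS);
        while (tokens.hasMoreTokens()) {
            // trim() also removes a stray carriage return from Windows line endings
            String location = tokens.nextToken().trim();
            if (!location.isEmpty()) {
                locations.add(location);
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        String multiLine = "\n  applicationContext1.xml\n  applicationContext2.xml\n";
        String singleLine = "applicationContext1.xml applicationContext2.xml";
        System.out.println(split(multiLine));
        System.out.println(split(singleLine));
    }
}
```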
applicationContext.xml import resource
Your other option is to just add your primary applicationContext.xml to the web.xml and then use import statements in that primary context.
In applicationContext.xml you might have…
<!-- hibernate configuration and mappings -->
<import resource="applicationContext-hibernate.xml"/>
<!-- ldap -->
<import resource="applicationContext-ldap.xml"/>
<!-- aspects -->
<import resource="applicationContext-aspects.xml"/>
Which strategy should you use?
I always prefer to load up via web.xml. This allows me to keep all contexts isolated from each other. With tests, we can load just the contexts that we need to run those tests. This also makes development more modular, as components stay loosely coupled, so that in the future I can extract a package or vertical layer and move it to its own module.
If you are loading contexts into a non-web application, I would use the import resource.
Do you see any benefits to going with the application context import method over the web.xml contextConfigLocation?