<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Code Hangover &#187; parse</title>
	<atom:link href="http://blog.codehangover.com/tag/parse/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.codehangover.com</link>
	<description>Go ahead, have another</description>
	<lastBuildDate>Tue, 22 Mar 2011 15:49:48 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Read HTML with Java &#8211; Then 7 Fun Things to do to It</title>
		<link>http://blog.codehangover.com/read-html-with-java-then-7-fun-things-to-do-to-it/</link>
		<comments>http://blog.codehangover.com/read-html-with-java-then-7-fun-things-to-do-to-it/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 02:57:10 +0000</pubDate>
		<dc:creator>MikeNereson</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[parse]]></category>

		<guid isPermaLink="false">http://blog.codehangover.com/?p=214</guid>
		<description><![CDATA[There are several ways to get the HTML content of a URL from Java. There are even more ways to get the HTML using open source Java libraries.]]></description>
			<content:encoded><![CDATA[<p>There are several ways to get the HTML content of a URL from Java. There are even more ways to get the HTML using open source Java libraries. Last week Lars Vogel shared <a href="http://www.vogella.de/articles/JavaNetworking/article.html" target="_blank">how to get the HTML using nothing but the SDK</a>.</p>
<h2>SDK</h2>
<pre class="brush: java;">
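import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

// Open the URL as a byte stream, then wrap it in a reader to get lines of text
final URL url = new URL(&quot;http://blog.codehangover.com&quot;);
final InputStream inputStream = url.openStream();
final BufferedReader reader
             = new BufferedReader(new InputStreamReader(inputStream));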

String line;

while ((line = reader.readLine()) != null) {
   System.out.println(line);
}

reader.close();
</pre>
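<p>The later examples work on the whole page as a single <tt>String</tt> (called <tt>htmlBody</tt> there) rather than printing it line by line. Here is a minimal sketch of collecting the lines instead, using only the JDK. The class and method names (<tt>HtmlSlurper</tt>, <tt>readAll</tt>) are made up for this illustration, and the demo reads from a <tt>StringReader</tt> so it runs without a network connection.</p>

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of any library mentioned in this post.
final class HtmlSlurper {

    // Reads every line from the source and joins them with '\n',
    // so "\r\n" line endings come out normalized.
    static String readAll(Reader source) {
        StringBuilder html = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                html.append(line).append('\n');
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers are not forced to catch it
            throw new UncheckedIOException(e);
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // Stand-in for new InputStreamReader(url.openStream())
        Reader source = new StringReader("<html>\n<body>Hi</body>\n</html>");
        String htmlBody = readAll(source);
        System.out.print(htmlBody);
    }
}
```

<p>For a live page you would pass <tt>new InputStreamReader(url.openStream())</tt> instead, ideally with the charset the server declares.</p>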
<h2>Apache Commons HttpClient</h2>
<p>You can also use the <a href="http://hc.apache.org/httpclient-3.x/" target="_blank">Apache Commons HttpClient</a>, a library that is slightly easier to use.</p>
<pre class="brush: java;">
HttpClient client = new HttpClient();
HttpMethod method = new GetMethod(&quot;http://blog.codehangover.com&quot;);

try {
 client.executeMethod(method);

 byte[] responseBody = method.getResponseBody();

 System.out.println(new String(responseBody));

} catch (Exception e) {

 e.printStackTrace();

} finally {

 method.releaseConnection();
}
</pre>
<p>But the real fun comes after you get the HTML. Now you get to work with it.</p>
<h2>7 Things to do with HTML Source in Java</h2>
<h4>1. Extract the Text from the Markup</h4>
<p>I worked on a web application that was similar to a feed reader. One of the features that we supported was searching the pages. To do this I needed to extract the text from the markup. I used the <a href="http://nekohtml.sourceforge.net" target="_blank">CyberNeko HTML Parser</a> for this task.</p>
<pre class="brush: java;">
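import org.apache.xerces.xni.parser.XMLDocumentFilter;
import org.apache.xerces.xni.parser.XMLInputSource;
import org.apache.xerces.xni.parser.XMLParserConfiguration;
import org.cyberneko.html.HTMLConfiguration;
import org.cyberneko.html.filters.ElementRemover;
import org.cyberneko.html.filters.Writer; // NekoHTML's Writer filter, not java.io.Writer

private String getHtmlFilteredString(Reader reader)
{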

  // create element remover filter
  ElementRemover remover;
  remover = new ElementRemover();
  remover.removeElement(&quot;script&quot;);
  remover.removeElement(&quot;link&quot;);
  remover.removeElement(&quot;style&quot;);
  remover.removeElement(&quot;CDATA&quot;);
  remover.removeElement(&quot;&lt;!--&quot;);
  remover.removeElement(&quot;meta&quot;);

  OutputStream stream = new ByteArrayOutputStream();

  try
  {
    String encoding = &quot;ISO-8859-1&quot;;
    XMLDocumentFilter writer = new Writer(stream, encoding);

    XMLDocumentFilter[] filters = {remover, writer};

    XMLInputSource source = new XMLInputSource(null, null, null, reader, null);

    XMLParserConfiguration parser = new HTMLConfiguration();
    parser.setProperty(&quot;http://cyberneko.org/html/properties/filters&quot;, filters);

    parser.parse(source);

  } catch (Exception e) {

    e.printStackTrace();
  }

  String content = stream.toString().trim();

  return content;
}
</pre>
<h4>2. Extract Links</h4>
<p>To find all of the links in an HTML fragment, whether for your own spider or to extract email addresses, you can use <a href="http://htmlparser.sourceforge.net" target="_blank">HtmlParser</a>.</p>
<pre class="brush: java;">

Collection&lt;String&gt; links = new ArrayList&lt;String&gt;();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (LinkTag.class));

  for (int i = 0; i &lt; list.size (); i++){
    LinkTag extracted = (LinkTag)list.elementAt(i);
    String extractedLink = extracted.getLink();
    links.add(extractedLink);
  }

} catch (Exception e) {

  e.printStackTrace();
}
</pre>
<h4>3. Change Links</h4>
<p>Using the previous code, instead of calling <tt>getLink</tt>, you can call <tt>setLink</tt> to change the href. For example, this might be used by any type of analytics software that needs to track hits.</p>
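<p>To make the idea concrete without pulling in a library, here is a rough sketch that rewrites every <tt>href</tt> so it points at a tracking redirect. Note that this swaps in a plain <tt>java.util.regex</tt> approach rather than HtmlParser&#8217;s <tt>setLink</tt>, and regular expressions are fragile on real-world HTML (single-quoted or unquoted attributes are ignored here). The class name <tt>LinkRewriter</tt> and the tracker URL are assumptions made up for this example.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a regex stand-in for LinkTag.setLink, JDK only.
final class LinkRewriter {

    // Group 1: the tag up to and including the opening quote of href.
    // Group 2: the URL itself. Only double-quoted hrefs are handled.
    private static final Pattern HREF = Pattern.compile(
            "(<a\\b[^>]*?\\bhref=\")([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Prefixes each link target with trackerPrefix, URL-encoding the original.
    static String rewriteLinks(String html, String trackerPrefix) {
        Matcher matcher = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String target = matcher.group(2);
            String replacement;
            if (target.toLowerCase().startsWith("mailto:")) {
                replacement = matcher.group(); // leave mail links untouched
            } else {
                replacement = matcher.group(1) + trackerPrefix + encode(target) + "\"";
            }
            // quoteReplacement keeps '$' and '\' in URLs from being interpreted
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.org/a?b=1\">a link</a></p>";
        // The tracker URL below is a placeholder, not a real service
        System.out.println(rewriteLinks(html, "http://example.com/track?to="));
    }
}
```

<p>With HtmlParser you would do the same thing inside the loop from step 2 by calling <tt>setLink</tt> on each <tt>LinkTag</tt>, which avoids the regex pitfalls.</p>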
<h4>4. Collect Email Addresses</h4>
<p>Using the link-extraction loop from step 2, before adding the link to the Collection, check whether it uses the mailto protocol by calling the <tt>boolean</tt> method <tt>isMailLink()</tt>.</p>
<pre class="brush: java;">
  for (int i = 0; i &lt; list.size (); i++){

    LinkTag extracted = (LinkTag)list.elementAt(i);

    if (extracted.isMailLink())
    {
      String extractedLink = extracted.getLink();
      links.add(extractedLink);
    }
  }
</pre>
<h4>5. Collect Images</h4>
<p>Still using <a href="http://htmlparser.sourceforge.net/" target="_blank">HtmlParser</a>, you can extract images by filtering on the <tt>ImageTag</tt>.</p>
<pre class="brush: java;">

Collection&lt;String&gt; imageUrls = new ArrayList&lt;String&gt;();

try {

  URI uriLink = new URI(url);
  Parser parser = new Parser();
  parser.setInputHTML(htmlBody);
  NodeList list = parser.extractAllNodesThatMatch(new NodeClassFilter (ImageTag.class));

  for (int i = 0; i &lt; list.size (); i++){
    ImageTag extracted = (ImageTag)list.elementAt(i);
    String extractedImageSrc = extracted.getImageURL();
    imageUrls.add(extractedImageSrc);
  }

} catch (Exception e) {

  e.printStackTrace();
}
</pre>
<h4>6. Add Syntax Highlighting</h4>
<p>If you&#8217;re going to present the source on screen you <em>have</em> to add syntax highlighting. If you&#8217;re displaying it on a web page, I would highly recommend something like <a href="http://code.google.com/p/syntaxhighlighter/">SyntaxHighlighter</a>, which is what we use on this blog. If you are displaying it in a Swing app, you can use a tool called <a href="http://ostermiller.org/syntax/" target="_blank">Syntax</a>, which is now several years old.</p>
<pre class="brush: java;">
try {

     ToHTML toHTML = new ToHTML();

     toHTML.setInput(new FileReader(&quot;Source.java&quot;));
     toHTML.setOutput(new FileWriter(&quot;Source.java.html&quot;));
     toHTML.setMimeType(&quot;text/x-java&quot;);
     toHTML.setFileExt(&quot;java&quot;);

     toHTML.writeFullHTML();

 } catch (Exception e){

    e.printStackTrace();
 }
</pre>
<h4>7. Diff Two Sources</h4>
<p>Once you get the HTML source of two URLs, or you have an old version of the HTML stored somewhere, you might want a nicely formatted diff output.</p>
<p><a href="http://stackoverflow.com/users/14160/madlep" target="_blank">Madlep</a> wrote his own method to handle the diff and output the results <a href="http://stackoverflow.com/questions/319479/generate-formatted-diff-output-in-java/319857#319857" target="_blank">over on Stack Overflow</a>.</p>
<p>Given some simple input, his code gives simple output.</p>
<pre class="brush: java;">

String one = &quot;&quot; +
    &quot;&lt;ul&gt;&quot; +
    &quot;  &lt;li&gt;item 1&lt;/li&gt;&quot; +
    &quot;  &lt;li&gt;item 2&lt;/li&gt;&quot; +
    &quot;&lt;/ul&gt;&quot;;

String two = &quot;&quot; +
    &quot;&lt;p&gt;This is text&lt;/p&gt;&quot; +
    &quot;&lt;ul&gt;&quot; +
    &quot;  &lt;li&gt;item 1&lt;/li&gt;&quot; +
    &quot;  &lt;li&gt;item 2&lt;/li&gt;&quot; +
    &quot;  &lt;li&gt;item 3&lt;/li&gt;&quot; +
    &quot;&lt;/ul&gt;&quot;;

System.out.println(diffSideBySide(one, two));
</pre>
<p>Outputs</p>
<pre>                     <strong>&gt;</strong>  &lt;p&gt;This is text&lt;/p&gt;
&lt;ul&gt;                    &lt;ul&gt;
 &lt;li&gt;item 1&lt;/li&gt;          &lt;li&gt;item 1&lt;/li&gt;
 &lt;li&gt;item 2&lt;/li&gt;          &lt;li&gt;item 2&lt;/li&gt;
                     <strong>&gt;</strong>    &lt;li&gt;item 3&lt;/li&gt;
&lt;/ul&gt;                   &lt;/ul&gt;</pre>
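<p>If you just want to see how a line diff works, here is a small sketch based on the classic longest-common-subsequence table. This is not Madlep&#8217;s code: the class name <tt>LineDiff</tt> is invented for this example, it assumes the two inputs are split into lines with <tt>\n</tt>, and it prefixes each line with <tt>&lt;</tt> or <tt>&gt;</tt> instead of laying out two columns.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: a line-based diff using a longest-common-subsequence table.
final class LineDiff {

    // Returns one row per line: "  " for unchanged lines,
    // "< " for lines only in 'one', "> " for lines only in 'two'.
    static List<String> diff(String one, String two) {
        String[] a = one.split("\n");
        String[] b = two.split("\n");

        // lcs[i][j] = length of the longest common subsequence
        // of a[i..end] and b[j..end], filled from the bottom right
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk the table from the top left, emitting one row per step
        List<String> rows = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length || j < b.length) {
            if (i < a.length && j < b.length && a[i].equals(b[j])) {
                rows.add("  " + a[i]);
                i++;
                j++;
            } else if (j < b.length
                    && (i == a.length || lcs[i][j + 1] >= lcs[i + 1][j])) {
                rows.add("> " + b[j]); // line added in 'two'
                j++;
            } else {
                rows.add("< " + a[i]); // line removed from 'one'
                i++;
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String one = "<ul>\n  <li>item 1</li>\n  <li>item 2</li>\n</ul>";
        String two = "<p>This is text</p>\n<ul>\n  <li>item 1</li>\n"
                + "  <li>item 2</li>\n  <li>item 3</li>\n</ul>";
        for (String row : diff(one, two)) {
            System.out.println(row);
        }
    }
}
```

<p>The table costs memory proportional to the product of the two line counts, which is fine for a single page but worth keeping in mind for very large documents.</p>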
<h2>3 More Things to do with the HTML</h2>
<p>This list sure feels incomplete at only 7 of a nice even 10 things. What else could you do with the HTML? I&#8217;ll pick the top three from the comments and finish this blog post with them.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.codehangover.com/read-html-with-java-then-7-fun-things-to-do-to-it/feed/</wfw:commentRss>
		<slash:comments>16</slash:comments>
		</item>
	</channel>
</rss>

