How to Scrape Pages With ColdFusion

January 26, 2008 | Patrick Altoft, Director of Strategy

This is a guest post by Guy from

With the exponential growth of the Internet, data harvesting has become increasingly popular in the last few years. Several web sites sell large databases of information relevant to lawyers, doctors, businesses, schools, just about anything imaginable.

After seeing all this content, I asked myself, “How is all this information compiled?” Surely some poor sap isn’t being paid to manually insert each record. With a little research, I was able to come up with a pretty simple solution using ColdFusion.

To keep things simple, we’re going to harvest data from a single article page. First, open your favorite text editor and drop in the following code:
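A minimal sketch of that snippet, with example.com standing in as a placeholder for the target site’s domain:

    <!--- Fetch the remote page; the domain here is a placeholder --->
    <cfhttp url="http://www.example.com/articles/700.html" method="get">
    <cfset sDoc = cfhttp.FileContent>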

This tells ColdFusion to literally get the contents of the specified page, then store that content in a variable named sDoc.

The following bit of code is where the magic happens. If you’re unfamiliar with regular expressions, now is a great time to learn. Insert the following bit of code after the variable declaration mentioned above:
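A sketch of that declaration, assuming the title sits in a span with class="articleTitle" (the exact attributes depend on the target page’s markup):

    <!--- $1 captures the article's title, $2 the article's body --->
    <cfset regExp = '<span class="articleTitle">([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?</div>([\s\S]*?)</div></div>'>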

Without going into too much detail, this variable tells ColdFusion what to look for, and where. View the source code of the page defined above and go to line 1016. You’ll notice the span tag defined in regExp is on that line. When our application is executed, ColdFusion will begin searching sDoc for that tag. Once located, the data sitting in place of the first expression ([\s\S]*?) will be defined as $1, which is the article’s title. ColdFusion continues searching, and skips over everything between:

</span>[\s\S]*?<div align=[\s\S]*?</div>

until the next expression containing the actual article content is reached. Finally, the match ends when two consecutive </div> tags are reached.

This information should simplify the regular expression creation process: for any piece of information you want to capture for later, use ([\s\S]*?); for anything you want to skip over, use [\s\S]*?.

With our data sets defined, we can output the results into a nice, organized product. Drop in this code:
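A sketch matching the description below; the names qArticles, iStart, and stMatch are placeholders:

    <!--- Virtual query with two columns: title and article --->
    <cfset qArticles = QueryNew("title,article")>

    <!--- Starting point for the search --->
    <cfset iStart = 1>

    <!--- Search sDoc and store each match as a new row --->
    <cfloop condition="iStart GT 0">
        <cfset stMatch = REFind(regExp, sDoc, iStart, true)>
        <cfif stMatch.pos[1] GT 0>
            <cfset QueryAddRow(qArticles)>
            <cfset QuerySetCell(qArticles, "title", Mid(sDoc, stMatch.pos[2], stMatch.len[2]))>
            <cfset QuerySetCell(qArticles, "article", Mid(sDoc, stMatch.pos[3], stMatch.len[3]))>
            <cfset iStart = stMatch.pos[1] + stMatch.len[1]>
        <cfelse>
            <cfset iStart = 0>
        </cfif>
    </cfloop>

    <!--- Display the results; CurrentRow doubles as the unique ID --->
    <table border="1">
    <cfoutput query="qArticles">
        <tr><td>#CurrentRow#</td><td>#title#</td><td>#article#</td></tr>
    </cfoutput>
    </table>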

The code above tells ColdFusion to create a virtual query with two columns: title and article. Next, a starting point for looping through the results is defined. The loop then starts and searches sDoc using the regular expression criteria defined above. Each matching result is parsed, stored in a virtual row under the respective column, and assigned a unique ID. We’re now ready to test our primitive data mining application.

Here’s how our application should look as of now:
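So far, miner.cfm is just the three sketches above combined (placeholder names and all):

    <!--- Fetch the page; the domain is a placeholder --->
    <cfhttp url="http://www.example.com/articles/700.html" method="get">
    <cfset sDoc = cfhttp.FileContent>

    <!--- Regular expression: $1 = title, $2 = article body --->
    <cfset regExp = '<span class="articleTitle">([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?</div>([\s\S]*?)</div></div>'>

    <!--- Virtual query, starting point, and search loop --->
    <cfset qArticles = QueryNew("title,article")>
    <cfset iStart = 1>
    <cfloop condition="iStart GT 0">
        <cfset stMatch = REFind(regExp, sDoc, iStart, true)>
        <cfif stMatch.pos[1] GT 0>
            <cfset QueryAddRow(qArticles)>
            <cfset QuerySetCell(qArticles, "title", Mid(sDoc, stMatch.pos[2], stMatch.len[2]))>
            <cfset QuerySetCell(qArticles, "article", Mid(sDoc, stMatch.pos[3], stMatch.len[3]))>
            <cfset iStart = stMatch.pos[1] + stMatch.len[1]>
        <cfelse>
            <cfset iStart = 0>
        </cfif>
    </cfloop>

    <!--- Display the results --->
    <table border="1">
    <cfoutput query="qArticles">
        <tr><td>#CurrentRow#</td><td>#title#</td><td>#article#</td></tr>
    </cfoutput>
    </table>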

Go ahead and save the file as miner.cfm, or whatever you’d like, and browse to it in your web browser. The article’s title and content are displayed in an organized table.

Here’s a screen shot of data harvested from a site containing US College information:

[Screenshot: US School Data]

OK, that’s nice, but this information is totally useless unless we can dump it into a database, so here’s what we need to do.

After the </cfloop> tag, drop in a modified version of this code:
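A sketch of that query, assuming article_dump has an auto-incrementing id column and myDSN stands in for your datasource name:

    <!--- Insert each harvested row; myDSN is a placeholder datasource --->
    <cfloop query="qArticles">
        <cfquery datasource="myDSN">
            INSERT INTO article_dump (title, content)
            VALUES (
                <cfqueryparam value="#title#" cfsqltype="cf_sql_varchar">,
                <cfqueryparam value="#article#" cfsqltype="cf_sql_longvarchar">
            )
        </cfquery>
    </cfloop>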

The value of datasource is specific to each system; myDSN is just a stand-in, so swap in the name of your own datasource. After defining the appropriate datasource, you can either create a table with three columns (id, title, content) called article_dump, or use an existing table. Just make sure to change the code where necessary. If you refresh miner.cfm in your browser, the data is not only displayed but also inserted into our database.

Let’s take this a step further, and automate the entire process. Go back to the top of miner.cfm and add the following code as the first line:
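Assuming i for the loop index (the 500-to-5000 range matches the pages described below):

    <cfloop index="i" from="500" to="5000">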

Now replace 700.html in the cfhttp url with:
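    #i#.html

Each pass through the loop now requests a different page (url="http://www.example.com/articles/#i#.html", with the domain still a placeholder).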

Scroll to the bottom and add the closing cfloop tag as the last line:
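    </cfloop>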

We just told ColdFusion to visit 500.html, 501.html, 502.html, 503.html, and so on until 5000.html is reached, inserting each set of results into the database before moving on to the next page. With this short piece of code, I’ve created databases with over 20,000 records in less than an hour, and now you can, too.

Here’s the entire final product:
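Assembled from the sketches above (example.com, articleTitle, and myDSN are still placeholders), the final miner.cfm looks something like this:

    <cfloop index="i" from="500" to="5000">

        <!--- Fetch the page for this ID --->
        <cfhttp url="http://www.example.com/articles/#i#.html" method="get">
        <cfset sDoc = cfhttp.FileContent>

        <!--- Regular expression: $1 = title, $2 = article body --->
        <cfset regExp = '<span class="articleTitle">([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?</div>([\s\S]*?)</div></div>'>

        <!--- Virtual query, starting point, and search loop --->
        <cfset qArticles = QueryNew("title,article")>
        <cfset iStart = 1>
        <cfloop condition="iStart GT 0">
            <cfset stMatch = REFind(regExp, sDoc, iStart, true)>
            <cfif stMatch.pos[1] GT 0>
                <cfset QueryAddRow(qArticles)>
                <cfset QuerySetCell(qArticles, "title", Mid(sDoc, stMatch.pos[2], stMatch.len[2]))>
                <cfset QuerySetCell(qArticles, "article", Mid(sDoc, stMatch.pos[3], stMatch.len[3]))>
                <cfset iStart = stMatch.pos[1] + stMatch.len[1]>
            <cfelse>
                <cfset iStart = 0>
            </cfif>
        </cfloop>

        <!--- Display the results --->
        <table border="1">
        <cfoutput query="qArticles">
            <tr><td>#CurrentRow#</td><td>#title#</td><td>#article#</td></tr>
        </cfoutput>
        </table>

        <!--- Insert the results into the database --->
        <cfloop query="qArticles">
            <cfquery datasource="myDSN">
                INSERT INTO article_dump (title, content)
                VALUES (
                    <cfqueryparam value="#title#" cfsqltype="cf_sql_varchar">,
                    <cfqueryparam value="#article#" cfsqltype="cf_sql_longvarchar">
                )
            </cfquery>
        </cfloop>

    </cfloop>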
