How to Scrape Pages With ColdFusion

This is a guest post by Guy from nullamatix.com

With the exponential growth of the Internet, data harvesting has become increasingly popular in the last few years. Several web sites sell large databases of information relevant to lawyers, doctors, businesses, schools, just about anything imaginable.

After seeing all this content, I asked myself, “How is all this information compiled?” Surely some poor sap isn’t being paid to manually insert each record. With a little research, I was able to come up with a pretty simple solution using ColdFusion.

To keep things simple, we’re going to harvest data from articles-hub.com. First, open your favorite text editor and drop in the following code:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>

This tells ColdFusion to fetch the specified page with an HTTP GET request, then store the trimmed response body in a variable named sDoc.
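Before parsing anything, it can help to confirm that the request actually succeeded. The following is a minimal sketch of one way to do that, assuming your ColdFusion version supports the result and timeout attributes of cfhttp. The variable name httpResult and the 15-second timeout are arbitrary choices for illustration, not part of the original script.

```cfml
<!--- Fetch the page; give up after 15 seconds (the timeout value is an assumption) --->
<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET"
        timeout="15" result="httpResult">

<!--- statusCode is a string such as "200 OK", so compare its first three characters --->
<cfif left(httpResult.statusCode, 3) eq "200">
  <cfset sDoc = trim(httpResult.fileContent)>
<cfelse>
  <!--- Leave sDoc empty so later pattern matching simply finds nothing --->
  <cfset sDoc = "">
  <cfoutput>Request failed: #htmlEditFormat(httpResult.statusCode)#</cfoutput>
</cfif>
```

Setting sDoc to an empty string on failure means the rest of the script can run unchanged and will just produce no rows for that page.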

The following bit of code is where the magic happens. If you’re unfamiliar with regular expressions, now is a great time to learn. Insert the following bit of code after the variable declaration mentioned above:

<cfset regExp = '<span class="article_display_title" >
        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>

Without going into too much detail, this variable tells ColdFusion what to look for, and where. View the source code of the page defined above and go to line 1016. You’ll notice the span tag defined in regExp is on that line. When our application is executed, ColdFusion will begin searching sDoc for that tag. Once located, the text matched by the first capturing group ([\s\S]*?) becomes the first subexpression ($1), which is the article’s title. ColdFusion then continues, skipping over everything matched by:

</span>[\s\S]*?<div align=[\s\S]*?</div>

until the next capturing group, which holds the actual article content, is reached. Finally, the pattern ends when the two consecutive </div> tags are reached.

This information should simplify the regular expression creation process. For any piece of information you want to store for later, use ([\s\S]*?). For anything you want to skip over, use [\s\S]*?. The class [\s\S] matches any character, including line breaks, and the trailing ? makes the match stop as early as possible.
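To see the capture-and-skip idea in isolation, here is a small sketch that runs the same kind of pattern against a short hard-coded string instead of a fetched page. The sample markup and the variable names sample, pattern and m are made up for illustration.

```cfml
<!--- A tiny stand-in for a fetched page (hypothetical markup) --->
<cfset sample = '<h1>My Title</h1><p class="ad">skip me</p><div>Body text</div>'>

<!--- Capture the h1 text, skip everything up to the div, capture the div text --->
<cfset pattern = '<h1>([\s\S]*?)</h1>[\s\S]*?<div>([\s\S]*?)</div>'>

<!--- The fourth argument asks for the position and length of each subexpression --->
<cfset m = REFindNoCase(pattern, sample, 1, true)>

<cfif m.pos[1]>
  <!--- Index 1 is the whole match; indexes 2 and 3 are the two captured groups --->
  <cfset title = mid(sample, m.pos[2], m.len[2])>
  <cfset body  = mid(sample, m.pos[3], m.len[3])>
  <cfoutput>#title# | #body#</cfoutput>
</cfif>
```

Note that the pos and len arrays are offset by one from the $1, $2 numbering: the first captured group lives at index 2.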

With our data sets defined, we can output the results into a nice, organized product. Drop in this code:

<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>

The code above tells ColdFusion to create a virtual query with two columns: title and article. Next, a starting position for the search is defined. The loop then searches sDoc with the regular expression defined above. Each match is added as a new row, with the captured title and article text stored in their respective columns, and the search resumes just past the end of that match. When no further match is found, start becomes 0 and the loop ends. We’re now ready to test our primitive data mining application.
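Once the query is filled, its rows can be walked one at a time. The sketch below builds a small stand-in query with the same two columns (the sample values are invented) and prints a summary line per row, which is a lighter alternative to dumping the whole structure.

```cfml
<!--- Build a small stand-in query with the same columns (sample values are made up) --->
<cfset q_srch = queryNew("title,article")>
<cfset queryAddRow(q_srch)>
<cfset querySetCell(q_srch, "title", "First title")>
<cfset querySetCell(q_srch, "article", "First article body")>
<cfset queryAddRow(q_srch)>
<cfset querySetCell(q_srch, "title", "Second title")>
<cfset querySetCell(q_srch, "article", "Second article body")>

<!--- recordCount is the number of rows; currentRow is the row being visited --->
<cfoutput>Rows found: #q_srch.recordCount#<br></cfoutput>
<cfoutput query="q_srch">
  #q_srch.currentRow#: #htmlEditFormat(q_srch.title)# (#len(q_srch.article)# characters)<br>
</cfoutput>
```

Checking recordCount is also a convenient way to tell whether the regular expression matched anything at all on a given page.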

Here’s how our application should look as of now:

<cfhttp url="http://www.articles-hub.com/Article/700.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" >

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
<cfdump var="#q_srch#">

Go ahead and save the file as miner.cfm, or whatever you’d like, and browse to that file in your web browser. For example, http://192.168.230.239:80/miner.cfm. The article’s title and content are displayed in an organized table.

Here’s a screen shot of data harvested from a site containing US College information:

[Screenshot: US School Data]

Ok, that’s nice, but this information is totally useless unless we can dump it into a database, so here’s what we need to do.

After the </cfloop> tag, drop in a modified version of this code:

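<cfquery name="insert_data" datasource="localdev">
<!--- cfqueryparam binds the scraped text as parameters rather than splicing it into the SQL --->
INSERT INTO article_dump (title, content)
VALUES (<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>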

The datasource value is specific to each system; localdev just happens to be the name of mine. After defining the appropriate datasource, you can either create a table called article_dump with 3 columns (id, title, content), or use an already existing table. Just make sure to change the code where necessary. If you refresh miner.cfm in your browser, the data is not only displayed, but inserted into our database, too.
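If you are creating the table from scratch, something along these lines should work. This is only a sketch: it assumes a MySQL datasource named localdev, the column types and sizes are guesses, and the stand-in query holds invented values. It also loops over the query so that every scraped row is inserted, not just the first.

```cfml
<!--- One-time setup: create the table (MySQL syntax; column sizes are assumptions) --->
<cfquery datasource="localdev">
CREATE TABLE IF NOT EXISTS article_dump (
  id INT AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255),
  content TEXT
)
</cfquery>

<!--- Stand-in for the scraped results --->
<cfset q_srch = queryNew("title,article")>
<cfset queryAddRow(q_srch)>
<cfset querySetCell(q_srch, "title", "Sample title")>
<cfset querySetCell(q_srch, "article", "Sample article body")>

<!--- Insert one database row per scraped row; nothing runs when the query is empty --->
<cfloop query="q_srch">
  <cfquery datasource="localdev">
  INSERT INTO article_dump (title, content)
  VALUES (
    <cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,
    <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">
  )
  </cfquery>
</cfloop>
```

Because the insert sits inside a query loop, a page where the pattern found nothing adds no rows to the table.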

Let’s take this a step further, and automate the entire process. Go back to the top of miner.cfm and add the following code as the first line:

<cfloop from="500" to="5000" index="LoopCount">

Now replace 700.html on the second line with:

#LoopCount#.html

Scroll to the bottom and add the closing cfloop tag as the last line:

</cfloop>

We just told ColdFusion to visit 500.html, 501.html, 502.html, 503.html, etc., until 5000.html is reached, and to insert each set of results into the database before moving on to the next. With this short piece of code, I’ve created databases with over 20,000 records in less than an hour, and now you can, too.
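A loop like this sends thousands of requests back to back, so some pages may fail or time out along the way. Here is a hedged sketch of a gentler version. It assumes ColdFusion 8 or later for sleep(); the one-second pause, the 15-second timeout, the one-hour request limit and the failedPages counter are all illustrative choices rather than part of the original script.

```cfml
<!--- Allow the whole run up to an hour before ColdFusion times the request out (assumed value) --->
<cfsetting requestTimeout="3600">
<cfset failedPages = 0>

<!--- A short range for testing; widen it once the script behaves --->
<cfloop from="500" to="505" index="LoopCount">
  <cfhttp url="http://www.articles-hub.com/Article/#LoopCount#.html" method="GET"
          timeout="15" result="httpResult">

  <cfif left(httpResult.statusCode, 3) eq "200">
    <cfset sDoc = trim(httpResult.fileContent)>
    <!--- ... run the regular expression and database insert on sDoc here ... --->
  <cfelse>
    <!--- Count the failure and move on instead of parsing an error page --->
    <cfset failedPages = failedPages + 1>
  </cfif>

  <!--- Pause one second between requests; sleep() takes milliseconds --->
  <cfset sleep(1000)>
</cfloop>

<cfoutput>Pages that could not be fetched: #failedPages#</cfoutput>
```

Testing with a handful of pages first makes it much easier to spot a pattern that does not match before committing to the full range.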

Here’s the entire final product:

<cfloop from="500" to="5000" index="LoopCount">
<cfhttp url="http://www.articles-hub.com/Article/#loopcount#.html" method="GET">
<cfset sDoc = trim(cfhttp.fileContent)>
<cfset regExp = '<span class="article_display_title" >

        ([\s\S]*?)</span>[\s\S]*?<div align=[\s\S]*?
</div>
    ([\s\S]*?)
          </div>
            </div>'>
<cfset q_srch = queryNew("title, article")>
<cfset start = 1>
<cfloop condition="#start#">
  <cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
  <cfif stResult.pos[1]>
     <cfset queryAddRow(q_srch)>
     <cfset querySetCell(q_srch,"article",mid(sDoc,stResult.pos[3],stResult.len[3]))>
     <cfset querySetCell(q_srch,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
  </cfif>
  <cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
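<cfquery name="insert_data" datasource="localdev">
<!--- cfqueryparam binds the scraped text as parameters rather than splicing it into the SQL --->
INSERT INTO article_dump (title, content)
VALUES (<cfqueryparam value="#q_srch.title#" cfsqltype="cf_sql_varchar">,
        <cfqueryparam value="#q_srch.article#" cfsqltype="cf_sql_longvarchar">)
</cfquery>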
</cfloop>

By Patrick Altoft at 1:03am on Saturday, 26 January 2008

Patrick Altoft is Director of Search at Branded3 and has worked in the SEO industry for over 10 years. With experience across some of the world’s largest brands as well as startup businesses, Patrick is well known in the industry and speaks regularly at the major SEO conferences and events.

Comments

  • http://www.grademoney.com Desmond

    what the heck!?

    Great info

  • http://www.blogstorm.co.uk Patrick Altoft

    I like this stuff because although it’s useless to anybody who doesn’t want to scrape with ColdFusion it is gold dust to the people who do.

  • http://fka200.com/ Sammy Ashouri

    Seems cool. Too bad I have 0 experience with coding.

  • http://www.nullamatix.com Guy Patterson

    Well said. This little bit of code has near endless potential. If you’re unfamiliar with Adobe’s ColdFusion, I highly recommend the free, open-source alternative called “The Smith Project.” Just Google that phrase and check it out.

    Setup IIS, Smith Project, and MySQL locally, and let the data harvesting begin :)

  • http://www.howardyoung.info Howard Young

    A truly brilliant script! I’ve never done anything with ColdFusion before, but the language looks very powerful.

  • http://www.nullamatix.com Guy Patterson

    That’s the point of this tutorial.. is there something in particular you’re having difficulty with?

  • Jason

    Thanks for the great tutorial. However, I’m running through 2500 links and scraping data from each one. I have a database that holds all the URL’s and I loop through the query of those URL’s to get the data I need. However.. it eventually gets to a page where it says element at pos[2] cannot be found. So I check my RegExp’s on that page and everything runs smooth.. I run the program again and it stops on a different page. I’m thinking the page is timing out when coldfusion tries to request it.. perhaps because I’m requesting so many so fast. Any ideas?

  • http://www.nullamatix.com Guy Patterson

    Jason,

    You were on track with examining the page’s source; the error tells me CF is unable to find the title or anchor text of the link in your case? To better understand the issue and hopefully resolve it, feel free to shoot me an email: my lastname @ nullamatix.com – (reformat accordingly, obviously).

    Blogstorm readers fear not; I or Jason will follow-up with the solution (minus non-relevant details) once we’ve figured out a solution. I just wasn’t sure if Jason was comfortable having an open discussion here (or even via email) regarding his scraping project :P

    -Guy

  • http://websauce.net b sizzle

    thats what I’m talking about. thanks man !

  • antonio

    Great tutorial ! Thank you very much! ;)

  • Misty

    Hi, can you please explain how we can do this if we need to fetch tables?

  • Marc Williams Jr

    This seems like a great script and we are reviewing it now. The question i have is that i need to scrape through more then one page. here is how it works
    -I enter search criteria
    -a page is returned with 25 of a possible 1000 results. I need them all
    -I need to not only go through all of the 10 pages(100 per page) i need to click through each link to get more information
    -The results page has the name and Id of what i need
    -The link to another page has the email address
    -i need all three elements to complete my data mining.

    i noticed this was from 2009 but i am optimistic. thank you.

  • Justin

    Easy to work with and modify.

    Thanks.