Coding - PHP

Using PHP and cURL to Scrape Web Pages

This is part of a continuing series on PHP programming. If you are brand new to PHP and want a more basic tutorial, check out our Introduction to PHP, then come back and complete this one. If you feel like you have the basics down, let’s jump right in. Today, we’re going to use cURL and PHP to scrape a website for data, specifically the list of most often downloaded ebooks at Project Gutenberg.

What is cURL?

I know it looks like a typo from leaving caps lock on, but that’s really how you write it. cURL is the library that PHP uses to make HTTP requests. To explain it with fewer acronyms, it’s basically a way of calling web pages from within your script. cURL can be incredibly powerful if you know how to use it right. We’re going to use it to scrape data from a website and use it for our own (perfectly innocent) purposes.

A Note About Scraping

As I previously mentioned in my WordPress Email Plugins post, with great programming power comes great responsibility. This lesson is going to show you how to scrape a website using cURL. That doesn’t mean that it’s always the best idea to do so. Here is a simple guideline to keep in mind if you are going to scrape a website:

Scrape Data, Not Content
cURL and web scraping are powerful tools that can be used to automate what would otherwise be somewhat soul-crushing repetitive tasks. They are also sometimes used for more nefarious purposes, like copying entire blog posts and articles from one site and placing them on another. Don’t do this. Stick to scraping information, not full articles and content that someone worked hard to write.

And of course, if the website says anything about the privacy of their data or keeps it behind a password or paywall of any kind, it’s probably not the best idea to scrape it. Use your best judgement. That’s all for the lecture; let’s get to the code.

The Basic cURL Code

In a new PHP file (let’s call it curltest.php), enter the following code to initialize and run your first basic cURL project:


function curl_download($Url){
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }
    // Initialize the cURL session
    $ch = curl_init();
    // Set the URL we want to fetch
    curl_setopt($ch, CURLOPT_URL, $Url);
    // Return the page as a string instead of printing it directly
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $output = curl_exec($ch);
    // End the cURL session
    curl_close($ch);
    return $output;
}

// The Top 100 page at Project Gutenberg
print curl_download('http://www.gutenberg.org/browse/scores/top');


There are two options set in this code that you should be aware of. The first one is pretty straightforward: CURLOPT_URL sets the URL we will be scraping. The second, CURLOPT_RETURNTRANSFER, tells cURL to return the scraped page as a string, rather than its default behavior, which is to print the entire page directly to the browser. We’ll need this later on.

Other than that, there are three cURL functions at play here, all of which are named in extremely helpful ways: curl_init() initializes the session, curl_exec() executes, and curl_close() ends the session.

Open this file in a browser, and you will see one of two things. If you see the “is not installed” message, then you or your host will need to install the cURL module for PHP. If you’re hosting your own site, there are instructions on installation here. If you see a somewhat unformatted web page show up with lists of books, we’re in business.
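One more thing worth knowing before we move on: some pages will redirect you rather than serve content directly (you’ll see something like “Object moved” instead of the page). Here is a sketch of a variation on the download function, using the CURLOPT_FOLLOWLOCATION option to follow redirects and curl_error() to report failures. The function name is my own; adapt it as you like.

```php
// Hypothetical variant of curl_download() that follows redirects
// and reports errors instead of silently returning nothing.
function curl_download_safe($Url){
    if (!function_exists('curl_init')){
        die('cURL is not installed. Install and try again.');
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $Url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    // Follow any HTTP redirects instead of returning the "moved" page
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $output = curl_exec($ch);
    if ($output === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, etc.)
        die('cURL error: ' . curl_error($ch));
    }
    curl_close($ch);
    return $output;
}
```

If a site still refuses to cooperate, setting a user agent with CURLOPT_USERAGENT sometimes helps, since some servers reject requests that don’t identify themselves as a browser.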

I Ate the Whole Thing

Generally, though, we don’t want an entire web page. There are a few legitimate reasons you might, but let’s assume you’re looking to scrape a particular section of the page. In this case, we’re looking for that first list of books, the daily-updated list of the 100 most popular ebooks downloaded from Gutenberg. What we have in the $output variable in the code above is basically a string with the entire HTML code of the page as its value. We can use some basic string manipulation to cut it up and make it easier to deal with. Remember that CURLOPT_RETURNTRANSFER option? This is where it comes into play.

Before returning the $output variable, add this code to the function:

  // Find the heading that starts the Top 100 list
  $start = strpos($output, '<h2 id="books-last1">Top 100 EBooks yesterday</h2>');
  // Find the first <p> tag after that heading, which marks the end of the section
  $end = strpos($output, '<p>', $start);
  $length = $end - $start;
  // Keep only the section between those two markers
  $output = substr($output, $start, $length);

Now refresh the file in your browser, and you should see only the list of 100 ebooks that we were looking for.


What’s Next?

Now that you have a list of data, scraped from another website, what can you do with it? That’s really up to you. You could parse apart the list and enter it into a database, logging the top books every day for a year to look for trends. You could create a graph of the number of downloads. The point is, once the data is in your hands rather than out there on the web, you can do whatever you want. Take this powerful tool, and use it wisely. Happy programming.
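As one sketch of that parsing step: assuming the trimmed $output still contains the list items of the Top 100 section, with each book as a link inside an <li> tag, a regular expression can pull the link text into a plain PHP array. (A DOM parser like DOMDocument is more robust against messy HTML; this is just a quick illustration, and the function name is my own.)

```php
// Hypothetical helper: turn the scraped list markup into an array of titles.
// Assumes each book appears as a link inside a list item, e.g.
//   <li><a href="/ebooks/1342">Pride and Prejudice by Jane Austen (1234)</a></li>
function parse_book_list($html){
    $titles = array();
    if (preg_match_all('/<li>\s*<a[^>]*>(.*?)<\/a>/', $html, $matches)) {
        foreach ($matches[1] as $title) {
            // Strip any nested tags and decode HTML entities
            $titles[] = html_entity_decode(strip_tags($title));
        }
    }
    return $titles;
}

// $books = parse_book_list($output);
// print_r($books);
```

From there, each entry is an ordinary string you can insert into a database, count, or chart however you like.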

About the author

Ian Rose is a web developer, blogger, and writer living in Portland, OR. In his day job, he develops WordPress plugins and custom PHP solutions, focusing on nonprofit clients. By night, he attempts to write both fiction and nonfiction. Ian's site is Seaworthy Web Solutions

  1. P G

    Strpos? You’ve got a lot to learn about scraping my friend :)

  2. Mika

    I am doing basically the same thing at the moment. I had to implement 2-phase curl script, because content I really want is behind login. I have been happy with this module when handling the dom, but probably BeautifulSoup4 would have been good also.

  3. gagan

    exactly what i wanted.

  4. Tim

    Works Awesome!

    Quick question though, how do I get the output into an array so that it’s easier to use?
    I want to parse the output into individual variables, but I’m stuck

    Thanks again

  5. Denise

    Desperate to find someone who can write scrapers in PHP….sorry, I know, you’re not a Craigslist…But if you offer any kind of referrals, would appreciate it. My programmer passed away recently. Long term project pays about 1200/month. Denise

  7. Ray

    Glad it works for you all, but all I’m getting is:

    “Object moved to here.” with a link to another page.

    Is there any way around it please?
