How do I make a simple crawler in PHP? [closed]

August 23, 2021 by James Palmer

Meh. Don’t parse HTML with regexes.
Here’s a DOM version inspired by Tatu’s:
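<?php
function crawl_page($url, $depth = 5)
{
    // Remember visited URLs across recursive calls so no page is fetched twice.
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    // Suppress warnings caused by malformed HTML.
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        // Anything that does not start with "http" is treated as a relative link.
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                // Rebuild the absolute URL by hand from the parts of the current URL.
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                // The start URL may have no path; also avoid a doubled slash at the root.
                $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/\\') : '';
                $href .= $dir . $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:", $url, PHP_EOL, "CONTENT:", PHP_EOL, $dom->saveHTML(), PHP_EOL, PHP_EOL;
}
crawl_page("http://hobodave.com", 2);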

Edit: I fixed some bugs from Tatu’s version (works with relative URLs now).
Edit: I added a new bit of functionality that prevents it from following the same URL twice.
Edit: Echoing output to STDOUT now, so you can redirect it to whatever file you want.
Edit: Fixed a bug pointed out by George in his answer. Relative URLs will no longer append to the end of the URL path, but overwrite it. Thanks to George for this. Note that George’s answer doesn’t account for any of: https, user, pass, or port. If you have the http PECL extension loaded, this is quite simply done using http_build_url. Otherwise, I have to glue the URL together manually using parse_url. Thanks again, George.
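
The manual gluing with parse_url described in the last edit can be looked at in isolation. The following is a minimal sketch, not the answer's own code: resolve_href is a hypothetical helper name, and it deliberately ignores ../ segments, protocol-relative links (//host/...), query strings, fragments, and non-HTTP schemes such as mailto:.

```php
<?php
// Hypothetical helper (not from the original answer): resolve $href against $base.
function resolve_href(string $base, string $href): string
{
    // Links that already carry a scheme are returned unchanged.
    if (preg_match('#^[a-z][a-z0-9+.-]*://#i', $href)) {
        return $href;
    }

    // Rebuild scheme://user:pass@host:port from the base URL.
    $parts = parse_url($base);
    $root = $parts['scheme'] . '://';
    if (isset($parts['user'], $parts['pass'])) {
        $root .= $parts['user'] . ':' . $parts['pass'] . '@';
    }
    $root .= $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }

    // Root-relative links replace the whole path.
    if (strpos($href, '/') === 0) {
        return $root . $href;
    }

    // Other links are appended to the directory part of the base path.
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $href;
}
```

Keeping this logic in a pure function makes it easy to check a few URL combinations without touching the network. A production crawler would likely want a full RFC 3986 resolver instead.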

Here is my implementation, based on the above example/answer:

  • It is class-based
  • Uses cURL
  • Supports HTTP authentication
  • Skips URLs that do not belong to the base domain
  • Returns the HTTP response code for each page
  • Returns the response time for each page

CRAWL CLASS:
class crawler
{
    protected $_url;
    protected $_depth;
    protected $_host;
    protected $_useHttpAuth = false;
    protected $_user;
    protected $_pass;
    protected $_seen = array();
    protected $_filter = array();

    public function __construct($url, $depth = 5)
    {
        $this->_url = $url;
        $this->_depth = $depth;
        $parse = parse_url($url);
        $this->_host = $parse['host'];
    }

    protected function _processAnchors($content, $url, $depth)
    {
        // Nothing to parse if the request failed or returned an empty body.
        if ($content === false || $content === '') {
            return;
        }
        $dom = new DOMDocument('1.0');
        // Suppress warnings caused by malformed HTML.
        @$dom->loadHTML($content);
        $anchors = $dom->getElementsByTagName('a');

        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // crawl_page() itself skips links that do not belong to the start domain.
            $this->crawl_page($href, $depth - 1);
        }
    }

    protected function _getContent($url)
    {
        $handle = curl_init($url);
        if ($this->_useHttpAuth) {
            curl_setopt($handle, CURLOPT_HTTPAUTH, CURLAUTH_ANY);
            curl_setopt($handle, CURLOPT_USERPWD, $this->_user . ":" . $this->_pass);
        }
        // Following 302 redirects creates problems with authentication, so it stays disabled.
        // curl_setopt($handle, CURLOPT_FOLLOWLOCATION, TRUE);
        // Return the content instead of printing it.
        curl_setopt($handle, CURLOPT_RETURNTRANSFER, TRUE);

        /* Get the HTML or whatever is linked in $url. */
        $response = curl_exec($handle);
        // Total time of the request, in seconds.
        $time = curl_getinfo($handle, CURLINFO_TOTAL_TIME);
        /* HTTP status code of the response (e.g. 200, 404). */
        $httpCode = curl_getinfo($handle, CURLINFO_HTTP_CODE);

        curl_close($handle);
        return array($response, $httpCode, $time);
    }

    protected function _printResult($url, $depth, $httpcode, $time)
    {
        // Flush any active output buffer so results appear as pages are crawled.
        if (ob_get_level() > 0) {
            ob_end_flush();
        }
        $currentDepth = $this->_depth - $depth;
        $count = count($this->_seen);
        echo "N::$count,CODE::$httpcode,TIME::$time,DEPTH::$currentDepth URL::$url", PHP_EOL;
        ob_start();
        flush();
    }

    protected function isValid($url, $depth)
    {
        if (strpos($url, $this->_host) === false
            || $depth === 0
            || isset($this->_seen[$url])
        ) {
            return false;
        }
        foreach ($this->_filter as $excludePath) {
            if (strpos($url, $excludePath) !== false) {
                return false;
            }
        }
        return true;
    }

    public function crawl_page($url, $depth)
    {
        if (!$this->isValid($url, $depth)) {
            return;
        }
        // Mark the URL as seen.
        $this->_seen[$url] = true;
        // Get the content, the response code, and the response time.
        list($content, $httpcode, $time) = $this->_getContent($url);
        // Print the result for the current page.
        $this->_printResult($url, $depth, $httpcode, $time);
        // Process the links found on the page.
        $this->_processAnchors($content, $url, $depth);
    }

    public function setHttpAuth($user, $pass)
    {
        $this->_useHttpAuth = true;
        $this->_user = $user;
        $this->_pass = $pass;
    }

    public function addFilterPath($path)
    {
        $this->_filter[] = $path;
    }

    public function run()
    {
        $this->crawl_page($this->_url, $this->_depth);
    }
}

USAGE:
$startURL = 'http://YOUR_URL/';
$depth = 6;
$username = 'YOURUSER';
$password = 'YOURPASS';
$crawler = new crawler($startURL, $depth);
$crawler->setHttpAuth($username, $password);
// Skip any URL that contains this path fragment
$crawler->addFilterPath('customer/account/login/referer');
$crawler->run();
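
One thing to keep in mind about the domain check: isValid uses strpos($url, $this->_host), which also matches when the host name appears somewhere else in the URL, for example in a query string. A stricter comparison might look like the sketch below. Note that is_same_host is a hypothetical helper, not part of the class above, and it assumes exact host equality is what you want (subdomains would be treated as foreign).

```php
<?php
// Hypothetical helper: true only when the host part of $url equals $host
// (compared case-insensitively). Relative or malformed URLs yield false.
function is_same_host(string $url, string $host): bool
{
    // parse_url() returns null when the URL has no host and false when it is malformed.
    $urlHost = parse_url($url, PHP_URL_HOST);
    return is_string($urlHost) && strcasecmp($urlHost, $host) === 0;
}

var_dump(is_same_host('http://example.com/page', 'example.com'));
var_dump(is_same_host('http://evil.test/?ref=example.com', 'example.com'));
```

The first call prints bool(true) and the second bool(false), whereas a plain strpos check would accept both URLs.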
