WebPagetest Forums
honor robots.txt - Printable Version

+- WebPagetest Forums (https://www.webpagetest.org/forums)
+-- Forum: WebPagetest (/forumdisplay.php?fid=7)
+--- Forum: Feature Suggestions (/forumdisplay.php?fid=9)
+--- Thread: honor robots.txt (/showthread.php?tid=584)



honor robots.txt - jared - 03-15-2011 01:56 PM

Hey guys,

I'm working on httparchive, and one of the open bugs is to make it honor robots.txt, which is really an upstream bug in wpt.

I'm not familiar with the wpt codebase, but I'd be happy to try to contribute if altering the spidering call to respect robots.txt seems relatively straightforward to someone who knows the code. (I'm guessing the actual spidering is executed via a PHP cURL extension call?)

If supporting robots.txt would be tough, I'll just handle the spidering step in httparchive for now.
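
For reference, here's roughly the kind of check I had in mind, as a rough, untested sketch. The PHP is mine, not anything from wpt or httparchive, and isAllowedByRobots() is just a name I made up:

Code:
<?php
// Rough sketch only: check robots.txt before fetching a URL with the
// PHP cURL extension. isAllowedByRobots() is a made-up name, not part
// of wpt or httparchive.
function isAllowedByRobots($url, $userAgent = 'httparchive') {
    $parts = parse_url($url);
    $robotsUrl = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';

    $ch = curl_init($robotsUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // No robots.txt (or the fetch failed): treat the URL as allowed.
    if ($body === false || $status != 200)
        return true;

    $path = isset($parts['path']) ? $parts['path'] : '/';
    // Minimal parse: apply Disallow rules for our agent and for '*'.
    $applies = false;
    foreach (preg_split('/\r?\n/', $body) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));  // strip comments
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $agent = trim($m[1]);
            $applies = ($agent == '*' ||
                        stripos($userAgent, $agent) !== false);
        } elseif ($applies &&
                  preg_match('/^Disallow:\s*(\S+)/i', $line, $m)) {
            if (strpos($path, $m[1]) === 0)
                return false;  // URL path falls under a Disallow rule
        }
    }
    return true;
}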

Thanks!

Jared Hirsch


RE: honor robots.txt - pmeenan - 03-15-2011 11:03 PM

Sorry, I'm a little confused, because wpt doesn't spider. It only knows how to load individual pages as they are requested. I wouldn't want wpt to read robots.txt as if it were a bot, because that would make a lot of pages untestable. I'd expect you'd want to put the robots.txt logic wherever the spidering is being done.

If you're talking about the project I think you are, last time I checked it didn't spider either; it worked off of a list of pages from various "top X" lists. If spidering is a new capability being added, then that's probably where the logic belongs (though there are things wpt can do to help with just a little work, for example dumping a list of links as part of the data returned about a page).
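
To make that concrete, here's the rough shape of what I mean on the httparchive side, again just an untested sketch. extractLinks() and the queue loop are illustrative, not a wpt API, and it reuses the isAllowedByRobots() check sketched earlier in the thread:

Code:
<?php
// Untested sketch of spidering on the httparchive side: start from a
// seed list, fetch each page with cURL, and pull out links with
// DOMDocument. None of this is wpt code.
function extractLinks($html) {
    $links = array();
    $doc = new DOMDocument();
    @$doc->loadHTML($html);  // suppress warnings from messy real-world HTML
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (preg_match('#^https?://#i', $href))  // absolute links only
            $links[] = $href;
    }
    return array_unique($links);
}

$queue = array('http://www.example.com/');  // seeded from a "top X" list
$seen = array();
while ($url = array_shift($queue)) {
    if (isset($seen[$url]) || !isAllowedByRobots($url))
        continue;  // skip already-visited pages and robots.txt exclusions
    $seen[$url] = true;

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    if ($html !== false)
        $queue = array_merge($queue, extractLinks($html));
}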

Thanks,

-Pat


RE: honor robots.txt - jared - 03-16-2011 05:44 AM

Hey Pat,

No worries, I'm the one who's confused: I thought WPT was traversing the sites on the lists. I can definitely do the spidering in httparchive.

I'll follow up with Steve. Thanks!

Jared