honor robots.txt
03-15-2011, 01:56 PM
Post: #1
honor robots.txt
Hey guys,

I'm working on httparchive, and one of the current bugs is to make it honor robots.txt--which is really an upstream bug in wpt.

I'm not familiar with the wpt codebase, but I'd be happy to try to contribute if altering the spidering call to respect robots.txt seems relatively straightforward to someone who knows the code. (I'm guessing the actual spidering is executed via a php curl extension call?)
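
Something along these lines is what I have in mind (just an illustrative sketch, not actual wpt or httparchive code; it uses the php curl extension and only handles the "User-agent: *" group and plain Disallow path prefixes):

Code:
<?php
// Rough sketch only: fetch a site's robots.txt with the curl extension
// and check whether a URL's path is disallowed for our user agent.
function isAllowedByRobots($url, $userAgent = 'httparchive') {
    $parts = parse_url($url);
    $robotsUrl = $parts['scheme'] . '://' . $parts['host'] . '/robots.txt';

    $ch = curl_init($robotsUrl);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // No robots.txt (or fetch failed): assume the page may be crawled.
    if ($body === false || $status != 200)
        return true;

    $path = isset($parts['path']) ? $parts['path'] : '/';
    $applies = false;
    foreach (preg_split("/\r\n|\r|\n/", $body) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // strip comments
        if ($line === '')
            continue;
        if (preg_match('/^User-agent:\s*(.+)$/i', $line, $m)) {
            $agent = trim($m[1]);
            $applies = ($agent === '*' || stripos($userAgent, $agent) !== false);
        } else if ($applies && preg_match('/^Disallow:\s*(.*)$/i', $line, $m)) {
            $rule = trim($m[1]);
            if ($rule !== '' && strpos($path, $rule) === 0)
                return false;       // path is under a disallowed prefix
        }
    }
    return true;
}

// Example: skip URLs the site has asked crawlers not to fetch.
if (!isAllowedByRobots('http://www.example.com/some/page.html')) {
    // leave this URL out of the spidering step
}
?>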

If supporting robots.txt would be tough, I'll just handle the spidering step in httparchive for now.

Thanks!

Jared Hirsch
03-15-2011, 11:03 PM
Post: #2
RE: honor robots.txt
Sorry, I'm a little confused because wpt doesn't spider. It only knows how to load individual pages as they are requested. I wouldn't want to have wpt read robots.txt as if it were a bot because a lot of pages would be untestable. I'd expect you'd want to put the robots.txt logic wherever the spidering is being done.

If you're talking about the project I think you are, last time I checked it didn't spider either, it worked off of a list of pages from various "top X" lists. If the spidering is a new capability being added then that's probably where the logic belongs (though there are things wpt can do to help with just a little work - for example, dumping a list of links as part of the data returned about a page).

Thanks,

-Pat
03-16-2011, 05:44 AM
Post: #3
RE: honor robots.txt
Hey Pat,

No worries, I'm the one who's confused--I thought WPT was traversing the sites on the lists. I can definitely do the spidering in httparchive instead.

I'll follow up w/Steve. Thanks!

Jared
