webmastering rantish time - by din on 12:22 15 Apr 2003
Couldn't think of a better title ..
as some of you know, i have a history website. part of this site is built around a program i wrote that downloads articles from http://wikipedia.org/ . some of the articles are a little dodgy, but they are free and 'open content.' For the most part i do not pull content from other sites. in this case i do, but only on demand, and i write each article to my server, so the next person wanting the same article will not be pulling it from wikipedia, but from me.
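A rough sketch of the fetch-on-demand idea described above, not the actual program. The function name `get_article`, the `fetch` callback and the file naming scheme are all made up for illustration; the real program may store things quite differently.

```python
from pathlib import Path
from typing import Callable


def get_article(title: str, cache_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return the article text, going upstream only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: replace anything unusual with "_".
    # Two different titles could collide here; a real program needs better.
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in title)
    path = cache_dir / f"{safe}.html"
    if path.exists():
        # Cache hit: serve the local copy, no request to the source site.
        return path.read_text(encoding="utf-8")
    # Cache miss: fetch once, then keep the copy for the next visitor.
    text = fetch(title)
    path.write_text(text, encoding="utf-8")
    return text
```

The second request for the same title never reaches `fetch`, which is the whole point of writing the articles to the server.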
there lies the problem ..
over the last two days somebody ran a spidering .. something or other.. through my site. now most spiders obey the rules laid out in the meta tags nicely, but this one did not, and in the course of a 24 hour+ period it spidered every page in my site and pulled nearly 10,000 articles from wikipedia.
it did this because each article links to supporting articles, forming a high level of interconnectedness. my program rewrites these links as i get the articles, so that as a person surfs them and follows the links they download these new articles as well.
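The link rewriting could look something like this. It is only a sketch under assumptions: that article links in the source HTML look like `href="/wiki/Title"`, and that the local site serves articles at a made-up `/article?title=` address. Neither detail comes from the actual program.

```python
import re
from urllib.parse import quote, unquote

# Assumed shape of an internal article link in the fetched HTML.
WIKI_LINK = re.compile(r'href="/wiki/([^"#?]+)"')


def rewrite_links(html: str, local_prefix: str = "/article?title=") -> str:
    """Point internal article links at the local site instead of the source."""

    def repl(match: re.Match) -> str:
        title = unquote(match.group(1))
        encoded = quote(title, safe="")
        return 'href="' + local_prefix + encoded + '"'

    # Links that do not match the pattern (external sites etc.) are left alone.
    return WIKI_LINK.sub(repl, html)
```

Every rewritten link is a fresh download waiting to happen, which is why a spider that follows all of them ends up pulling thousands of articles.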
I had an article limit set up, which worked with normal traffic, but this thing was requesting pages so quickly (again, i must note, in a place where non-human surfing should not be taking place) that the counter system could not function correctly.
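The post does not say exactly why the counter broke. One common cause is that many requests read the old count at the same moment before any of them writes the new one. Assuming that is what happened, a sketch of a limit that survives rapid requests: reserve a slot under a lock before fetching. `FetchBudget` is a hypothetical name, and a program that runs as separate processes would need a file lock or a database instead of `threading.Lock`.

```python
import threading


class FetchBudget:
    """Caps how many new articles may be fetched, safely under concurrency."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._used = 0
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        """Claim one fetch slot; return False once the limit is reached."""
        # Check and increment happen together, so two simultaneous
        # requests can never both take the last slot.
        with self._lock:
            if self._used >= self.limit:
                return False
            self._used += 1
            return True

    @property
    def used(self) -> int:
        with self._lock:
            return self._used
```

A request handler would call `try_reserve()` before going upstream and serve an "article limit reached" page when it returns False.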
luckily, when i wrote the program i put in a sort of manual override, so i can turn off the part that gets new articles while still keeping that section of the site alive. I also have a decent idea of a couple of ways to stop this sort of thing, both in the program itself and through .htaccess.
The actual 'damage' was impressive, and it is interesting to see my simple program working well at a slightly larger scale, even if it is to my detriment. On average i transfer 87 MB of data a day (going by the first week of april). That changed to about 580 MB a day for the two days involved, and the cache became quite large, maybe around 200 MB.
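A manual override like that can be very small. This sketch assumes the switch is a flag file whose presence disables new fetches; the names `fetching_enabled` and `serve`, and the dict standing in for the cache, are invented for the example.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def fetching_enabled(flag_file: Path) -> bool:
    """New fetches are allowed unless the (hypothetical) flag file exists."""
    return not flag_file.exists()


def serve(
    title: str,
    cache: Dict[str, str],
    fetch: Callable[[str], str],
    flag_file: Path,
) -> Optional[str]:
    """Serve cached articles always; fetch new ones only if the switch is on."""
    if title in cache:
        return cache[title]
    if not fetching_enabled(flag_file):
        # Override active: this section stays alive, but nothing new is pulled.
        return None
    cache[title] = fetch(title)
    return cache[title]
```

Creating the flag file turns fetching off without touching the code, and deleting it turns fetching back on.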
so that was my sort of ranting .. the bigger the site becomes the more of this junk traffic i'm getting and the more attention has to be paid to internal server issues like banning IPs and all kinds of technical goop ..
webmastering rantish time - by Arislyn on 12:30 15 Apr 2003
I'm sorry that you're having fits over there, din. However, I'm glad you posted that. There may be other people out there who run into similar problems. I never even thought about anything like that happening...