172 and one-half ways Google Form Spidering Is Kewl
Can We AJAX In Some CAPTCHAs?
The one-year-old decided to go five rounds with the big pine toy box Saturday night, and after the cut man emergency room doc assured us the damage was not permanent, the family settled in for the final chocolate milks (Nesquik), Dr. Pepper, and whine wine for the evening.
Then I noticed my email in-box. My rather full in-box. I wasn’t surprised to get email from the addresses I saw - just generally never on the weekend. And they all had the same strange combination of words in them:
Google, Spider, Forms, Help, Explain!!!!!!
One even asked whether the best way to stop Google was to AJAX in a CAPTCHA. At first, I thought this was about some brand new Google spider.
I’ve been to one world fair, a picnic, and a rodeo, and that’s the stupidest thing I ever heard come over a set of earphones.
I’m not sure why, but every time I imagine SEO discussions around Google I see scenes from Dr. Strangelove. Some guy rolling around stiffly saying “it is not only possible, Mr. President. It is essential to instill the fear of cloaked content into the enemy!” I always feel just a bit disconnected when the keyword sprinkling and meta tag weighting and link baiting conversations begin. (And for the record, SEO + Yahoo conversations take me to the same movie - the air base scenes with the general raving about precious bodily fluids. I’m a water man myself).
The good folks over at Read/Write Web were on top of it, of course. Seems that Googlebot was now crawling and submitting certain types of web forms. And of course, ways to block or otherwise defeat this were popping up all over.
Dammit Jim, I’m a Programmer, not a miracle worker!
So I started responding to a few of the emails I’d gotten on this. As a champion of CSS-based methods of defeating form spam (modified from the SANS approach) rather than CAPTCHAs, I found the first questions straightforward: Googlebot was no smarter than any other bot, so it would submit the trigger form elements and look like any other bot. It would then be treated to the spammer-targeted form response rather than the real form response. However, since it was only submitting forms where the method was GET, this shouldn’t affect too much. I suggested changing the “spammer” form result to the HTML site map.
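For the curious, here is a rough Python sketch of the trap-field idea, not the actual code behind any site mentioned here. The field name `website_url`, the two function names, and the page names are all made up for illustration; the only assumption carried over from above is that a CSS-hidden field stays empty for humans and gets filled in by bots.

```python
from urllib.parse import parse_qs

# Hypothetical name of a form field that CSS hides from human visitors
# (e.g. display: none). Humans leave it empty; naive bots fill in everything.
TRAP_FIELD = "website_url"


def is_bot_submission(query_string: str) -> bool:
    """Return True when the hidden trap field arrived with a non-blank value."""
    fields = parse_qs(query_string, keep_blank_values=True)
    return any(value.strip() for value in fields.get(TRAP_FIELD, []))


def choose_response(query_string: str) -> str:
    """Pick which (hypothetical) page to serve for a GET form submission."""
    if is_bot_submission(query_string):
        # Bots, Googlebot included, get the HTML site map instead of a dead end.
        return "sitemap.html"
    return "form-result.html"
```

A spider that lands on the site map at least finds real links to follow, which is the whole point of swapping out the “spammer” response.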
A couple of other questions, however, were trickier. It’s common to use GET forms for AJAX interactions. Whenever possible I want my AJAX to degrade and remain functional for older browsers - which means that in many cases Googlebot should actually be able to successfully get through one or more AJAX interactions.
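A minimal sketch of what “degrade and remain functional” can look like on the server side, again in Python and again hypothetical: the `search` stand-in, its sample data, and `handle_search` are invented for this example. It assumes the script-driven request announces itself with an `X-Requested-With: XMLHttpRequest` header, which many JavaScript libraries send but which is a convention, not a guarantee.

```python
import html
import json
from urllib.parse import parse_qs


def search(term: str) -> list[str]:
    """Stand-in for a real lookup; hypothetical data for illustration only."""
    catalog = ["red shoes", "blue shoes", "red hat"]
    return [item for item in catalog if term.lower() in item]


def handle_search(query_string: str, headers: dict[str, str]) -> tuple[str, str]:
    """Return (content_type, body) for a GET search request."""
    term = parse_qs(query_string).get("q", [""])[0]
    results = search(term)
    # Treating this header as the AJAX signal is an assumption of this sketch.
    if headers.get("X-Requested-With") == "XMLHttpRequest":
        return "application/json", json.dumps(results)
    # No script (older browsers, Googlebot): the same URL returns a full page.
    items = "".join(f"<li>{html.escape(r)}</li>" for r in results)
    return "text/html", f"<html><body><ul>{items}</ul></body></html>"
```

Because the fallback lives at the same GET URL, a bot that submits the form without running any script still gets a complete, indexable page.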
I hit a few sites that friends wanted me to check. Looked under the hood. Ugh. I’ve been accused of being an HTML snob (actually I am an OO-PHP/CSS/JS Snob; the HTML stuff is just an affectation), but I can ignore a lot - almost as much as IE6 ignores. For most web sites, the nastiest, least-compliant, and sludgiest pages are the ones containing forms. Bizarre, double-borrowed old JavaScript form validation; ancient on-click image form submits. Shopping carts tying JS functions to <FONT> tags!! It’s a cruel world out there. Most of the forms were POSTs - even some of the site search forms - and would be ignored. On a lot of the others Googlebot would simply gag.
Then I started chatting with some folks about how to harness this new Googlebot functionality. One professional contact asked me “how many ways can you think of using this?” That’s when it hit me. This wasn’t just Google trying to dig deeper for web links.
Google’s Economic Stimulus Package For SEO Professionals
I have spent serious zen time converting BOLD-LARGE-SPAN tags to proper H1, H2, and H3 tags during site cleanups and redesigns. I have seen the ugliness that lies behind many a web site. I still run into plenty of web sites with no meta tags, the same title tag for every page, and improper basic HTML markup.
SEO folks have battled for years trying to convince site owners that just because a web page LOOKS good to you doesn’t mean it makes semantic sense to a search engine. More than once I’ve sat down with Lynx or an SEO browser showing customers what programming and code changes I would need to make to accomplish their SEO directions.
Now, we have to do it all over again for forms. And now, we not only have keywords and title tags and alt text. Now we have weighted responses, default responses, and targeted results to build into forms submitted by indexing spiders. And this requires more than just fixing tags. Responding to this opportunity requires real programming. It will be a new boom in SEO - probably even its own speciality!
Search Engine Formization? Search Spider Response Analytics? Conversational Search Submission? Where are my buzzword bingo cards when I need them????
Think of the dozens of new strategies that can be developed!
- Default search results based on URL and keywords.
- Modified default results when AJAX-targeted forms are followed by non-js enabled bots.
- Secret hidden form tags for triggering targeted search results.
- White Papers. New Code. Unproven Techniques. Really Good Guesses on the Customer’s Dime.
- Google Form Spidering Is Kewl
Uncork the pixie dust, we’ve got marketing letters to write!!
In All Seriousness
This is another warning for all those site owners trying to stretch out one more quarter with their current web site. Spiders, search engines, semantic web tools, and web services are all rapidly progressing. For those sites stuck in HTML 4 land, with nested table structures and creaky inaccessible form structures, the Internet is becoming a harsher place.
Before throwing up a CAPTCHA or a new robots file, perhaps you should take a look at your entire infrastructure and digital landscape. Block Google form spidering with robots.txt if you must, but if you haven’t taken a look at the big picture, now would make an excellent time.
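If you do go the robots.txt route, it is worth checking that the rule blocks what you think it blocks. The sketch below uses Python’s standard-library `urllib.robotparser` to test a hypothetical rule set; the `/search` path, the `www.example.com` URLs, and the `can_crawl` helper are assumptions for illustration, so substitute whatever URL your own GET forms submit to.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep all crawlers out of the GET search form's
# result URLs while leaving the rest of the site crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Check a URL against the rules above using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, `can_crawl("Googlebot", "http://www.example.com/search?q=shoes")` comes back false while an ordinary page such as `/about.html` stays crawlable. This only tells you how a well-behaved parser reads the file; it says nothing about bots that ignore robots.txt.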
About this entry
You’re currently reading “172 and one-half ways Google Form Spidering Is Kewl,” an entry on PhilSpace
- Published: 4.13.08 / 10am
- Category: box-stomping