Skrentablog

blekko is hiring

blekko is building a new search engine from scratch and I'm looking to hire a few more coders.

Search is an absolutely fascinating problem to work on for a bunch of reasons. For one thing you have to scale the thing before getting the first user. You can't just start with a server or two and add more when the users come. Step 1 is to copy the internet onto your cluster. Step 2 is to analyze it...

The componentry is remarkably deep. Search is like 7 hard problems wrapped into a stack. Distributed systems, html analytics, text analytics/semantics, anti-spam, AI/ML, frontend/UI. And scale... Apart from the sexy high end algos there are also the boring 10-year old system libraries and off-the-shelf tools that crack under stress and sometimes need a look. You open the hood and wonder how the thing ever worked in the first place...

Plus there is always something fresh and new every day mining through the vast sordidness of the many billions of pages on the web. You expect to be amazed at the endless varieties of crazy porn domains and new approaches to webspam. But there are equal horrors in the small: finding pathological charset issues, previously-undiscovered abominable server implementations, psychopathic website owners. The web is a reactive fuzz test.

I know there are some great coders out there reading this blog who would have a blast working on some of the pieces here that need to get built. This is a great opportunity to join an experienced team early, building a big system from the ground up. If you think you might be interested, send me an email and we can chat.

fyi our interviews always have coding tests. Primarily we are looking for folks who love to write code and are good at it. :)

Posted on May 1, 2008 12:11 PM

How Fake Luxury Conquered the World

The legend says that once upon a time there was a General Motors.
This General Motors, GM for short, had a car and a brand for every need, along the plan developed by the great Alfred Sloan prior to the Second World War. There were Chevrolets for regular folk, Pontiacs for the cautious old people (and, thanks to John Z. Delorean's development of the 1964 GTO, for angry young people as well), Buicks and Oldsmobiles for doctors and successful businessmen, and Cadillacs at the very top, for the most successful men in the land. ... It would have stayed that way forever, but one day a mysterious yet important man at GM had a mysterious yet important idea: Executives should drive cars from their own division!

Which leads to every division of GM building their own version of the Cadillac.

Read more: How Fake Luxury Conquered The World

(thanks Bryn for the tip)

Posted on May 1, 2008 11:19 AM

Microsoft bias in MSN search results, surprise

I was looking to see what search sites might have a particular bug that I (ahem) came across and was trying the search for the number 0 in various places. There is a pretty good Wikipedia page about zero. Zero has a rich and interesting history and there are many other potentially reasonable results.

But I was surprised to see MSN search had demoted their good results below some crappy ones from MSDN:

Lame! Falling into an inferior lex position and a lower overall relevance page to boost their own network results...give em credit for being old school. :)

...

I found my bug on Yahoo Search. I had tried a lot of smaller engines first because I didn't think a major would have this bug. You can't search for 0 on Yahoo. You can search for all the other numbers, but not 0...

Why?

Because 0 is false. It suggests Yahoo is using a scripting language to front their search form, and a programmer did something like if ( $query ) rather than if ( $query ne '' ).
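To make the suspected bug concrete, here is a small sketch. It is written in Python, where the string "0" is actually truthy, so the Perl-style rule is emulated by a made-up helper (perl_truthy). The function names are hypothetical and say nothing about how Yahoo's frontend is really written; this only illustrates the guess above.

```python
def perl_truthy(value: str) -> bool:
    """Emulate Perl's truthiness for strings: '' and '0' are false."""
    return value not in ("", "0")


def buggy_has_query(query: str) -> bool:
    # Mirrors `if ( $query )`: the string "0" is treated as "no query".
    return perl_truthy(query)


def fixed_has_query(query: str) -> bool:
    # Mirrors `if ( $query ne '' )`: only the empty string counts as missing.
    return query != ""


if __name__ == "__main__":
    # Show how each check classifies a few sample queries.
    for q in ["", "0", "1", "zero"]:
        print(repr(q), buggy_has_query(q), fixed_has_query(q))
```

The only input where the two checks disagree is "0", which is exactly the one query that fails.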
Posted on April 24, 2008 7:45 AM

Hypertable architecture talk Wednesday in Palo Alto

Doug Judd will be discussing the internals and architecture of Hypertable tomorrow in Palo Alto at 6:30pm.

Hypertable is an open source, high performance, distributed database modeled after Google's Bigtable. It differs from traditional relational database technology in that the emphasis is on scalability as opposed to transaction support and table joining. Tables in Hypertable are sorted by a single primary key. However, tables can smoothly and cost-effectively scale to petabytes in size by leveraging a large cluster of commodity hardware. Hypertable is designed to run on top of an existing distributed file system such as the Hadoop DFS, GlusterFS, or the Kosmos File System (KFS). One of the top design objectives for this project has been optimum performance. To that end, the system is written almost entirely in C++, which differentiates it from other Bigtable-like efforts, such as HBase. We expect Hypertable to replace MySQL for much of Web 2.0 backend technology. In this presentation, Doug will give an architectural overview of Hypertable. He will describe some of the key design decisions and will highlight some of the places where Hypertable diverges from the system described in the Bigtable paper.

More details.

Posted on April 22, 2008 12:51 PM

Starbucks "re" branding

It will be interesting to see how the return of the original Starbucks founder Howard Schultz and the return to their original plan and ideas turns out. He's had a successful stunt with the system closing for 3 hours to retrain workers in how to make coffee, which generated a lot of PR. Now the introduction of the new house blend, named after the original Starbucks store. But also, surprise! - the original logo is back. Usually logos and identities get vaguer, cleaner and more abstract as the MBAs wash/rinse/repeat.
Starbucks is going back to the gritty and vaguely obscene logo they launched with.

Deadprogrammer famously detailed the history of the Starbucks logo going back to a 15th century woodcut. The original logo was slightly sanitized, but each corporate revision made it more and more abstract and less recognizable as to what it actually was. My wife said "I had no idea there was even anything inside that circle, I had never looked until you pointed it out to me."

Face logos are great brands but they always seem to get watered down and more cartoony over time. This is the case with a lot of the face logos on food at the grocery store; the original versions were closer to actual faces rather than abstract logos (think Chef Boyardee here.) This happened to KFC with the colonel...he started out as a realistic line drawing of Colonel Sanders with the company name - "Kentucky Fried Chicken." After the waves of rebranding stylists were done with him he was an abstract cartoon. They couldn't stop there and abbreviated the company name. You wouldn't want to realize you're eating FRIED CHICKEN when you're at KFC after all. You probably want to be eating a healthy salad with dressing on the side. That's why you went in there, right??

I bet Dunkin' Donuts wishes they could rename themselves "DD". Hmmm, maybe "empty vessel" names aren't so bad after all... :)

Interesting to think about brand identities that get going because they're a little gritty and different and personal. They don't start out whitewashed / washed out, but after getting successful they put on the bland suit. What would the AOL redesigners do to Drudge's site if they bought it?

Posted on April 22, 2008 8:58 AM

Microsoft "hits back" at Google with re-launch of 4-year old Newsbot

The memecrowd sure has a short memory...
maybe I'm just showing my age here, but still.

CNET: Microsoft hits back at Google with Live Search News
Search Engine Land: Microsoft Launches Live Search News
Search Engine Watch: Windows Live Search Offers Google News Alternative

MSN Newsbot? Anyone? From 2004:

CNET: Google News faces Microsoft rival (Jul 27, 2004)
Wash Post: Microsoft Deploys Newsbot To Track Down Headlines (Aug 1, 2004)
Geeking with Greg: MSN Newsbot review (Jul 27, 2004)

Posted on April 16, 2008 10:33 AM

Web robot names considered, and rejected

Google's is "Googlebot"
Yahoo's is "Slurp"
Cuill's is "Twiceler"

It makes sense to have a friendly robot user agent, so nervous webmasters won't ban it. You don't want to call your crawler 'sitejacker' or something.. Unfortunately my favorite candidates were:

Crawlhammer
Webraker
Lurchy
Client9

hmmm. :-|

"Oh no! It's CrawlHammer!!"

If even in your heart you hide the urls ... there it shall rake for them...

...

Does anyone know what the purpose of a '+' in front of a url in the robot's user-agent is? Some sites put in the '+', others don't...

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; Ask Jeeves/Teoma; +http://about.ask.com/en/docs/about/webmasters.shtml)
Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Mozilla/5.0 (Twiceler-0.9 http://www.cuill.com/twiceler/robot.html)
Gigabot/3.0 (http://www.gigablast.com/spider.html)

Posted on April 16, 2008 9:29 AM

Cluster map propagation in Amazon Dynamo

Dynamo is Amazon's scalable key/value storage service. The paper is a good read, but I found the way the cluster node list information was propagated in Dynamo to be a little odd.

The algorithm is that every 60 seconds a node will talk to another node in the cluster, chosen at random, and exchange update information.
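The exchange described above is easy to play with in a few lines. What follows is only a sketch under loud assumptions: all nodes act in synchronized rounds (one round per 60-second tick), each node picks one random peer per round, and an exchange is two-way, so either side can pass the change to the other. None of these details are taken from the paper, and the function name gossip_rounds is made up for illustration.

```python
import random


def gossip_rounds(n: int, seed: int = 0) -> int:
    """Count synchronized rounds until one change reaches all n nodes.

    Assumption: every node contacts one random other node per round and
    the pair swaps state, so the change can travel in either direction.
    """
    rng = random.Random(seed)
    informed = [False] * n
    informed[0] = True  # the change starts at a single node
    rounds = 0
    while not all(informed):
        nxt = informed[:]  # decide from a snapshot so the round is synchronous
        for node in range(n):
            # Pick a uniformly random peer other than the node itself.
            peer = rng.randrange(n - 1)
            if peer >= node:
                peer += 1
            if informed[node] or informed[peer]:
                nxt[node] = True
                nxt[peer] = True
        informed = nxt
        rounds += 1
    return rounds


if __name__ == "__main__":
    for size in (50, 500, 5000):
        print(size, "nodes:", gossip_rounds(size), "rounds")
```

Under these assumptions the round count grows roughly with the logarithm of the cluster size. A one-way (push only) exchange, or nodes whose timers are not aligned, would give different numbers.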
I wondered how fast a change would propagate through the cluster, so I simulated the propagation.

For a 5,000 node cluster it takes about 9 update cycles for a change to reach every other node. Since each update is on a 60 second timer, that's 9 minutes for a change to push out.

I didn't do a very sophisticated time model..plus there is random start and all that. So maybe in practice it's a little different. But 9 minutes seems like a long time to propagate a host change out to the rest of the cluster.

Maybe I mis-interpreted what they're doing?

I recall some confusion about whether Dynamo was actually providing SimpleDB, or if they were two separate software systems. Does anyone know if this was resolved?

Posted on April 14, 2008 10:34 AM

AppEngine - Web Hypercard, finally

Google's AppEngine is being compared to Amazon's EC2/S3. But Google deserves credit here for coming up with a pretty differently-positioned product. There may be overlap for many users of course, but it's really operating at a whole different level of the stack.

Folks that want/need more control over the environment, ability to manually manage their own machine instances, run code other than python, etc. will stay with EC2. EC2 is a step above RackSpace.

But rather than thinking of AppEngine as a step above EC2, instead I think of it somewhere around Myspace. Or "Ning 1.0", as Zoho points out.

In the beginning was GeoCities... No, even further back, in the beginning was Hypercard. Hypercard was a pre-web application for Macs that let you design a "stack" of pages - a website on a floppy, really. Popular stacks got traded far and wide. Hypercard stacks existed for every imaginable purpose - "Time Table of History", games, crossword puzzles, the Bible, etc.

The thing about Hypercard was that it wasn't just static text and images like base html.
It had a scripting language, a database, and the Apple UI built-in, so you could create mini applications.

It feels like the web has been trying to claw its way back to the simple utility of Hypercard ever since Mosaic. GeoCities was the first massive-uptake anyone-can-build-here website haven. But it was all static html.

Sure, you can paste javascript widgets onto your page, and have content driven by external sites. But to make the website a first-class object - on functional parity with a "real" website - it needs to be backed by a database and programmability. But setting up mysql, renting machine space, configuring linux, programming all the boilerplate, not to mention the scalability issues if your site gets popular -- this is all a big hurdle.

So to hide all those details behind a platform that's easy to get started with, and lower the bar to entry to writing public application websites... Well that's a big deal. Hats off to Google for bringing this to market.

I'm not alone...somewhat similar thoughts from Nate Westheimer...

Posted on April 9, 2008 12:10 PM

Cuill is banned on 10,000 sites

Be careful while you debug your crawler...

Webmasters these days get very touchy about letting new spiders walk all over their sites. There are so many scraper bots, email harvesters, exploit probers, students running Nutch on gigabit university pipes, and other ill-behaved new search bots that some site owners nervously huddle in forum bunkers anxiously scanning their logs for suspect new visitors, so they can quickly issue bot and ip bans.

Cuill, the search startup from ex-Googlers anticipated to launch soon, seems to have run a rather high rate crawl when they were getting started that generated a large number of robots.txt bans.
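For anyone who hasn't written one, a robots.txt ban of this kind is just a couple of lines naming the crawler's user-agent token. Here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs and the helper name allowed are all made up for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one crawler outright, and keep
# everyone else out of /private/ only.
ROBOTS_TXT = """\
User-agent: Twiceler
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent: str, url: str) -> bool:
    """Return True if this robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Twiceler", "Googlebot"):
        for path in ("/index.html", "/private/notes.html"):
            url = "http://example.com" + path
            print(agent, path, allowed(agent, url))
```

A polite crawler runs a check like this before every fetch; the ban only works because the crawler chooses to honor it.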
Here is a list of sites which have banned Cuill's user-agent "Twiceler".

A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite' - don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc.

Apart from the widely-recognized challenges to building a new search engine, sites like del.icio.us and compete.com that ban all new robots aside from the big 4 (Google, Yahoo, MSN and Ask) make it that much harder for a new entrant to gain a footing. However the web is so bloody vast that even tens of thousands of site bans are unlikely to make a significant impact in the aggregate perceived quality of a major new engine.

My initial take was that this had to be annoying for Cuill. As a crawler author, I can attest that getting each new site rejection personally hurts. :) But now I'm not so sure. Looking over the list, aside from a few major sites like Yelp, you could argue that getting all the forum SEOs to robots-exclude your new engine might actually help improve your index quality. Perhaps a Cuill robots ban is a quality signal? :)

Posted on April 8, 2008 8:28 AM

Did Powerset outsource their crawl?

I've been seeing Zermelo, Powerset's crawler, hitting my pages. Sort-of:

ec2-67-202-8-249.compute-1.amazonaws.com - - [28/Mar/2008:23:31:06 -0700] "GET /2006/12/scale_limits_design.html HTTP/1.0" 200 11526 "http://www.skrenta.com/2006/12/i_took_a_ukulele_lesson_once.html" "zermelo Mozilla/5.0 compatible; heritrix/1.12.1 (+http://www.powerset.com) [email:crawl@powerset.com,email:paul@page-store.com]"

They're using the open-source Heritrix crawler, running out of Amazon Web Services. But who is page-store.com?
From their site:

Vertical search sites are relatively costly to operate. A single vertical search engine may need to sweep all or a large part of the web selecting the pages pertinent to a small set of topics. Startup and operating costs are proportional to the input page set size, but revenue may be only proportional to the size of the selected subset.

Page-store positions itself as a web wholesaler, supplying page and link information to vertical search engine companies on a per-use basis. The effect is to level the playing field between vertical search and general horizontal internet search.

Page-store can provide selected page feeds based on deep web crawls page metadata black-box filters anchor text results link information

Did Powerset outsource their crawl?

Posted on April 7, 2008 8:55 AM

NFS server %s not responding still trying

:)

Posted on March 12, 2008 10:27 PM

Who will stop Google from going to 90% market share?

Jason predicts Google going to 90% market share. He makes a solid argument and covers the bases. Referred traffic today suggests Google is at about 85%. Ask just quit the game, msn/yahoo put themselves into a tarpit. So the field is Google's...

The only thing that can change this is new players. A string of uninteresting search attempts and lackluster competition have convinced people that it's impossible to stop Google's ascent.

Google may have a network effect on ads, but the switching costs for the search app itself are small. Easier than switching free email providers. It's just another content site, and users are willing to try new search engines. There just haven't been any interesting new ones to try in a long time.

I was hopeful that Wikia would launch something interesting and break the n-game losing streak of the upstarts, but sadly it was another shallow effort.

I'm rooting for Cuill next. They have a very credible team.
Anna built the current version of Google, and now she's working on the next gen. If they launch something interesting in any dimension, they'll show the market that you don't need a million servers and half of the PhDs in the field to build a search app. It takes 20 people and $5M of hardware...if you know what you're doing.

Posted on March 6, 2008 9:53 AM

The real reason Google's clicks are flat

From SEO Black Hat:

Google reduced the clickable area on Adsense text ads ... Before, a user could click anywhere on the ad and be brought to the destination. After the changes, users have to click on something that looks like a hyperlink.

"The CTR on text ads declined about 60% in the last 2 months with Googles changes, Image ads on the other hand stayed the same."
- January 4th, 2008 Marcus of Plentyoffish.com

4 months later, that little back and forth in the Google Rec Room shaved about $85 Billion (with a B) in market capitalization.

But it wasn't as stupid an idea as it might seem. You see, Adsense works in a Quasi-market place environment. The market will bid up the cost per click once the adjustment for accidental clicks is readjusted. Right now, marketers should be getting a better value per click as a higher percentage of the clicks are "real" or intentional. That will lead to higher bids per click and ultimately should be close to a break even for GOOGs bottom line.

Is the Sky Really Falling?

The problem is that in the interim, GOOG gives almost no Guidance to the stock market. Mutual Fund types are really too thick to grasp exactly what's going on, so they think that this "slowing" in the growth has to do with the potential recession affecting GOOG.

Meanwhile, the real story is that Online Advertising Spending will continue to grow at about 30% per year for at least the next 3 years and GOOG is poised to take a disproportionate amount of that growth even if nothing else they do is even marginally successful.
Posted on February 27, 2008 2:14 PM

Lamport's Bakery Algorithm

This paper describes the bakery algorithm for implementing mutual exclusion. I have invented many concurrent algorithms. I feel that I did not invent the bakery algorithm, I discovered it. Like all shared-memory synchronization algorithms, the bakery algorithm requires that one process be able to read a word of memory while another process is writing it. (Each memory location is written by only one process, so concurrent writing never occurs.) Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write.

I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.

...

What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion.
Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location. So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion. Such an algorithm cannot really be said to solve the mutual exclusion problem. Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable--that you could implement mutual exclusion only by using lower-level mutual exclusion. Brinch Hansen said exactly this in a 1972 paper. Many people apparently still believe it.

...

For a couple of years after my discovery of the bakery algorithm, everything I learned about concurrency came from studying it. ... The bakery algorithm marked the beginning of my study of distributed algorithms.

-- Leslie Lamport

I find this story fascinating. Lamport has invented a bunch of cool algorithms. But here he describes having "discovered" the