Inside Amazon’s new ‘Just Walk Out’: AI transformers meet edge computing
James Thomason@jathomason
July 31, 2024 6:07 PM
On the first floor of an industrial modern office building, we are among a select group of journalists invited into a secretive lab at Amazon to see the latest Just Walk Out (JWO) technology.
Now used in more than 170 retail locations worldwide, JWO lets customers enter a store, select items, and leave without stopping to pay at a cashier, streamlining the shopping experience.
We’re about to see the new AI-based system Amazon has developed, which uses multi-modal foundation models and transformer-based machine learning to simultaneously analyze data from various sensors in stores. Yes, this is the same fundamental technique used in large language models like GPT, only instead of generating text, these models generate receipts. This upgrade improves accuracy in complex shopping scenarios and makes the technology easier to deploy for retailers.
Our host is Jon Jenkins (JJ), Vice President of JWO at Amazon, who leads us past the small groups of Amazon employees sipping coffee in the lobby, through the glass security gates, and down a short dark hallway to a nondescript door. Inside, we find ourselves standing in a full replica of your local bodega, complete with shelves of chips and candy, refrigerators of Coca-Cola and Vitamin Water, Orbit gum, and various odds and ends.
Aside from the electronic gates, and a latticework of Amazon’s specialized 4-in-1 camera devices above us, the lab store otherwise appears to be a perfectly ordinary retail shopping experience – minus the cashier.
Photo: We couldn’t take photos in the lab, but here’s the real deal JWO store across the square
How JWO works
JWO (they say “jay-woh” at Amazon) uses a combination of computer vision, sensor fusion, and machine learning to track what shoppers take from or return to shelves in a store. The process of building a store begins by creating a 3D map of the physical space using an ordinary iPhone or iPad.
The store is divided into product areas, which are discrete spaces that correlate with the inventory of products. Then, RGB cameras are installed on a rail system hanging from the ceiling, and weight sensors are installed at the front and back of each shelf.
Photo: In the real JWO store cameras and sensors are suspended above the shopping area
JWO tracks the orientation of each shopper’s head, left hand, and right hand to detect when the shopper interacts with the polygon that bounds a product area. By fusing the inputs of multiple cameras and weight sensors, together with object recognition, the models predict with great accuracy whether a specific item was retained by the shopper.
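To make the idea of fusing hand tracking with shelf weight concrete, here is a minimal rule-based sketch. It is not Amazon’s method, which relies on learned models; the `ProductArea` type, the `items_taken` function, the coordinates, and the weights are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

Point = tuple[float, float]

@dataclass
class ProductArea:
    # Hypothetical record: outline of one product area and its item weight.
    name: str
    polygon: list[Point]
    unit_weight_g: float

def point_in_polygon(p: Point, polygon: list[Point]) -> bool:
    """Ray-casting test: toggle 'inside' at each edge a rightward ray crosses."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def items_taken(area: ProductArea, hand_positions: list[Point],
                weight_before_g: float, weight_after_g: float) -> int:
    """Count items only if a hand entered the area AND the shelf weight changed.

    Positive result: items taken. Negative result: items put back.
    """
    reached_in = any(point_in_polygon(h, area.polygon) for h in hand_positions)
    if not reached_in:
        return 0
    delta_g = weight_before_g - weight_after_g
    return round(delta_g / area.unit_weight_g)

chips = ProductArea("chips", [(0, 0), (2, 0), (2, 1), (0, 1)], unit_weight_g=140.0)
print(items_taken(chips, [(1.0, 0.5)], 1000.0, 720.0))
```

A rule like this breaks down quickly when two shoppers reach into the same area or a camera view is blocked, which is the kind of case the article says the learned model is meant to handle.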
JJ explains the system previously used multiple models in a chain to process different aspects of a shopping trip. “We used to run these models in a chain. Did he interact with a product space? Yes. Does the item match what we thought he did? Yes. Did he take one or did he take two? Did he end up putting that thing back or not? Doing that in a chain was slower, less accurate, and more costly.”
Now, all of this information is processed by a single transformer model. “Our model generates a receipt instead of text, and it does it by taking all of these inputs and acting on them simultaneously, spitting out the receipt in one fell swoop. Just like GPT, where one model has language, it has images all in one model, we can do the same thing. Instead of generating text, we generate receipts.”
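One common reason a chain of models can end up less accurate than a single model is that errors compound from stage to stage. The sketch below illustrates this under two loudly stated assumptions that do not come from Amazon: each stage is correct with a hypothetical probability, and stage errors are independent.

```python
import math

def chain_accuracy(stage_accuracies: list[float]) -> float:
    """Probability that every stage in a chain is right, assuming independence."""
    return math.prod(stage_accuracies)

# Four hypothetical stages, mirroring the four questions in the quote above,
# each assumed (for illustration only) to be right 98% of the time.
stages = [0.98, 0.98, 0.98, 0.98]
print(f"{chain_accuracy(stages):.3f}")
```

Under these assumptions the chain as a whole is right only about 92% of the time, even though every individual stage is right 98% of the time.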
Image: JWO Architecture courtesy Amazon
The improved AI model can now handle complex scenarios, such as multiple shoppers interacting with products simultaneously or obstructed camera views, by processing data from various sources including weight sensors. This enhancement minimizes receipt delays and simplifies deployment for retailers.
The system’s self-learning capabilities reduce the need for manual retraining in unfamiliar situations. Trained on 3D store maps and product catalogs, the AI can adapt to store layout changes and accurately identify items even when misplaced. This advancement marks a significant step forward in making frictionless shopping experiences more reliable and widely accessible.
JWO is powered by edge computing
One of the interesting things we saw was Amazon’s productization of edge computing. Amazon confirmed that all model inference is performed on computing hardware installed on-premise. Like all AWS services, this hardware is fully managed by Amazon and priced into the total cost of the solution. In this respect, to the customer the service is still fully cloud-like.
“We built our own edge computing devices that we deploy to these stores to do the vast majority of the reasoning on site. The reason for that is, first of all, it’s just faster if you can do it on site. It also means you need less bandwidth in and out of the store,” said JJ.
VentureBeat got a close up look at the new edge computing hardware. Each edge node is an approximately 8x5x3 rail-mounted enclosure featuring a conspicuously large air intake, which is itself installed inside a wall-mounted enclosure with networking and other gear.
Of course, Amazon would not comment on what exactly was inside these edge computing nodes just yet. However, since these are used for AI inference, we speculate they may include Amazon’s custom AI accelerator chips such as Trainium and Inferentia2, which AWS has positioned as a more affordable and accessible alternative to Nvidia’s GPUs.
JWO’s requirement to process and fuse information from multiple sensors in real-time shows why edge computing is emerging as a critical layer for real world AI inference use cases. The data is simply too large to stream back to inference models hosted in the cloud.
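A back-of-envelope calculation shows why streaming raw video to the cloud gets expensive. Amazon did not share camera counts or bitrates, so every number below is a hypothetical assumption, as are the function names.

```python
def uplink_mbps(num_cameras: int, mbps_per_camera: float) -> float:
    """Sustained uplink needed to stream every camera off site, in Mbit/s."""
    return num_cameras * mbps_per_camera

def daily_terabytes(mbps: float) -> float:
    """Data volume per 24 hours at a sustained rate, in decimal terabytes."""
    seconds_per_day = 86_400
    megabytes = mbps * seconds_per_day / 8  # 8 bits per byte
    return megabytes / 1_000_000            # MB -> TB

# Assumed for illustration: 100 cameras at 4 Mbit/s of compressed video each.
rate = uplink_mbps(num_cameras=100, mbps_per_camera=4.0)
print(rate, "Mbit/s sustained")
print(round(daily_terabytes(rate), 2), "TB per day")
```

With these assumed inputs a single store would need a sustained 400 Mbit/s uplink and would ship about 4.3 TB per day, before counting weight-sensor data. Running inference on site means only the results need to leave the store.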
Scaling up with RFID
At our next stop, down another long dark corridor and behind another nondescript door, we find ourselves in another mock retail lab. This time we are inside something more like a retail clothier. Long racks of sweatshirts, hoodies, and sports apparel line the walls, each item with its own unique RFID tag.
In this lab, Amazon is rapidly integrating RFID technology into JWO. The AI architecture is still the same, featuring a multi-modal transformer fusing sensor inputs, but without the complexity of multiple cameras and weight sensors. All that is required for a retailer to implement this flavor of JWO is the RFID gate and RFID tags on the merchandise. Many retail clothing items already come with RFID tags from the manufacturer, making it all the easier to get up and running quickly.
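The RFID flow can be pictured as turning tag reads at the exit gate into receipt lines. The sketch below is a deliberately simple lookup, not the transformer-based approach the article describes; the tag IDs, SKUs, prices, and the `build_receipt` function are hypothetical.

```python
from collections import Counter

def build_receipt(tag_reads: list[str],
                  tag_to_sku: dict[str, str],
                  price_cents: dict[str, int]) -> tuple[dict[str, tuple[int, int]], int]:
    """Turn raw gate reads into {sku: (quantity, line_total_cents)} plus a total."""
    # A reader may report the same tag more than once, so de-duplicate first.
    unique_tags = set(tag_reads)
    # Ignore tags that are not in this store's catalog (e.g. the shopper's own clothes).
    quantities = Counter(tag_to_sku[t] for t in unique_tags if t in tag_to_sku)
    lines = {sku: (qty, qty * price_cents[sku]) for sku, qty in quantities.items()}
    total = sum(line_total for _, line_total in lines.values())
    return lines, total

# Hypothetical catalog and gate reads.
tag_to_sku = {"TAG-001": "hoodie", "TAG-002": "hoodie", "TAG-003": "cap"}
price_cents = {"hoodie": 4500, "cap": 2000}
reads = ["TAG-001", "TAG-001", "TAG-003", "TAG-002", "TAG-999"]

lines, total = build_receipt(reads, tag_to_sku, price_cents)
print(lines, total)
```

Because each tag identifies one physical item, no cameras or weight sensors are needed to decide what left the store, which is where the lower infrastructure cost comes from.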
The minimal infrastructure requirements here are a key advantage both in terms of cost and complexity. This flavor of JWO could also potentially be used for temporary retail inside of fairgrounds, festivals, and similar locations.
What it took Amazon to build JWO
The JWO project was announced publicly in 2018, but the project R&D likely goes back a few years earlier. JJ politely declined to comment on exactly how large the JWO product team is or its total investment in the technology, though he did say over 90% of the JWO team is scientists, software engineers, and other technical staff.
However, a quick check of LinkedIn suggests the JWO team is at least 250 full time employees and could even be as high as 1000. According to job transparency site Comparably, the median compensation at Amazon is $180k per year.
Speculatively, then, assuming the cost breakdown of JWO development resembles other software and hardware companies, and further assuming Amazon started with its famous “two pizza team” of 10 full time staff back around 2015, that would put the cumulative R&D between $250M and $800M. (What’s a few hundred million between friends?)
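The article does not spell out its arithmetic, so the following is only one possible way to arrive at figures of this order, not a reconstruction of the author’s method. It assumes headcount grew linearly from 10 in 2015 to the LinkedIn-based estimate by 2023, and that each person costs the quoted $180k median compensation per year; the linear ramp, the year range, and the `overhead_multiplier` parameter are assumptions.

```python
def cumulative_rnd_cost(final_headcount: int,
                        start_headcount: int = 10,
                        first_year: int = 2015,
                        last_year: int = 2023,
                        cost_per_head: float = 180_000,
                        overhead_multiplier: float = 1.0) -> float:
    """Sum yearly staff cost over a linear headcount ramp (assumed growth model)."""
    years = last_year - first_year + 1
    person_years = 0.0
    for i in range(years):
        # Fraction of the way from the first year to the last year.
        fraction = i / (years - 1) if years > 1 else 1.0
        person_years += start_headcount + (final_headcount - start_headcount) * fraction
    return person_years * cost_per_head * overhead_multiplier

low = cumulative_rnd_cost(final_headcount=250)
high = cumulative_rnd_cost(final_headcount=1000)
print(f"${low / 1e6:.0f}M to ${high / 1e6:.0f}M")
```

With these inputs the estimate comes out at roughly $0.2B for a 250-person team and roughly $0.8B for a 1,000-person team, the same order of magnitude as the range above. It counts compensation only; hardware, facilities, and cloud costs would push it higher.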
The point is not to get a precise figure, but rather to put a ballpark on the cost of R&D for any enterprise thinking about building its own JWO-like system from scratch. Our takeaway is: come prepared to spend several years and tens of millions of dollars to get there using the latest techniques and hardware. But why build if you can have it now?
The build-vs-buy dilemma in AI
The estimated (speculative) cost of building a system like JWO illustrates the high-risk nature of R&D when it comes to enterprise AI, IoT, and complex technology integration. It also echoes what we heard from many enterprise decision makers a couple of weeks ago at VB Transform in San Francisco: Large dollar hard-tech AI investments only make sense for companies like Amazon, which can leverage platform effects to create economies of scale. It’s just too risky to invest in the infrastructure and R&D at this stage and face rapid obsolescence.
This dynamic is part of why we see hyperscale cloud providers winning in the AI space over in-house development. The complexity and cost associated with AI development are substantial barriers for most retailers. These businesses are focused on increasing efficiency and ROI, making them more likely to opt for pre-integrated, immediately deployable systems like JWO, leaving the technological heavy lifting to Amazon.
When it comes to customization, if AWS history is indicative, we’ll likely see components of JWO increasingly showing up as standalone cloud services. In fact, JJ revealed this has already happened with AWS Kinesis Video Streams, which originated in the JWO project. When asked if JWO models would be made available on AWS Bedrock for enterprises to innovate on their own, JJ responded, “We’re actually not, but it’s an interesting question.”
Toward widespread adoption of AI
The advances in JWO AI models show the continuing impact of the transformer architecture across the AI landscape. This breakthrough in machine learning is not just revolutionizing natural language processing, but also complex, multi-modal tasks like those required in frictionless retail experiences. The ability of transformer models to efficiently process and fuse data from multiple sensors in real-time is pushing the boundaries of what’s possible in AI-driven retail (and other IoT solutions).
Strategically, Amazon is tapping into an immense new source of potential revenue growth: third-party retailers. This move plays to Amazon’s core strength of productizing its expertise and relentlessly pushing into adjacent markets. By offering JWO through Amazon Web Services (AWS) as a service, Amazon is not only solving a pain point for retailers but also expanding its dominance in the retail sector.
The integration of RFID technology into JWO, first announced back in the fall of 2023, remains an exciting development that could truly bring the system to the mass market. With millions of retail locations worldwide, it’s hard to overstate the size of the total addressable market – if the price is right. This RFID-based version of JWO, with its minimal infrastructure requirements and potential for use in temporary retail settings, could be a key to widespread adoption.
As AI and edge computing continue to evolve, Amazon’s JWO technology stands as a prime example of how hyperscalers are shaping the future of retail and beyond. Because they offer complex AI solutions as easily deployable services, the success of JWO and similar business models may well determine broader adoption of AI in everyday businesses.
venturebeat.com