
Weta Digital’s infrastructure for Avatar

Some interesting geeky info on the horsepower and tech used to help make a blockbuster like Avatar possible:

Avatar smashed all box office records with a worldwide gross in excess of $2.7 billion and still climbing. Weta Digital, the visual effects company responsible for the film, had to break some records of its own to create Avatar’s stunning 3D visuals. Weta Digital was well versed in intensive graphics rendering from its work on The Lord of the Rings trilogy and other recent films, but creating Avatar was still a tremendous technical effort.

 

Weta Digital pushed the limits of its compute and storage infrastructure far beyond anything it had done before. When it began working on Avatar in 2006, Weta Digital was just finishing production on King Kong. At that time it had roughly 4,400 CPU cores in its “renderwall” and about 100TB of storage. By the end of producing Avatar, the company had roughly 35,000 CPU cores in its renderwall and 3000TB of storage. The capacity of the RAM in the renderwall alone now exceeds Weta Digital’s total disk storage capacity at the end of producing King Kong.

 

I started working at Weta Digital in 2003 as a system administrator, just as the last Lord of the Rings movie was being completed. Since then, my primary role was lead of Weta Digital’s infrastructure team. This team is responsible for all servers, networks, and storage. It was our job to create the infrastructure that made Avatar possible, and to solve any technical problems that came along.

 

Coping with Scale

Despite the tremendous growth that Weta Digital went through during the making of Avatar, managing the change in scale wasn’t as challenging as we’d feared. Much of this was due to having a well-seasoned team that knew how to work together. The team pulled together and, when something went wrong, we jumped in and fixed it. We worked hard and, for the most part, managed to be proactive rather than reactive.

 

We quickly realized that we were going to have to take two big steps to get to where we needed to be for Avatar.

 

  • Establish a new data center. Weta Digital had been using several small machine rooms scattered across several buildings. A new data center provided a central location to consolidate the new infrastructure that we needed to add during the course of the Avatar project. (See sidebar for data center details.)
  • Implement high-speed fiber. Weta Digital doesn’t have a localized campus. Instead, our campus comprises several independent buildings throughout suburban Wellington. We implemented a high-speed fiber ring that connected these buildings with the new data center. Each building had, at a minimum, redundant 10Gbps connections, with 40Gbps EtherChannel trunks wherever storage and the renderwall needed to talk to each other.

 

These two elements gave us the physical capacity to scale our infrastructure as it grew and the bandwidth to move data freely between locations. New server infrastructure for the updated renderwall was created using HP Blade servers. With 8 cores and 24GB of RAM per blade, we were able to provision 1,024 cores and 3TB of RAM per rack. The new data center was organized in rows of 10 racks, so we built out our servers in units of 10 racks or 10,240 cores. We put in the first 10,000, waited a while, added another 10,000, waited a while more to add another 10,000, and finally put in the last 5,000 cores for the final push to completion.
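As a quick sanity check on those figures, here is a small Python sketch of the arithmetic. The per-blade, per-rack, and rollout numbers come from the article. The total-RAM estimate at the end is only an assumption (it pretends every core in the finished renderwall sits on an 8-core, 24GB blade, which the article does not state).

```python
# Figures quoted in the article.
CORES_PER_BLADE = 8
RAM_GB_PER_BLADE = 24
CORES_PER_RACK = 1024
RACKS_PER_ROW = 10

# Derived values.
blades_per_rack = CORES_PER_RACK // CORES_PER_BLADE    # blades needed for 1,024 cores
ram_gb_per_rack = blades_per_rack * RAM_GB_PER_BLADE   # 3,072GB, i.e. 3TB at 1,024GB per TB
cores_per_row = CORES_PER_RACK * RACKS_PER_ROW         # one build-out unit of 10 racks

# Rollout stages, in the rounded numbers the article uses.
rollout = [10_000, 10_000, 10_000, 5_000]
total_cores = sum(rollout)

# ASSUMPTION (not from the article): every core lives on an 8-core, 24GB blade.
est_total_ram_gb = total_cores // CORES_PER_BLADE * RAM_GB_PER_BLADE

print("blades per rack:", blades_per_rack)
print("RAM per rack (GB):", ram_gb_per_rack)
print("cores per row:", cores_per_row)
print("total cores after rollout:", total_cores)
print("estimated total RAM (GB):", est_total_ram_gb)
```

Under that assumption the estimate lands a little above 100TB, which lines up with the earlier remark that the renderwall's RAM exceeds the total disk capacity Weta Digital had at the end of King Kong.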

 *To give an idea of what goes in a renderwall, below is a server rack populated with machines. A renderwall would consist of a "wall" of many racks similar to this.

 

PE1950-Rack-tn.jpg
42U Rack
 

 

Blade server enclosures that go in a rack look like this. A rack would fit 4 of these at $32,000+ each, and that's for used units!

PowerEdge M600 Blades  
Enclosure: 10U Rack-Mount, 16 Blade Capacity
Blade CPU: (2x) Dual-Core or Quad-Core Xeon 5000 Sequence, 1066MHz/1333MHz Front Side Bus
Blade RAM: (8x) 667MHz DIMMs - Maximum 64GB
Blade HDD: (2x) 2.5" SAS or SATA Hot-Plug HDD Bays
     

 

 

Accelerating Texture File Access Times with Adaptive Caching

In the visual effects industry a texture is an image that gets applied to a 3D model to make it look real. Textures are wrapped around the model to give it detail, color, and shading so it looks like more than just a smooth gray model. A “texture set” is all the different pictures that must be applied to a particular model to make it look like a tree, person, or creature. Most renders that include an object also apply textures to the object; thus, textures are in high demand from the renderwall and they get used over and over again.

 

A given group of texture sets could be in demand by several thousand cores at any one time. An overlapping group could be in demand by another thousand cores and so on. Anything we can do to improve the speed with which textures are served has a dramatic impact on the performance of the renderwall as a whole.

 

No single file server could deliver the bandwidth necessary to serve our texture sets, so we developed a publishing process that was designed to create replicas of each new texture set after it was created. This is illustrated in Figure 1.

 

old_method.jpg

Figure 1) Old method of increasing bandwidth for texture sets.

 

When a job running on the renderwall needed to access a texture set, it chose a random file server and accessed the textures from that replica. By allowing us to spread the texture load across multiple file servers, this process improved performance significantly. While it was a better solution than relying on a single file server, the publishing and replication processes were complex and required time-consuming consistency checks to make sure that replicas stayed identical.
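The random-replica idea can be sketched in a few lines of Python. This is only an illustration of the technique, not Weta Digital's actual tooling: the server names and the mount path layout are made up for the example.

```python
import random
from collections import Counter

# HYPOTHETICAL file server names; the article does not name its servers.
REPLICA_SERVERS = ["fs01", "fs02", "fs03", "fs04"]


def pick_replica(servers, rng=random):
    """Return one replica server chosen uniformly at random."""
    return rng.choice(servers)


def texture_path(server, texture_set):
    """Build a HYPOTHETICAL mount path for a texture set on a given replica."""
    return f"/mnt/{server}/textures/{texture_set}"


# Simulate many render jobs each picking a replica, to see how the load spreads.
rng = random.Random(42)
counts = Counter(pick_replica(REPLICA_SERVERS, rng) for _ in range(10_000))
for server in REPLICA_SERVERS:
    print(server, counts[server])

print(texture_path(pick_replica(REPLICA_SERVERS, rng), "tree_bark"))
```

With uniform random choice, each replica ends up with roughly the same share of requests, which is where the bandwidth gain comes from. It also shows the catch: every replica has to hold an identical copy, or jobs will see different data depending on which server they happen to pick.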

 

We started looking at NetApp FlexCache® and the SA600 storage accelerator as a simpler way of solving the performance problems created by texture sets. FlexCache software creates a caching layer in your storage infrastructure that automatically adapts to changing usage patterns, eliminating performance bottlenecks. It automatically replicates and serves hot data sets anywhere in your infrastructure using local caching volumes.
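To make the caching idea concrete, here is a toy read-through cache in Python. It is not how FlexCache is implemented: the least-recently-used eviction policy, the class name, and the dictionary standing in for the origin file server are all assumptions chosen to keep the sketch short.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Toy read-through cache with least-recently-used (LRU) eviction."""

    def __init__(self, origin, capacity):
        self.origin = origin            # dict standing in for the origin file server
        self.capacity = capacity        # maximum number of cached entries
        self.entries = OrderedDict()    # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def read(self, name):
        if name in self.entries:
            self.entries.move_to_end(name)    # mark as most recently used
            self.hits += 1
            return self.entries[name]
        self.misses += 1
        data = self.origin[name]              # fetch from the origin on a miss
        self.entries[name] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return data


# Ten textures on the "origin", but room for only three in the cache.
origin = {f"tex{i}": f"pixels-{i}" for i in range(10)}
cache = ReadThroughCache(origin, capacity=3)
for name in ["tex0", "tex1", "tex0", "tex2", "tex0", "tex3", "tex1"]:
    cache.read(name)

print("hits:", cache.hits, "misses:", cache.misses)
print("cached now:", list(cache.entries))
```

The point of the sketch is that nobody decides in advance what gets cached. Whatever the clients keep asking for stays in the cache, and entries that go cold fall out on their own.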

 

Instead of manually copying our texture data to multiple file servers, FlexCache would allow us to dynamically cache the currently popular textures and serve them to the renderwall from the SA600s. We tested the solution and saw that it worked extremely well in our environment, so eight months before Avatar was due to be completed we took a gamble and installed four SA600 systems, each with two 16GB performance accelerator modules (PAMs) installed. (PAM serves as a memory cache to further reduce latency.)

 

improved_method.jpg

Figure 2) Improved method of increasing bandwidth for texture sets using NetApp FlexCache, SA600, and PAM.

 

The total texture set was about 5TB, but once FlexCache was in place we discovered that only about 500GB of that was hot at any given time. Each SA600 had enough local disk to accommodate the hot data set and, as the hot data set changed, the caches adapted without us having to do anything. Aggregate throughput was in excess of 4GB/sec, far more than we’d ever achieved before.
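A back-of-the-envelope look at those numbers, as a Python sketch. The inputs are the figures quoted in the article (four SA600s, two 16GB PAMs each, about 5TB of textures, about 500GB hot, at least 4GB/sec aggregate). The even split of throughput across the four systems and the decimal TB conversion are assumptions made for the example.

```python
# Figures quoted in the article.
TOTAL_TEXTURE_TB = 5
HOT_SET_GB = 500
NUM_SA600 = 4
PAMS_PER_SA600 = 2
PAM_GB = 16
AGGREGATE_GB_PER_SEC = 4   # the article says "in excess of", so treat this as a floor

# Share of the texture data that was hot at any one time (using 1TB = 1,000GB).
hot_fraction = HOT_SET_GB / (TOTAL_TEXTURE_TB * 1000)

# PAM memory cache per system and across all four systems.
pam_gb_per_system = PAMS_PER_SA600 * PAM_GB
pam_gb_total = NUM_SA600 * pam_gb_per_system

# ASSUMPTION: load is spread evenly across the four SA600s.
avg_gb_per_sec_per_system = AGGREGATE_GB_PER_SEC / NUM_SA600

print(f"hot fraction: {hot_fraction:.0%}")
print("PAM per system (GB):", pam_gb_per_system)
print("PAM total (GB):", pam_gb_total)
print("average throughput per system (GB/sec):", avg_gb_per_sec_per_system)
```

So only about a tenth of the texture data needed to be close to the renderwall at any given moment, which helps explain why a relatively small caching tier could carry the load.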

 

Caching textures with FlexCache was a superb solution. It made things run faster, and simplified the job of managing texture sets. We were in the final year of a four-year movie project. If we had put the SA600s in and had problems we couldn’t resolve quickly, we probably would have had to rip them back out. But after a week had passed we pretty much forgot about them until the end of the movie. That’s about as happy as you can make an IT guy.

 

Storage performance has a big impact on the speed at which renders happen. Storage bottlenecks can choke the throughput of your render farm. In the final years of Avatar we started digging into what that meant and added lots of monitoring capabilities and statistics on every job.

 

There was a constant backlog of jobs waiting to run; each day there would be many more jobs waiting for render time than the wall could actually complete. Weta Digital’s team of “wranglers” monitors the jobs to make sure everything happens as it should. The morning after we put FlexCache in, the lead wrangler came into my office to report that everything had finished. It had run so fast he figured we’d broken something.

By Adam Shand, former lead of Weta Digital’s infrastructure team

 

Slick External Hard Drive From Samsung

It's kind of odd for us to be choosing external boxes since it's normal to do internals in a tower, but when storage came up yesterday, these were just too slick to resist.
The price was less than an internal OEM drive, so heck, when the PSU dies...which it always eventually does on these things...just strip the drive out and pop it in a dock or tower.


Another option (would void the warranty of course) is to strip the drives right away and swap them with the smaller drives that we now have sitting in a server tower.

Meh! fine as is for now methinks. :)

 

Hats off to Newegg! We ordered them yesterday aft. and they arrived this morning. YMMV, but that's the fastest turnaround I've EVER experienced!

 

Very slick design overall...The power dial (reminds me of the knobs on audio gear...sweet) has an on/off click and then dials up the brightness of a white LED that throws off a cool glow from underneath the case body.

 

 

 


5 Cool things about personal cloud fileservers

We've been using Dropbox for a while now and it's great for sharing files with team members on a collab project, and for syncing files you want accessible among various boxes/locations. It even has an iPhone app for mobile access.

For example, the other day I was sending some render output frames to the dropbox folder, and a pal who requested the frames was able to grab them, "live", as they were syncing in his computer's dropbox folder on the other side of the globe.

Only problem was I have the free account with a 2.5 gig limit, and I was uploading 3 gigs of data. Dropbox gives you a nice little notification that you are near your limit and a link to upgrade: $99 a year for 50 gigs. I almost took the dive, but decided to do some file juggling to make it work with 2.5 gigs and take a look around at some file sharing appliances I had investigated a while back.

A couple of devices have been out in the wild for a while now: Pogoplug and TonidoPlug. Both are basically small computers that run Linux and allow you to plug in external hard drives via USB. Both are under $100 and allow you to share files from your own location with virtually limitless storage...take that, Dropbox upgrade! :P


Below are some of the things I found interesting about these units:

  • Power-efficient green computing device that typically uses 5W - 13W.
  • Runs a personal file and app server. This means team members outside your local network can mount shared folders as local drives (using WebDAV) for drag-and-drop download and upload support.
  • Automatic or manual backup of all your computers, with a watch/scan and sync option!
  • Access browser-based interface applications using a unique URL from anywhere.
  • Both have an iPhone app.
  • Open API for adding functionality. Neither has it yet, but I'd like to see a widget I can place in our website sidebar, like an RSS feed, and tie that to a folder on one of our drives. The widget would publish updated files over RSS for people to grab from the sidebar.
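For the curious, the RSS-widget idea from the last bullet could look something like the Python sketch below. It does not use the Pogoplug or TonidoPlug API at all. It only scans a local folder and builds an RSS 2.0 document from the newest files, and the function name, base URL, and sample file names are all made up for the example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from email.utils import formatdate
from urllib.parse import quote


def build_feed(folder, base_url, title="Shared files", limit=10):
    """Return an RSS 2.0 document (as a string) listing the newest files in folder."""
    with os.scandir(folder) as it:
        files = [entry for entry in it if entry.is_file()]
    # Newest files first, by modification time.
    files.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = f"Recently updated files in {title}"

    for entry in files[:limit]:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = entry.name
        ET.SubElement(item, "link").text = f"{base_url.rstrip('/')}/{quote(entry.name)}"
        ET.SubElement(item, "pubDate").text = formatdate(entry.stat().st_mtime, usegmt=True)

    return ET.tostring(rss, encoding="unicode")


# Demo with a throwaway folder and a HYPOTHETICAL share URL.
with tempfile.TemporaryDirectory() as folder:
    for i, name in enumerate(["frame_0001.exr", "frame_0002.exr"]):
        path = os.path.join(folder, name)
        with open(path, "w") as f:
            f.write("placeholder")
        os.utime(path, (1_000_000 + i, 1_000_000 + i))  # give each file a distinct mtime
    print(build_feed(folder, "http://example.invalid/shared"))
```

A sidebar widget would then only need to fetch and render that feed. Whether either device's API exposes file listings in a way that makes this practical is something we haven't checked.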


We'll be testing these and will post an update on usability in a team environment. Stay tuned next week for Pogoplug updates. The second unit is out of stock at the time of writing this post.

Oh, I almost forgot to mention, the pogoplug comes in pink, which a friend pointed out would "go well with my Barbie collection"...funny, funny stuff!

pogoplug personal fileserver