Sam Gustman – “Cloud Archives” at the University of Southern California

October 23, 2019 By Stanley Isaacs


>>[Background Sound] Good
afternoon and welcome. I’m Denise Anthony, Director of the Institute for Security Technology and
Society here at Dartmouth. I am so happy to welcome you
inside on this beautiful day, but believe me it’s really worth it. I — this talk is sponsored by ISTS, as well
as, Department of Film and Media Studies, and our great thanks to Mark Williams of
the Department of Film and Media Studies for really getting the ball rolling on
all of this and inviting Sam here today. And also the Jones Media Center and the
Dartmouth Library and Digital Archives in the Dartmouth Library are also a co-sponsor to bring this event today
and so we thank all of them. And as you know we are here to welcome and
hear from Sam Gustman of the Shoah Foundation and to welcome him back to Hanover. Sam, for some of you who don’t know,
is a Hanover son, welcome him back. We’re glad to have him here. Sam is the chief technology officer of the
Shoah Foundation where he has been since 1994 and was responsible in 2006 for
the movement of the foundation, the archives from Universal
Studios to the University of Southern California where they are today. Sam is also Associate Dean at the USC Libraries
and holds a faculty appointment there, as well. As you’re going to see, if
you don’t already know, the just incredible accomplishment
of the Shoah Foundation. So as Chief Technology Officer, Sam is
responsible for the operations, preservation, and cataloging of the institute’s eight petabyte
digital library, which is not surprisingly one of the largest public video
databases in the world. He also leads the USC Digital Repository,
which is responsible for providing services to the USC community and also to
organizations around the world to help them manage their archives. He has grants from the National Science
Foundation, from many other projects that have been responsible for the
incredible archiving and cataloging that is the Shoah Foundation
that Sam is going to talk to us about today and he’s going to explain this. But for anyone who doesn’t know what the
Shoah Foundation is, the USC Shoah Foundation, the Institute for Visual History and
Education is dedicated to overcoming prejudice, intolerance, and bigotry
through the educational use of the institute’s visual history testimonies. And the institute is the custodian
of the Visual History Archive, a collection of over 51,000
audiovisual testimonies from Holocaust survivors and other witnesses. So it is just a great honor and pleasure
for Dartmouth to welcome Sam Gustman. [ Applause ]>>Thank you, everyone, for
having me here at Dartmouth. It’s been a wonderful few days meeting
with all of the various groups. I would like to start just by playing
a video about the Shoah Foundation, its founding after Schindler’s List by
Steven Spielberg just to give some context and set the tone for the actual talk.>>[Background music] Why did you decide
to tell us the story of your experiences?>>Because every survivor has a story to tell. Some people think, “Well,
I’ve heard the story before,” but from every story you learn a little bit.>>I remember talking to a lot of
Holocaust survivors during the production of Schindler’s List.>>How did you survive?>>By miracle.>>It was one of the first moments that
made me realize that there were many, many stories that needed to be told. That’s how the Shoah Foundation began.>>It was a tremendous undertaking. There were thousands of people all
over the world who came together to collect the 52,000 survivor testimonies. [Inaudible Foreign Language]>>The location is London, England. [ Inaudible Foreign Language ]>>[Background noise and music]
The original collection was taken in 56 countries in 32 languages. The survivors gave us their testimonies so that we would tell the world
and educate future generations.>>[Background music] I’m of an age where
I don’t know how much longer I’ll be here. I felt it’s important that it’s not forgotten. [Inaudible Foreign Language]>>[Background music] The survivors who
gave us their stories trusted us to hold on to those stories and teach
with them far into the future. So it’s our responsibility to preserve them, to make sure that they’re here
now and 100 years from now.>>What better place to put these testimonies
than at a great research university so that scholars from all
disciplines could use these materials? Things that will die on the page
of a history book will come to life when you see them in the
context of this archive.>>This archive is based at USC, but we have
many university partners around the world. We work with faculty and
students in many disciplines. Students are watching testimonies in their own
languages, learning what it means to be human and to go through that experience and to make
it part of their story in the world today.>>We’re going to be listening to individual
stories because history happens to individuals.>>Two little boys came and they were
looking at my father, just staring at him. He said, the one said, “Well,
we were told when you look a Jew in the eye you can see the devil in them.” He says, “I just wanted to see if
that’s true,” a little 18-year-old kid.>>Genocides have not stopped
since the Holocaust. It’s really important to expand
the archive with new content.>>We’re now collecting testimony
from a variety of different genocides that have taken place in recent history. We really can’t compare human suffering,
but the causes and consequences of violence in genocidal societies we need
to understand better and deeper.>>All these examples of intolerance and racial
hatred have to be taught so the young people of the next generation will never
allow this to ever happen again.>>Young people are forming who they want to be. They’re developing their identity. They’re figuring out what they think
about things, how they should be behaving.>>I always feel guilty like if something
happens to someone else and I just sit by and watch because I don’t want to get involved and I think it’s important
that you do get involved.>>There will be occasions when we have to make
the choice; are we bystanders or do we act?>>Our attitudes really do matter
to the way in which the world works. If we can’t make this work in our own
backyard, we’ll never make it work in the world.>>When you see an injustice, stand
up and speak up, and take a stand.>>Try to understand each other and if you cannot love each other,
at least respect each other.>>We have a message. We were there. We can talk about it.>>I’ve seen young people, young
students who have seen these testimonies. It does change them and that always
gives me hope that the more we get this out into the world, thousands
and someday millions of young people will be in
some way changed by this.>>These survivors who are now
educators, they can change the world. [ Music ]>>So again, I’m Sam Gustman,
I’m the Chief Technology Officer at the USC Shoah Foundation Institute. I’ve been working there since 1994. And in 1993 when “Schindler’s List” first came
out Steven Spielberg started getting thousands of phone calls from survivors who
wanted to tell him their story. He decided to just try and go get as
many of those stories as he could. It became a $250,000,000 project set up in
56 countries around the world collecting, at its height, around 12,000 interviews per year. We ended up collecting a large number of videos. We started — we ended up with 235,000 tapes
for the 52,000 testimonies from 57 countries in 34 — not 56, 57 countries, 34 languages. We had 3,000 employees; interviewers and
videographers working around the world. We ended up with a cataloging system where
we broke every video down minute by minute so each minute was searchable
like it was its own webpage. So we have 7.5 million clips of video that the
archive’s been indexed into, using 62,000 terms. When the video was actually taken, people
would talk about other people in their lives and we indexed all of them, as well. We have 1.2 million people that are brought
up or mentioned in the archives themselves, 46,000 locations that people talk about,
and a half a million images that they show of their family and other areas
that were of importance to them. The interviews themselves are
broken down into Jewish survivors (we interviewed over 49,000), rescuers and aid
providers; we interviewed gypsy survivors, people in the military, political prisoners,
and a few rare interviews we were able to get with groups like homosexual
survivors to round out the archive. We’ve started on other genocides. So what happened was we did the
collection from 1994 through 2000 of the Holocaust survivor testimonies. These are some new ones that we started taking. We’re doing two things; we’re going back and
finding archives on the Rwandan genocide, Cambodian genocide, Armenian genocide
that existed and we’re bringing those in and starting to process those. But we’re also starting to gather new
testimony slowly from these groups and these were our first initial tests in
these areas to bring in those testimonies. So from ’94 to through 2000 we
did the holocaust testimonies. We indexed them from 1998
through 2005, and then in 2006, Steven Spielberg took the Shoah Foundation
that was the non-profit that he started and he gave it to the University
of Southern California. And when it was given to the
University of Southern California, it was decided a few different things, but
one of them was that they were going to expand into doing these other histories. So we’ve actually started on the Rwandan
genocide, Cambodian genocide, Armenian. We’ve also started on the Sudan this year. We’re going to be starting a project on
collecting testimonies from Sudanese survivors and we’re starting to look
at Serbia-Croatia, as well. [ Pause ] When we collected a testimony
we had a paper slate up front which introduced the survivor,
they then introduced themselves. And they were also introduced
by the person interviewing them and we talked during the interview about their
life before the war, during the war, and after, and then we asked them to show
photos or artifacts of interest and talk about them on the video. We also then had them bring their family
out and we also did a number of interviews where they would go out to do site visits
to talk about what they experienced at the actual physical locations
that they were at. When the testimonies came back into the
foundation, we made a number of copies of it and then we started cataloging
those interviews. By cataloging, what we did is at a top level
for each interview we put in basic information like people’s date of birth, their
experience, their religious affiliations. But then we also started to treat each
minute of video like it was its own web page. And on each minute of video we would put
geographic locations, latitude and longitudes, types of places, dates, historic events, any
people, organizations, all different kinds of elements and objects that are talked about in
those videos are added to each of those minutes so that you could search to them
and — through many different ways. And what we also then started to do after we
did the cataloging was start preserving them. So one of the other issues we started
running into is that everything rots. Film, conservatively you get 50 years, videotape
20 years, computer hard drives five years, LTO data tape three years, DVDs
two years before we see rot so basically the newer the technology
the faster we’re seeing rot in it. And that’s really difficult because we had
taken all these videotapes in 1994 and we knew by 2014 they’d start to see age-based damage. So we started to build mass robotic systems that
can process 80,000 tapes a year and take all of those videotapes, turn
them into computer files, and store them in very large computer systems. But the issue we had was as soon as we
started digitizing things and putting them onto computer systems we knew those were
rotting even faster than the original videotape so we had to build two systems;
one to turn everything into files, one to stop computer files from rotting. And so we do that through
what we call mass migration. Once all these files are in an
actual preservation system we’re able to do a fingerprint that’s called a
[inaudible] of each of the actual files and so what you do is you run a file through
this algorithm, it gives you a unique number. If you run that file through
that algorithm again and it gives you a different number you know one
of the ones and zeros in that file have changed and that the file is starting to see damage of
some kind, most likely due to age or some kind of — it could also be a computer error. So what we do is we constantly run
the files through this fingerprint, and as soon as we find a file on
a tape cartridge and we have — we have thousands of these that are storing
the digital data that has an error in it, we load its twin up because we
keep four copies of everything. We have two in these large tape robots and
then we have two in larger disk storage, cloud storage systems, which
I’ll talk about in a bit. But what happens is if we discover that there’s
a file or a piece of video that’s starting to rot, we’ll load up its twin, copy over onto a
brand new piece of media the exact same content and we’ll throw away the actual
data tape that’s beginning to rot. What we also do is we don’t trust
any media more than three years. As soon as a piece of media in our archive
hits three years old, we migrate off of it, check all the bits again, and we throw
away the three-year-old piece of media. So we’re constantly migrating to actually
preserve the files that the video are stored in and the media that they’re stored on. Once we’ve done the cataloging and
we’ve actually done the preservation of all the materials, we begin to provide
access to them to the general public, to universities, and to high schools. And we do that through various different sites. We have a generic website, which has some clips. We have a YouTube channel, which
goes out to the general public, but we have the Visual History Archive, which
Dartmouth is just about to subscribe to, which gives access to all 700,000 hours of
the video, 52,000 interviews instantly online. And so that will be here
hopefully within the month. And let me show you what it looks like by
showing you an actual demonstration version or a public version that has access to 1,100 of the 52,000 interviews, but
gives similar functionality. [ Pause ] So this is called the Visual History Archive and what we’re doing here is we’re
providing actually many different searches. A search is a strange thing,
the more structured a search is, the easier it is to narrow things down. But everyone loves Google which
is completely unstructured. So we offer the researchers the unstructured
search down to more structured searches. So I’ll start with the most unstructured search,
a quick search, where we could just type words. What I’m able to do is type in the word, “Food”
and what it’s doing on the left is it’s going through all 52,000 interviews in all
32 languages and it’s saying, “Okay.” If someone’s talking about
food in Russian, that’s a hit. If they’re talking about
it in Czech, that’s a hit. If they’re talking about it in
Hebrew, that’s a hit, or English. So there’s 29,000 testimonies regardless of
language where someone’s talking about food — and that’s what’s just happened here. I’m going to search on some other
— I’m going to put in, “Hiding.” So now it’s searching on food and hiding
and I’ll move it down to 17,000 — I’m going to type in Auschwitz
and try and narrow it a bit more. And then on the left, you’ll
see we’re going down to 4,700. I’m going to click on — so there’s 4,700 out of the 52,000 interviews we
have that talk about these things. But I’m just going to click on the ones that
we have available for this demonstration out of the 11 — out of the 1,100 we have
available, 208 of them meet this criteria. [ Background Noise ]>>I’ll just move this to full screen so it’s
a little easier to see and move around in. So what I just queried on was food, hiding and
Auschwitz, food which is in blue, hiding — which is in green on my screen and yellow
up here and Auschwitz which is in orange. And what I can do is look at some —
for instance, this first interview here and what it’s telling me is that
at minute 141, next to food — next to the concept of food provision
where we queried on food and then at 141, Renata Adler [phonetic] is going
to be talking about food there. She’ll talk about it at minute
40 on food rationing. She’ll talk about it in respect
to ghettos in minute 84. She’s going to talk about
hiding people at minute 18. She’ll talk about these other sub camps
of Auschwitz at minute 115, 114 and 134. So if I want to watch her
testimony from the beginning, I can click on her picture and her name. But if I’m interested, let’s say, in hiding
people, I can click here on this 18 right here. And we’re going to jump to minute 18
where she talks about hiding people.>>Did you ever celebrate any of
the holidays like Passover or…>>Not in the house we couldn’t.>>So sometime during that minute,
she’ll start talking about it but — and we’ll watch some more testimony in a
little but that just gives you the idea of how the search engine works
and how we’re able to jump around. Well, the other ways we’re able to move through
the archive is we can look at the key words in alphabetical order and we can start moving
through those sort of like the index at the back of a book but instead of page
number, we have minute numbers. So that if I’m interested in something
like forced-labor, I can click here, jump to minute 54 and listen to her
talk about forced-labor at minute 54. So we’re now treating each minute of the
video as its own entity with its own keywords that we’re able to move to
and start watching right away. And so we’re now able to jump
through video similar to the way that you jump through text on the web. I’m also able to look at it in
ways different than just key words. We indexed all the people. Again, we talk about 1.2 million people
that are talked about in the archive and out of the 50,000 interviews. These are all the people that this particular
survivor talks about during her interview and what minute she talks about them at. [Background speaking] If I click
here, she talks about her mother…>>This picture of my mother
Greta [inaudible]…>>And actually one of the things that we
— the survivors did was they showed us, if they wanted to, pictures or
artifacts that they were interested in and they talked about them in the video. So we actually went in as they were
talking about them in the video and took snapshots to make a slide show. So these are some snapshots from this particular
person’s testimony we just saw her mother there. And so what you can do again is if you find
a snapshot of interest, you click there and it jumps to that moment in video where that actual [background
speaking] snapshot’s being talked about. So we index in by key words, people’s
names and images into all the video at this particular point to be able
to move through all the content. [ Background Noise ] We have different kinds of searches
that we enable with various structures. That was the sort of the — loosest one but we
can start to query based on people for instance. So if I type my name in, Sam, we
index by what we call complex objects and I’ll show you what that means in a second. So there’s 8,500 people who said they had — who had the first name Sam in the
archive in 65 — 6,400 interviews here. What I’m able to do is I’m able to
highlight this first person here and this one’s interesting because
it’s hit on Caroline Richardson for her husband George Richardson and you don’t
see the name Sam anywhere until you scroll down. We didn’t just capture people’s
names at the time of the interview; we captured it throughout their lives. So if they were born with a
different name, we captured that. If it was a woman who got
married, we’d have her maiden name. In this particular case, George Richardson was
actually Samuel Ryke [phonetic] when he was born and when he came over during the war, he
changed his name to George Richardson. So the reason Sam was a hit on this particular
interview was because he had changed his name. [ Background Noise ] We also started to build queries around
the various experiences that we captured, so Jewish survivors, if you’re interested
just in that particular area or people in the military, you could just
query on that particular area. So what I can do here is I can go in and
start building a set of questions instead of having a whole form to fill out, we
let you pick what parts you want to do. I’ll do a very simple one here
for instance on City of Birth. I’ll type in Radom, Poland and I’m going
to pick it and it’s going to tell me that there’s 222 interviews where someone
says they were born in Radom, Poland. What I can do is start to narrow that down
if I want, let’s do length of interview. So out of those 222, if I’m
interested, 104 were less than two hours. 116 are two to six hours. Two of those 222 are six to 10 hours. We don’t have any of those 222
that are more than 10 hours. So as a researcher, I may be interested
in the longer ones, I can click there. And I can now go look at those two interviews. Before I jump in, if I want to double check,
I can look at the actual top level record for this interview and I see that he
was born in Radom, Poland and I can see that Saul Adler also — these are his
experiences I’m about to hear about, what ghetto he was in, what camps he was
in, whether he was in hiding et cetera. We then offer even more structured
searches, so search — we’ve standardized on something
called Z39.19, the basic idea of it is when you make your database,
any kind of database, you need what are called whole-part, inheritance, and associative relationships
to put sets of things together. So if you had the word “Car,” the
tires and engine are a part of a car. A luxury car is a kind of a car, the
driver’s associated with the car. We have 62,000 terms that we’ve tied together by their whole-part, inheritance,
and associative relationships. And by applying those terms to each minute
of the video, it allows us to bring back sets of video and do all kinds
of different intersections with them or presentations of them. So at a very basic level, this is exposing the
whole-part relationship so I just opened health down to mental health down to bystander regrets
and I’m able to see that there’s 52 interviews where they talk about regrets as a
bystander and these are all the synonyms for those actual terms and at the bottom
is the definition of the use of it. I’m going to do a quick query
on forced-march food… [ Background Noise ] …and I’m going to intersect that
with hunger so I get some hits. [ Background Noise ] And so you’ll see over here,
there’s 23 interviews and 43 one-minute clips where these came back. What I can do now is go take a look at these
testimonies and here you’ll see this gentleman, William Shapiro, talk about forced-march,
hunger, start talking about it at minute 68 and really get into forced-march,
food at minute 70. So I’m going to click and we’ll listen for two
or three minutes to this clip of testimony. [ Background Noise ] [ Recording ]>>…precise number which are many — many marches into a [inaudible]
restaurant because it’s a very large room. Very coarse and [inaudible], this room. [ Pause ] I was — I was in pain and I had my head
down and no thoughts or anything else. I was just in pain. And a poor woman had been
shot into the plain room. And then the next morning as we were being
marched out, I began to feel pangs of hunger and I was thirsty because I really hadn’t had
anything to eat for 24 hours and I was so — I began to feel pangs and those pangs
never left me for the next four months. And they began to march, we were marching
alongside the road and the German troops in their cars in the middle and we were
in the ruts on the side of the road. And people staggering and I
just had this terrible headache and we were marching and walked back. I wouldn’t say marching,
we were really walking back and then the hunger pangs were getting worse. No water — nothing’s available. And I think we must have walked a whole
day before we were allowed to sit down and I don’t think we got any
food until the day later. And what I got was a piece of brown bread. It was not the same bread I got when
I was in the concentration camp. It was hard, dark bread and there was no water. And I remember sharing that bread
as we were walking farther back. And eventually we got the
first taste of the green soup. I didn’t have anything to carry my soup in so
I had to use my helmet liner for this ladle of soup that they gave us and it
was ugly tasting but it was liquid and I remember using my helmet liner to get
my soup in and I still had some of the bread. That’s the only food I ever
had, that I ever had…>>So you get the idea by searching on specific
terms and putting them next to each other and intersecting them, you get clips of
video that are more pertinent to the area that you might want to be looking at. I’m going to show you a different
way of actually looking at testimony through an interface we have
that’s called Eyewitness. This is actually for high school
students and the whole idea behind this is that there’s a new area of digital literacy. So the concept is that there’s a lot of
structure in school for kids to learn to read from books and there’s a lot of structure
in school for kids to learn to write papers. They’re coming out with the standards for kids
to learn to read from the internet and for kids to learn to produce to the internet. And so we’ve been following these digital
literacy standards and intersecting it with the needs of teachers
for tolerance education, bully education to meet various
curriculum standards in their states providing user interface for
them to use there as well as folks interested in studying the Holocaust history. So what a teacher will do at a very simple
level is they may just go in and look at pre-defined clips about specific topics.>>Recording: Apparently knocked on the door and
they took us in and they were religious people. They were some sort of Evangelical
sect and they took us…>>I won’t play this whole clip but basically
this is where she’s talking about her experience as a child there and as a hidden child. So we what we have is also links at the
bottom to content in the U.S. Holocaust Museum and Encyclopedia that talks
about hidden children so that there’s more context
to what they’re looking at. Then we start to get a bit fancier. We give the teachers for their students
the ability to create their own web pages and activities on — for their
various — their different curriculum. So here they’re creating — the teachers
created a set of pages on this particular poem and testimony around this poem so the
kids will start to answer questions. And then as they go through, they’ll get more
and more content that the teacher wants them to either watch or answer questions on. And when they’ve done — going
through all their content, they’ll start searching for
content on their own. We teach them how to use the
search engines you saw before and bring their own clips of information back. And when they’re finished doing that, we
ask them to go to a video editor and start to build their own projects together. So I’m launching the video editor here
that they get and basically what we do, once they’ve searched on certain videos
they’re able to take those videos, drag them down into a timeline and start
building either through videos, music, various special effects or adding their
own video themselves or their own audio. They’re able to build up their own project
and I’ll go play one that a student created in a little bit so you can
see what that looks like. And once they’re done with that, they’ll go and they’ll have Facebook-like
communication with the rest of their class. So let me go show you what a
15-year-old student made… [ Background Noise ] …that I thought was pretty
good for a 15-year-old. [ Background Noise ] And these connections pop up…>>[Recording] [Background music]
When leaving comments make sure that your words are respectful and helpful. Never use hurtful language and put-downs. Your actions and words in the online
world should represent the person you are in the real world. In this way, you’ll be practicing
good digital citizenship.>>Again, for high school students we preach
to them a little bit through these clips as they start to move through the content but
it becomes more and more important as we do it. So I’m — we’ll just watch a quick
minute of this [background music]. This is something a 15-year old put together. [ Music ]>>[Recording] [Background music] Resistance
is the attempt to withstand or oppose a force. Many people believe that
resistance is through violence. For Roman Kent, a Holocaust
survivor believes differently. [ Noises ]>>[Background music and speaking] You could
see where she’s struggling putting music over the top of the survivor et cetera, but
she’s doing her own video, uploading it, mixing it and handing it in as
her — as her actual project. [ Inaudible Recording and Background Music ]>>Education and…>>Again, that’s the aspects
of digital literacy. They’re learning how to mix together content,
edit it, put it out there in a way that — that — that pushes their message. And so there’s a few things that
we do to help them actually. We create what we call scaffolding or these
videos that teach students about what it is that they’re trying to put together. So I’ll play the first minute of a video
that we provide for the students called, “Ethical Editing” and it talks about
how to edit as a good digital citizen. [ Background Noise ] [ Music ]>>[Recording] [Background music] To make
the most of your experience with Eyewitness, it’s important that you have an understanding
of the basic concepts of video editing. It will also serve as a foundation
to translate that knowledge into an ethical use of those tools. In its most basic sense, editing is a process
of selection, recombination and juxtaposition. Those are the fundamental processes
by which editing transforms, you know, source video material into your own work. Selection is simply the idea that from
a larger piece of source material, you are making discrete choices about which
part of that material you want to work with. The next concept is that of recombination. So that’s the idea that once you’ve made
these selections, you might choose to put them in an order that’s different from the order
they occurred in the original source material. You might take those pieces and put
them in a different order than that in which they originally occurred. The next important concept
is that of juxtaposition. The way you understand one
clip is very much dependent on what you saw before it
and what you see after it. So juxtaposition is just a way of
describing the decisions you’re making about the relationship between different clips. The choices you make about how those different
pieces fit together is really what gives them meaning. You can really make anything mean
anything you want it to mean. So when you’re editing, it’s very important
to always be respectful and conscientious of the original material as an…>>And I’ll stop there but
you get the basic idea there. I’ll show you a little bit of, “What is Search?” We — again, these are aimed at high school
students but it will give you some idea of some of the other kinds of videos
[background music] we put together. [ Music ]>>[Recording] [Background music]
Google, [laughter] like it — it like…>>They [inaudible] — well first they
have like the stuff that [chuckle]…>>[Background music] I think it would go like
probably a satellite and then do something and use some weird math equation or
waterfall algorithm and like get the answer.>>They come flying in on little
envelopes [laughter] all the way through it and end up on my screen in
the form of words [laughter].>>So what exactly happens when you
hit the search button in Google? It’s really quite complicated.>>He doesn’t know.>>That might be true. It’s a trade secret. There are a few things we do know, though. For example, did you know that when you click
that Search button, you’re actually searching through over 1 trillion web
pages on the internet?>>Wrong, it’s the indices that get searched.>>[Background music] Okay. So let’s explain indexing. Say you were looking for a
particular person in a book. Would you start flipping through all the
pages of the book to find this person? What you do is you go to the end
of the book and look in the index and try to find your person there.>>Don’t make it so complicated
— spiders, index, internet.>>Okay. So what are spiders? Spiders are software applications
that actually crawl the internet; go onto different web pages
and create these indexes. The search key words you’re looking for might
exist on thousands, even millions of web pages. How does Google know what
results to return to you first?>>They’re smarter than you.>>The answer we’re looking for is PageRank.>>Okay. But that’s just one of the ways.>>So what is PageRank? PageRank is Google’s formula of ranking
web pages based on how many other web pages on the internet have linked to them. The higher the number of links,
the higher the PageRank.>>Correct. [ Music ]>>So all these basic concepts, how to
search, how to edit, how to do it ethically, these are things that aren’t actually taught in
school right now and so we’re having to wrap all of these up into the actual applications that
we release for the tolerance education efforts that we’re doing at the high school level. But that’s all part of doing
the digital literacy. So again, what we’re doing here at Dartmouth
though is the Visual History Archive, which is currently available at — at 41,000 institutions and it is a
larger version of what I just showed you. What I showed you had only 1,100 testimonies’ worth
of video and this one will have 52,000 — all 52,000 and you’ll have researchers
who’ll want to come to Dartmouth who — to be able to get access to it, you’ll have
researchers at Dartmouth who will be able to create projects of testimony and be
able to use them in their classrooms. And we find that there’s lots
of different uses of that. But basically, all of these — this
content is going to be accessed over Internet2, to a local cache here. We actually have our own caching
environment that we built. By caching, it means there will
be a small percentage of the video that will be here locally so that it absolutely,
positively will play in the classrooms. And professors will be able to lock that
content there and guarantee that when they want to play those testimonies, it won’t be
interfered with by the internet or anything else because it will be local for — for those
professors to use here at Dartmouth. These are the places that
currently have access to the archive;
as soon as we get things up and going. Again, USC was actually the
first place and University of Michigan being the second
that started using the archive. What we see when folks at the various
universities start to use the archive and there’s 325 courses being
taught with it at the moment, is that researchers spend
about an average of 49 minutes. The top thousand users are spending an average
of 55 minutes each time they log into it to search through and watch content. We’ve got — had 240,000
searches to date and 35,000 of the 52,000 testimonies have been watched
by researchers doing various projects. So that’s about six or seven years’ worth
of material that’s been watched so far. Dissertations are growing, 35 so far and
articles and books published, we’re at 73. One of the struggles that we have is we’re
constantly trying to expose the content, get it out there, get people
to learn from it and use it. But there’s a duty of care here where the
privacy of the survivors, the potential misuse, intellectual property that
may exist in the content, those things pull against actual access. So we are always struggling against that,
doing everything we can to get the content out there, at the same time securing it. Obviously to secure the content, you never
show it and — but that’s not acceptable. But putting it out there exposes the
people who give their testimony to danger and we’ve had issues like that like, you
know, all of the genocides have their version of deniers whether it’s like with Armenia
where it’s the whole country of Turkey or it’s the Holocaust with
the Holocaust deniers. We get people who are rabid about it but
actually if you ask the survivors about it, they’re like, “We survived the real thing. What are these guys going to do to us?” So it’s really just about the people managing it
and how comfortable we are moving their material out there and we’re trying to move as much as
we can in as responsible a way as possible. I’m going to change hats for a second from the
Shoah Foundation over to the Digital Repository. Basically one of the things USC decided was that
when we brought the technology in to be able to care for the Shoah Foundation Cloud Archive, we were going to also make it
available for other collections. So what we offer up as a service
to institutions around the world, universities’ collections are
digitization services, cataloguing services, digital preservation, digital library
access and file server services. So for instance, if you’ve got a
large collection of video tapes and you don’t have the money to buy the
systems to digitize them yourselves, but you want them to be academically
accessed, you can come to us and we’ll put them through our robotic systems
and digitize them for you. We will also go through and provide
cataloguing of materials for people. This is sort of the Cloud Archiving model where
we do this work for others and so we have armies of students that will go through and
catalogue material as we get new collections. We also will preserve the material
using the same preservation systems that are used for the Shoah Foundation. This becomes important for researchers
who have things like data management plans from the National Science Foundation, the
NIH, where they have to show what happens to their data after their
program is actually completed. So we offer 20-year models and other
storage models for preservation for them so they can actually store and say their
stuff’s preserved when they get their grants. We’ll also provide web sites. We’ll do the kind of web interfaces that
you saw but for other kinds of collections and we have everything from a group on Albanian
Human Rights to a studio like Warner Brothers to — we even have a group that has original
film from the ’50s of surfing around the world and we’ll make that available as well as
they want it to — for various researchers. That’s more research they think in California than in other places [laughter]
but I don’t know that for sure. We — we’re offering file
server services by bringing in cloud storage from various companies. What that means is that from Nirvanix and EMC
with Isilon, we’re able to offer up storage over the high-speed networks on disk
that’s attached to our supercomputer. We charge $70 per terabyte per
month which is what you’ll find if you’re actually accessing the material. It’s much cheaper than Amazon or any of those. And then for the folks that have the
grants that need to step away for — for 20 years from the — need to step
away from the content after a few years and show that it’s being cared for. For any researcher in the world,
we will give them a 20-year price so they can per terabyte give $1,000 as
part of their grant and then we’ll ensure that that content’s there for 20 years
bit-accurate through the USC Libraries. And then it will go into the — the USC
Libraries’ regular selection process after that. We offer many different technical
protocols for accessing that. We — we’re also able to tie into most
of the commercial products out there that do archiving such as Symantec or CommVault. So if — if researchers are storing or archiving
their things using commercial off-the-shelf products, that will just sync
right up to our cloud storage and we’re also offering Dropbox-like
functionality for researchers who want to be able to look at their content on
iPhones, iPads, Macs, PCs all at the same time. For data security, if you’re in the medical
area, you’ll recognize these things as some of the requirements for HIPAA compliance. We have HIPAA-compliant storage
for storing that. The SSAE 16 Type 2 Certified Compliance
is probably the biggest area there. I’d be happy to talk to anyone
who’s interested in that and how we maintain HIPAA
compliance as a cloud service. Our target customers again, for this, are researchers who have grants
like NIH and NSF and NEH grants. Small archives, we hear from
thousands of small archives that don’t have the money to
preserve their own content. There’s a number thrown out there, I don’t
know how accurate it is at all but accounts on research libraries is estimating 3% of the usable scholarly archives
in the world has been digitized. So whether they’re right or wrong or off
by a factor of 10, we expect a tsunami wave of content that needs to get digitized
from all of these small collections that researchers have collected
over the years around the world. Again, this example of the Albanian Human
Rights Testimonies we just got was an example and we’re — instead of having to pay for
their own architecture or figure out how to do it themselves, they can just give us
the content and we’ll process it and put it at the USC Libraries and make
it available for researchers. We’re also doing that for large archives so
like the Shoah Foundation or PBS, et cetera, you’d be able to come to us and instead
of building your own infrastructure, we process the materials and make them
available through the USC Libraries. And then we have commercial entities like the
Academy of Motion Picture Arts and Sciences and Warner Brothers, et cetera that come to
us and say, “Will you archive our materials?” and their content’s of pedagogic
interest, we have a big film school. You guys have one too and so people like to
know that that stuff’s being preserved and in — and in special ways can be
available for faculty and students. These are just some of the initial customers
of the Digital Repository that we have. A lot of them are USC-based. We’re just starting to expand
out to other universities. We’d love to have Dartmouth start talking
to us, at least even about backing up things like various histories that
they may have, et cetera. And that is our cloud library for USC. [ Pause ] And that’s what I have. [ Applause ] Are there any questions?>>Who finances all of this?>>So USC finances it now. In the beginning the Shoah Foundation was — a
large portion was financed by Steven Spielberg and now we’re actually in a large
endowment campaign for the Shoah Foundation. For the Digital Repository, it’s
financed as the USC Libraries is. Most libraries have collections budgets. I know USC’s around $14 million. I don’t know what it is at Dartmouth
but this is sort of the future of how library collections
are going, not just books. But have interesting collections that
are digital and be able to curate them and supply them to faculty and
researchers is becoming important. So people pay to actually use the service
as well as getting it funded through grants. Yes?>>[Background noise] How
do you prepare interviewers?>>How do we prepare interviewers? So interview — we have a pretty
extensive training program for interviewers and when we were doing the mass Shoah Foundation
ones, we would have them all around the world and all of the various countries
that we were in. Now that we’re doing less in the
U.S., we’ll bring them to the Shoah Foundation. But actually what we’re doing for Rwanda and Cambodia is we’re partnering
with non-profits in those areas. For instance in Rwanda, we
partner with a group called Ibuka and what we’re doing is training local
interviewers and videographers through Ibuka to do the Rwandan collection
and testimonies and interviews.>>Do you have a package?>>We do and we actually post a lot of
that on our web site to see how we do that.>>I noticed that you signed
into the Eyewitness. And for an educator, does that mean that
they would have to pay a certain charge to access the activities and the programs
that you offer for faculty and students?>>Eyewitness is completely free. It’s been funded through grants but what we do
for the log-ins is that a teacher comes to us, proves they’re a teacher, gets a log-in and then
they can invite their students in so that they — they work in their own little sandbox. Right now, I’m pretending I’m a teacher
with a lot of students when I log in. Yes?>>With regard to digitization and preservation,
are there other comparable organizations that are — that are doing the same — same job?>>There’s a lot that are doing similar things. Preservation’s a big deal. From a university perspective, I don’t know. We’ve been wagged around at USC by
the Shoah Foundation collection, so we have to eat our own cooking there. And so, I haven’t seen anyone do
all the parts that we’re doing but I’ve seen a lot doing some of the parts. You’ll see at University of Texas and all the
Texas schools, they’ve got an enormous amount of funding during the Bush Administration
to build out a digital library for all the Texas institutions to store
their stuff and they’re doing some of this. University of Illinois and also UC-San
Diego has got some storage systems in place but they’re not doing the 20-year bit
preservation part of that right now. The other area that’s doing many of
these pieces is the Library of Congress. A lot of this came from the work done
for the Audio-Visual Conservation Center, which was a $250 million facility that
Dave Packard of Hewlett Packard funded to digitize everything that’s copyrighted
in the U.S., that’s film, video, or audio. And so actually our systems
are compliant with that. Since there aren’t really standards in this area
and there’s not a lot of people doing it yet, we figured if we did what the
Library of Congress did that would be as close as we could get at the moment. Yes?>>What do you do about accusations of fraud
not by deniers, but serious people who say, “No that person wasn’t there,
wasn’t doing that or is making it up?”>>Let them do it [chuckle]. They’ll make their accusations. And you know, this is memory, this is not — this
is one of the actual areas where there’s a lot of challenges in terms of using video and
audiovisual testimony as primary sources. Human beings make mistakes. I have trouble remembering 10 years ago, God
forbid there’s a name or something I need to remember and these people are
remembering events from 50, 60 years ago. And so they get it wrong
sometimes, they misremember things. Do they — some of them for their own personal
reasons may make certain events sound more grand or less grand than they are, but
they’re human beings and that’s one of the challenges of using personal memory. But that’s also one of the great things
about having volume where you can have a lot of data points to look over as you’re doing it. With all the economics — the folks from the
Economics Department and the department you know about using data points to
smooth some of that out.>>So you’re using — you’re using
students to do the cataloging. Is there, in the future, computer
reference to do that instead of having…?>>Yeah, we spent a lot of time and
I’m going to bring up a website. [ Noises ] The question is do we — is there
— do we spend a lot of time — are we looking for automated systems
is what you were asking, I think. And so we got an $8 million grant from
the National Science Foundation to look at using speech recognition and compare
it to what we’re doing manually. The answer is no and I will
now give you the long answer. Basically, there’s three parts
to being able to do recognition. There’s what’s called the recognition aspect,
what are the actual words that are said. Then there’s language processing. If two people say the same thing just a little bit
differently, can you map to that idea correctly so that we query on the idea once,
but you get both results back. And then there’s the information
retrieval engines, which bring the information
back once you do that. The speech recognition, when
we started, was horrible. It was down around 15% and the things
that were breaking it were people — you know, if someone’s not speaking their
original language, if they get emotional, if they switch languages, if they’re elderly, all those things tend to
break speech recognition. And one of the reasons the NSF liked our
dataset is we had all that with the survivors. And so we were testing against that. We got up to actually 65% accuracy on the
recognition, but that still wasn’t enough to get the language processing where
someone says something a little different than someone else to be able to work before
it gets to the information retrieval engine. So it still flatlines a bit,
although we’ve had great success in the Czech Republic out of Charles University. We got someone to get to 85%
accuracy on Czech testimonies only. We’re trying to raise $12 million
for them so that they can try and get their technique working
on other languages right now. But it’s not there; it’s in
its very beginning stages. For now what we say is people are best at
it and we try and optimize around them. For instance, a transcript takes 12
hours for every hour of video to process. It only takes us two hours for
every hour of video to process with the techniques that
we use to get it through. So for the Shoah Foundation, it was $25
million to catalog it instead of $135 million.>>I noticed that you were doing the
search in English and you were coming up with videos in other languages. Are the testimonies translated at
all for access to other viewers?>>No, the testimonies are not translated
at all for access to the other viewers. The question was are they and the answer’s no. What we’ve done is if the
testimony was in Russian, they listen in Russian and
they index in English. So we have an index over all the testimonies
in English and then if someone wants to query in Russian we just translate the keywords
themselves for doing those queries. But we don’t have any vehicle for them to listen
in Russian and have it come out in English yet automatically except that we can say for
each language even where we have something like Italian where we only have 400
interviews, that’s still 1200 hours of video, which is plenty for let’s say an Italian
school system to be able to build the same kind of programs we’re doing for high
schools in the United States.>>Are you getting more information
from the users? In other words, are there
ways in which the people who use the collection can actually contribute
to the collection and maybe even fact check some of the oral histories or in any
case provide additional information that can be part of the metadata?>>Yeah, so the idea of using
voluntary metadata, you know, input of metadata from volunteers I
think is what you’re asking about. And we have a couple of issues there,
you sort of get what you pay for and you’re not quite sure
about the quality of it. The other is we tend to have a
really rabid denier community across all of these different collections and they’re
usually the first and the ones to cut and paste to put everything in there
and things get skewed. Not super quickly, but more
quickly than you’d like. So we’ve been looking at ways to
have an authenticated set of tags and then an unauthenticated set of tags. And we’re playing with that. That’s one of the areas that
we’re trying to figure out how that would be useful to researchers. [ Background Noise ]>>Thank you, Sam. [ Applause ]>>Thank you for inviting me.
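
The cross-language search described in the Q&A (every testimony indexed with English keywords regardless of the language spoken, with only the query keywords translated) can be illustrated with a minimal Python sketch. Everything here is hypothetical and invented for illustration: the sample testimonies, the `QUERY_TRANSLATIONS` table, and the function names are assumptions, not the Shoah Foundation's actual system or vocabulary.

```python
from collections import defaultdict

# Hypothetical translation table: query keyword in another language -> English index term.
QUERY_TRANSLATIONS = {
    "ru": {"гетто": "ghetto", "освобождение": "liberation"},
}


def build_index(testimonies):
    """Map each English index keyword to the set of testimony IDs tagged with it."""
    index = defaultdict(set)
    for testimony_id, record in testimonies.items():
        for keyword in record["keywords"]:
            index[keyword].add(testimony_id)
    return index


def search(index, query_terms, query_lang="en"):
    """Translate query terms to English index terms, then return testimonies matching all of them."""
    translations = QUERY_TRANSLATIONS.get(query_lang, {})
    english_terms = [translations.get(term, term) for term in query_terms]
    matches = [index.get(term, set()) for term in english_terms]
    return set.intersection(*matches) if matches else set()


# Made-up sample data: spoken language varies, index keywords are always English.
testimonies = {
    "t1": {"language": "ru", "keywords": ["ghetto", "liberation"]},
    "t2": {"language": "it", "keywords": ["liberation"]},
    "t3": {"language": "en", "keywords": ["ghetto"]},
}
index = build_index(testimonies)

russian_hits = search(index, ["освобождение"], query_lang="ru")
english_hits = search(index, ["ghetto", "liberation"])
```

Because only the short keyword list is translated, a Russian query reaches testimonies in any spoken language without the video itself being translated, which matches the trade-off described in the answer.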