Mining the Old Bailey

This week on DITA we learnt about data mining and used the API Demonstrator to try it out, as well as do some text analysis. I was excited to use the Old Bailey API to look at transcripts of some of my favourite cases. That is, until I realised all my favourite cases occurred after 1913, the last year the archive covers. Nevertheless, I decided to search for crimes involving secondary participation.

First, I used the general search bar on the website. Since secondary participation is a more recent term, I used keywords such as ‘aiding’ and ‘abetting’. I was also particularly interested in murder cases, so I added that to the search, and it returned 172 results.

I then moved on to the API Demonstrator, using it to find out how many killings were committed by women who were then found guilty. It’s pretty interesting because it lets you break the results down by subcategory and other aspects of the trial. It’s also interesting to see how similar its format is to that of text analysis tools such as Voyant.
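
For anyone curious what that kind of breakdown looks like underneath, here is a minimal sketch in Python. It does not call the Old Bailey API at all: the records are made up, and the field names (offence, subcategory, gender, verdict) are my own assumptions rather than the site's real schema. It only illustrates the idea of filtering on several conditions and then counting by subcategory.

```python
from collections import Counter

# Made-up sample records. The field names are hypothetical and are NOT
# taken from the Old Bailey API; they only mimic the kind of categories
# the API Demonstrator lets you filter on.
trials = [
    {"offence": "killing", "subcategory": "murder", "gender": "female", "verdict": "guilty"},
    {"offence": "killing", "subcategory": "manslaughter", "gender": "female", "verdict": "guilty"},
    {"offence": "killing", "subcategory": "murder", "gender": "male", "verdict": "guilty"},
    {"offence": "killing", "subcategory": "infanticide", "gender": "female", "verdict": "notGuilty"},
    {"offence": "theft", "subcategory": "highwayRobbery", "gender": "female", "verdict": "guilty"},
    {"offence": "killing", "subcategory": "murder", "gender": "female", "verdict": "guilty"},
]


def filter_trials(records, **criteria):
    """Keep only the records that match every field=value pair given."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def break_down(records, field):
    """Count how many records fall under each value of one field."""
    return Counter(r[field] for r in records)


# Killings, committed by women, found guilty, broken down by subcategory.
matches = filter_trials(trials, offence="killing", gender="female", verdict="guilty")
breakdown = break_down(matches, "subcategory")
print(breakdown)
```

Adding or removing a keyword argument in filter_trials is the equivalent of ticking or unticking a search condition on the site.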

I later drilled down to the four trials concerning highway robbery and tried to export the four results, but it didn’t work immediately. I tried again when I got home, when I thought there would be less traffic, and it worked!

Overall, the site worked fairly well despite the troubles with exporting the data to Voyant. I feel this would be a particularly helpful tool for people using cases for reference: if people wanted to go through the old law, they could take old cases, run them through Voyant and analyse how often certain details of a case crop up, such as who heard it, as well as how often legal concepts such as defences were relevant. Part of me wishes I had known about things such as distant reading, text analysis and coding while I was still doing law… It might have been easier for me to sift through and find cases relevant to my research.

While I was waiting for the API Demonstrator to export my data to Voyant, I decided to move on to the Utrecht University Digital Humanities Lab and read about the ABO (Annotated Books Online) project. It aims to understand how readers in the past used their books by looking at what they annotated in them, and it holds about 60 copies of annotated books.

The search function allows you to search by specific terms such as language.


It searches the text by annotations and gives translations and additional notes for them. To test how it worked, I first searched for an earlier annotation that had the word ‘deer’ in it. Unlike the Old Bailey, it highlights the annotation on the actual scanned document, possibly because it is important to understand the context in which the annotation was made. It offers fewer search conditions than the Old Bailey, but again this could be because of the type of data each of them collects. For the Old Bailey it may be more useful to sift through the data by categories such as verdict, offence, defendant’s details, etc., whereas it may be harder to find any way to categorise annotations other than by language and the like.


For some reason transcriptions are not on all the pages: either it is not something they aim to do, or they haven’t transcribed those annotations yet. You can view several books at the same time, which is a nice touch, but it is quite frustrating that it doesn’t point out where the annotations are. I wish there were an option to read through only the texts that have already been transcribed or annotated, but hopefully that will come in the future when more has been done on the project! It’s an interesting project in context, and I look forward to testing out data mining on it once more of its information has been digitised!


Cloud Watching

This week on DITA we learnt about distant reading and text analysis and used various online tools to analyse text.

Distant reading is a form of reading where, instead of focusing on an in-depth analysis of one text, many texts are analysed together as a dataset to understand them all. Text analysis is a form of distant reading that looks through large amounts of text for how frequently words appear, the patterns within the text and how often words are used in a particular context. There are various tools that can be used for text analysis, and in our lab we tried out just a few to generate text clouds. I made mine from an Altmetric report on how often Library and Information Science articles about gender were tweeted.

The first one is Wordle, which is a simple word cloud generator. It gives people the option of changing visuals such as font and colour, as well as the number of words used in the cloud. At most, it can only generate a visual of the words.


The next one was Many-Eyes, which offers people a few more ways to visualise data besides word clouds, including pie charts and graphs. However, as much as I wanted to make a word cloud with this one too, it took a long time to get it to visualise one without crashing. In terms of abilities I find it pretty similar to Wordle, with the added choice of different forms of visualisation. It still sorts through the text by frequency of appearance or alphabetically.


The final one, and my personal favourite, is Voyant. Voyant not only generates a word cloud but also offers many other tools, such as editing stop words so you can exclude words that you feel are irrelevant, and seeing the number of times each word appears in the text.
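
The counting behind a word cloud with stop words is simple enough to sketch in Python. This is not how Voyant itself is implemented, just a minimal illustration of the idea; the stop word list and the sample sentence are invented for the example.

```python
import re
from collections import Counter

# A tiny, made-up stop word list. Real tools ship much longer ones and,
# like Voyant, let you edit them.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "on", "to", "is", "are", "rt"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into words and count the ones
    that are not in the stop word list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stop_words)


# Invented sample text, standing in for the tweets in the report.
sample = "Gender and science: the gender gap in library science is a gap in the data"
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

A word cloud then just draws each remaining word at a size that follows its count, so adding a word to STOP_WORDS is what makes it vanish from the cloud.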


Not only that, the user is able to pick and observe specific words. For example, if I wanted to know how often science is a subject in the tweets, it can highlight and show where the word occurs in the text as well as the context around it. It can also compare words on a chart: I compared ‘science’ with ‘internet’ as a way to see how often they appear together and where. Overall it is a more effective tool for detailed text analysis than the other two.
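
Looking up a word together with the words around it is usually called keyword in context. Here is a rough Python sketch of the idea, under my own assumptions: the window size of three words and the sample text are made up, and Voyant's own version is certainly more sophisticated.

```python
import re


def keyword_in_context(text, keyword, window=3):
    """Return (left, keyword, right) for every occurrence of keyword,
    where left and right hold up to `window` neighbouring words."""
    words = re.findall(r"\w+", text.lower())
    hits = []
    for i, word in enumerate(words):
        if word == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            hits.append((left, word, right))
    return hits


# Invented sample text.
sample = "Women in science face a gap. Internet access shapes how science is shared online."
hits = keyword_in_context(sample, "science")
for left, word, right in hits:
    print(f"{left} [{word}] {right}")
```

Running the same function for ‘internet’ and comparing where the hits fall gives a crude version of the side-by-side comparison described above.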



This week on DITA we covered altmetrics, which measure the impact of articles and other scholarly documents. There are a few tools available to help understand and observe this impact, and the one we used during lab was Altmetric, which measures the amount of online attention an article or dataset (with a DOI) gets on social media platforms, literature reviews, news outlets and reference managers. It does this by using APIs to track the number of times the item has been referenced or linked to by particular websites.

The way Altmetric works is that a person can view the number of times the blog has been linked to from other sites, and it shows which sites and readers it has been viewed from.

Altmetric compiles all the information on the attention an article receives and gives it a score based on that attention. Each type of website that links to the article is given a different weighting, with Facebook the lowest at 0.25 and news the highest at 8. Where possible, Altmetric will also look into each mention to gauge the importance of the source, how many people it may reach and any bias it may have.
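
To make the weighting concrete, here is a very simplified sketch in Python. It only uses the two weights mentioned above (0.25 for Facebook, 8 for news); the mention counts are invented, and the real Altmetric score also adjusts for things like the importance and reach of each source, which this ignores completely.

```python
# Weight per mention. Only these two values come from the post; other
# source types have their own weights, which are deliberately left out.
WEIGHTS = {"facebook": 0.25, "news": 8.0}


def attention_score(mentions, weights=WEIGHTS):
    """Simplified weighted sum: number of mentions from each source
    multiplied by that source's weight. Not the real Altmetric formula."""
    unknown = set(mentions) - set(weights)
    if unknown:
        raise KeyError(f"no weight known for: {sorted(unknown)}")
    return sum(weights[source] * count for source, count in mentions.items())


# Invented counts: 4 Facebook posts and 2 news stories.
example_mentions = {"facebook": 4, "news": 2}
score = attention_score(example_mentions)
print(score)  # 4 * 0.25 + 2 * 8 = 17.0
```

Even in this toy version, a single news story counts for as much as 32 Facebook posts, which shows how much the choice of weights shapes the final number.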

It also shows the demographics of the readers viewing the article, both by geography and by type of reader (member of the public, scientist, science communicator, practitioner). The type of reader is worked out by looking at keywords in their profile description, and geographic location is found using geolocators.

From this I understand how altmetrics can benefit people who want to know more about the quality of the article or the reception it receives from the public. Unlike citations which only show which journals cite the article, altmetrics can show a greater view of the impact including page views, downloads and more.

However, I feel that Altmetric doesn’t always give enough information to determine the quality of an article. It doesn’t show whether the attention towards the article is positive or negative, nor can it tell us anything about the actual validity of the article. Geolocation can only be used when people allow their location to be known, and on Twitter that is only 1% of users. It also doesn’t show us anything about the quality of the researchers using it, and comparing newer articles with older ones is difficult when older articles have had more time to receive attention.

I believe altmetrics are useful for finding out more about the impact of an article; however, there is still so much that they cannot tell us, and there are few tools available to help find out such information at the moment.

Reading Week Recap

Not DITA per se but lots of interesting things happened this week that I felt had to be noted.


On Tuesday I went to the British Library with Fengchun. We visited the Treasures of the British Library Exhibition first, where I learnt a lot about printing in various countries in different eras, which was very interesting. I am also incredibly in love with the amount of detail that goes into the writing and typography of these documents and how well preserved they still are despite their age, especially the Qur’ans!! I love seeing the different texts from different countries and how they chose to lay out their content, from the old English bibles to the long folded stories of Vietnam. It was a worthwhile exhibition and I would go again just to ogle them all again!!

The Gothic Imagination Exhibition was amazing; I would also go to that again if I could (I am only disappointed there wasn’t more information on werewolves). I couldn’t get enough of the detail in Fuseli’s artwork, especially his Hamlet piece. I think the most interesting part was hanging around the Frankenstein part of the exhibition, where you see a lot of people commenting on why Mary Shelley chose to publish anonymously. I overheard some people pointing out how John William (1819) was mentioned before Mary Shelley (1818) and questioning whether it was a subconscious decision to arrange it male before female rather than chronologically. I found this interesting to note, especially because there are visitors who are conscious of this fact. How much of history is structured by stereotypes or subconscious associations?

I liked to watch the progression of gothic literature from its beginnings where it seems mostly supernatural and where setting played a huge role to people admiring monsters and exploring the monster itself in more depth. I personally find the most chilling stories are those where the monster is the human itself such as Dr. Jekyll and Mr. Hyde and The Tell-Tale Heart.

I may go again at a later date to record more information as I wasn’t able to this time.


On Wednesday I attended the KO Goes Mobile event about Knowledge Management and Mobile Technology. There were lots of great talks that I enjoyed, and I’ll summarise each of them below. Time passed so quickly, and I could only lament the fact that I wasn’t able to stay for the last talk!!

Organising Information For Use on the Touchpad – A person from Touchpress Ltd came to talk about the power of iPad apps in providing information for the public. He showed us how they are essentially interactive books that can help with learning, especially for people who find it difficult to grasp certain concepts (such as the Molecules book, where kids can visualise the bonds and how they move and change). I think this is an amazing tool that would be incredibly beneficial for education, and I wonder how we would be able to introduce iPads or tablet apps to schools to help with students’ learning (and how can we introduce them into libraries?).

Digirati – This talked about web publishing and how to make it accessible on mobile devices. What I took from it was the question whether it is necessary to optimise all web content for mobile devices. I’ve realised it may not always be the case and even if people do want to access it, they do not need an optimal format when rushed or will prefer to use bigger devices that can view the content better.

Bring your own…Everything! – By far my favourite talk, it was about how using mobile technology can improve productivity in organisations, and I admired Sharon Richardson incredibly for her work. She made points such as how simplicity of devices may be preferable to features, as it is easier to teach, learn and share. Mobile devices help connect people and make them more efficient at making decisions thanks to notifications, which allow people to respond as soon as they see the alert. It may even lead to workspaces being restructured into places where people can interact with one another and hold more meetings, and to staff being trained to filter and prioritise work. I think a lot can be taken from this talk on how organisations can be restructured to accommodate mobile technology and encourage people to be more willing to use technology, interact and share, which are important traits to have in employees.

Of Birds and Butterflies

All my life I’ve loved all kinds of animals but the creatures that hold the biggest space in my heart are insects.

This may be a bit unusual seeing as most people are quite put off by them, but I personally find them fascinating and beautiful creatures, small yet complex compared to vertebrates. That is why, when Ernesto mentioned that tweets were like butterflies in this week’s DITA lesson, the concept resonated with me so strongly. It opened my eyes to viewing tweets in a completely different and almost natural way.

This week we created an app to collect and archive tweets using keywords and hashtags. We used TAGS, an application created by Martin Hawksey that uses the Twitter Search API, to compile all the tweets including the tag #citylis. The exercise taught me how data could be visualised, and I learnt several things from it, such as who the top tweeters were and the subjects related to them. It’s interesting to see how data could be generated using apps, and I wonder how it could be used to aid further research.
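
Once the tweets are archived, finding top tweeters and related subjects is mostly a matter of counting. Below is a small Python sketch of that step. The tweets are invented, and the column names from_user and text are my assumption about how the archive is laid out, so treat it as an illustration rather than a recipe for TAGS itself.

```python
import re
from collections import Counter

# Invented tweets. The keys "from_user" and "text" are assumed column
# names for the archive and may differ in a real export.
tweets = [
    {"from_user": "alice", "text": "Word clouds in the #citylis lab today #dita"},
    {"from_user": "bob", "text": "Reading about altmetrics for #citylis #dita"},
    {"from_user": "alice", "text": "Tweets are like butterflies #citylis"},
    {"from_user": "carol", "text": "#CityLIS reading week at the British Library #bl"},
]


def top_tweeters(tweets, n=3):
    """Return the n users with the most tweets in the archive."""
    return Counter(t["from_user"] for t in tweets).most_common(n)


def related_hashtags(tweets, exclude="#citylis"):
    """Count every hashtag other than the one the archive was built on."""
    tags = Counter()
    for t in tweets:
        for tag in re.findall(r"#\w+", t["text"].lower()):
            if tag != exclude:
                tags[tag] += 1
    return tags


print(top_tweeters(tweets, 1))
print(related_hashtags(tweets))
```

Hashtags are lowercased before counting so that #CityLIS and #citylis are treated as the same tag.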

When I read Ernesto’s articles on Twitter being used for public evidence and on archiving and storing data sets, I started to understand the importance of Twitter data being used in research, especially in understanding today’s trends. Tweets are already recognised to contain significant information about today’s culture, and even the Library of Congress now holds an archive of tweets from 2006-2010. As for the ethics of using Twitter data, I find that scientists should have the right to use data made available to the public, as they would in any public situation where data is gathered. I think Caitlin M. Rivers’s report on ethical standards when using big data does outline how we should treat datasets while respecting privacy, and it works just as most ethical standards for scientific research would.

Tweets are like butterflies in more ways than one: they’re small, numerous and contain huge amounts of data important for study, and unless scientists are able to collect samples they cannot be expected to learn more about the subject being studied. Twitter data has proven its use in past studies, such as the one by JISC relating to the London riots. I find this a particularly interesting example, as it used the data collected from social media to dismiss the original ideas about how social media had been used. It’s important that we study social network data when trying to understand today’s culture, as it plays such an integral part in people’s lives that it would be irrational not to take it into account.

To conclude, I’ve decided to rename this blog to lepidoterans, after the taxonomic order of butterflies and moths, to respect this metaphor which joins together two things that I love dearly. I think that if tweets are butterflies then we are the information entomologists that study them!