Exploring the Art World

using Bulk Data Collection, Natural Language Processing and Predictive Analytics

Introduction

Art, particularly in the expanded definition of art that characterizes the contemporary, is the product of agency acting on subjective feeling. It can be individual or collective, painted or conceptual, historical or yet to come. However, the openness of such an idea has little to do with the everyday machinations of the economic and social sphere that brings such work from "artists" to a "public".

For those not in the business of curating, criticizing, or selling artwork, this world is regarded as something opaque. This is partially by nature, as the subjective taste of one person differs from that of another, particularly those with the financial resources to spend on art. It is also by design, however, with taste often being set by considerations more related to the artist's marketability than to cultural movements.

For this reason, I decided to investigate the relationship between the cycles of publicity in arts and culture writing, and the events which surround artists, as well as those with a strong impact on visual culture.

Experiment Design

For this project, articles were scraped from the websites of publications with significant influence on taste in the Contemporary Art World. While parts of this corpus dated to the 1960s, most of the articles were from the 21st century. The articles were collected alongside information such as date of publishing, source, and where available, metadata such as author, title, image links and tags.

After the collection phase, bulk analysis of the articles was performed. First, the articles were cleaned to remove extraneous data such as links and HTML tags. Then they were transliterated into the ASCII character encoding standard to remove accents and diacritical marks, making it possible to combine different spellings of the same name. After this, the unstructured text was further cleaned and smoothed using tools in the NLTK library. Information was then extracted from the articles with natural language processing techniques from multiple Python libraries such as spaCy, NLTK, TextBlob, and Gensim.
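
The first two cleaning steps can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual code: the function name clean_article and the regular expressions are assumptions, and the NLTK smoothing step is left out.

```python
import html
import re
import unicodedata

# Illustrative patterns (assumptions, not the project's own rules).
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_article(raw: str) -> str:
    """Strip markup and links, then fold accented characters to ASCII."""
    text = TAG_RE.sub(" ", raw)    # drop HTML tags (and any href inside them)
    text = html.unescape(text)     # decode entities such as &amp;
    text = URL_RE.sub(" ", text)   # drop bare links left in the text
    # NFKD splits an accented letter into a base letter plus a combining
    # mark; encoding to ASCII with errors="ignore" then discards the mark.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return SPACE_RE.sub(" ", text).strip()


sample = ('<p>Beyonc\u00e9 at <a href="https://example.com">MoMA</a> '
          '&amp; more https://example.com/x</p>')
print(clean_article(sample))
```

One consequence of this approach is that characters with no ASCII decomposition, such as typographic quotation marks, are dropped rather than replaced, so a real pipeline would likely map those explicitly first.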

The project is written in Python. The GitHub repository for scraping these articles is here, and the language processing functions are here.

Entities

First, a pretrained convolutional neural network model from spaCy was used to perform Named Entity Recognition (NER) on the corpus, identifying and classifying the specific things referred to in the unstructured text. To speed up processing, a fast NER model with an accuracy of approximately 85.5% was used.[1] Human accuracy at NER is approximately 93-97%, meaning that roughly 10% of entities, mostly edge cases, were missed.[2] From this output, the entities tagged as names or organizations were retained for the database. These were then matched to the identifying integers of the articles in which they appeared, and the metadata for each article from the articles database was attached to each entity. This resulted in a typical database entry like the one below:

entity_id: 179674 | entity_name: Jonas Staal | article_title: Jonas Staal | article_id: 676198 | source: Artforum | author: Kate Sutton | link: ['https://www.artforum.com/print/reviews/201306/jonas-staal-41396'] | publication_time: 6/1/2013 | filename: 2013_06_01_Jonas_Staal_Artforum.json
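
The step of attaching article metadata to each retained entity might look roughly like the sketch below. It assumes the NER step has already produced (text, label) pairs. The labels PERSON and ORG follow spaCy's naming, but the function build_entity_rows, the tagged list, and the exact dictionary keys are hypothetical and only modeled on the entry above.

```python
from __future__ import annotations

from itertools import count

# Labels kept for the database; PERSON and ORG are spaCy's label names.
KEPT_LABELS = {"PERSON", "ORG"}


def build_entity_rows(article: dict, tagged: list[tuple[str, str]], ids) -> list[dict]:
    """Return one row per retained entity, with the article's metadata attached."""
    rows = []
    for text, label in tagged:
        if label not in KEPT_LABELS:
            continue  # dates, places and other entity types are discarded
        rows.append({
            "entity_id": next(ids),
            "entity_name": text,
            "article_title": article["title"],
            "article_id": article["article_id"],
            "source": article["source"],
            "author": article.get("author"),
            "link": article.get("link"),
            "publication_time": article["publication_time"],
            "filename": article["filename"],
        })
    return rows


# Article metadata taken from the sample entry above.
article = {
    "article_id": 676198,
    "title": "Jonas Staal",
    "source": "Artforum",
    "author": "Kate Sutton",
    "link": ["https://www.artforum.com/print/reviews/201306/jonas-staal-41396"],
    "publication_time": "6/1/2013",
    "filename": "2013_06_01_Jonas_Staal_Artforum.json",
}
# Hypothetical NER output for this article, for illustration only.
tagged = [("Jonas Staal", "PERSON"), ("Artforum", "ORG"), ("June 2013", "DATE")]

rows = build_entity_rows(article, tagged, count(179674))
for row in rows:
    print(row["entity_id"], row["entity_name"], row["source"])
```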

Reference Encyclopedia

For the organizations database, a secondary reference encyclopedia was built from several data sources. First, Wikidata was queried for all items tagged as museums, art galleries, cultural centers, and arts organizations, in any language. This was accomplished with the qwikidata Python library and the Wikidata SPARQL query service.[3][4] The results of this query gave the location, name, and a short description of each institution. These were then cross-referenced with several other datasets. One, from the City of New York, allowed NYC art galleries not already represented to be added to the list.[5] Another, from the Nonprofit Open Data Collective and based on IRS 990 forms, allowed American nonprofits to be added.[6]

The names in the additional sets were checked for duplicates by comparing the partial Levenshtein distance of the names. All name pairs with a Levenshtein similarity greater than 0.85, and with another characteristic (such as location) scoring above 0.8, were then combined, with preference going to the better formatted Open Data and NYC datasets. This yielded an encyclopedic global dataset whose strongest representation is of American arts and culture organizations. This resulted in a typical database entry like the one below:
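
The duplicate check can be illustrated with a small pure-Python sketch. Note the substitution: this uses the plain Levenshtein distance rescaled to a 0 to 1 similarity, not the partial variant mentioned above, and the function names and the choice of "city" as the second characteristic are assumptions. Only the two cutoffs, 0.85 and 0.8, come from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0..1, where 1.0 means identical strings."""
    a, b = a.casefold().strip(), b.casefold().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def is_duplicate(rec_a: dict, rec_b: dict,
                 name_cutoff: float = 0.85, other_cutoff: float = 0.8) -> bool:
    """Treat two records as one when both name and city are close enough."""
    return (similarity(rec_a["name"], rec_b["name"]) > name_cutoff
            and similarity(rec_a["city"], rec_b["city"]) > other_cutoff)


a = {"name": "Philadelphia Museum Of Art", "city": "Philadelphia, PA"}
b = {"name": "Philadelphia Museum of Arts", "city": "Philadelphia, PA"}
c = {"name": "Brooklyn Museum", "city": "Brooklyn, NY"}
print(is_duplicate(a, b), is_duplicate(a, c))
```

Requiring a second matching characteristic guards against merging institutions that share a near-identical name but sit in different cities.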

entity_id: 3944278 | name: Philadelphia Museum Of Art | country: United States Of America | description: ['Art Museum', 'Fine Arts', 'Historical Preservation'] | city: Philadelphia, PA

First, each article in which an artist is named was collected. Then, using the Latent Dirichlet Allocation algorithm in the Gensim library, I modeled the topics commonly associated with certain artists and figures. These topics are limited to the raw verbs, nouns and adjectives often mentioned "in the same breath" as the artist.[7]

Next, using the TextBlob library, sentiment analysis was performed on each article associated with the entities. TextBlob's sentiment analyzer yielded two scores for each article, rating the subjectivity and polarity of the text. (These two scores are the output format of TextBlob's default pattern-based analyzer; its Naive Bayes analyzer instead returns a positive or negative classification with probabilities.) Subjectivity is a decimal number between 0 and 1, with 0 indicating purely informational statements and 1 indicating purely emotional statements. Polarity is a decimal between -1 and 1, with 1 indicating wholly positive statements and -1 indicating wholly negative statements.[8] Such a model can process negated statements ("It’s not bad today" would be positive) but cannot detect facetious or indirect speech ("goodness aren’t you a ray of sunshine today" would also be positive). This proved to have a significant effect on the results.
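
To make the negation and sarcasm behavior concrete, here is a toy lexicon-based scorer. It is emphatically not TextBlob's implementation: the word weights are invented for this illustration, and the function toy_polarity is hypothetical. It only shows why a word-level approach handles "not bad" yet misreads figurative or facetious language.

```python
import re

# Hypothetical word weights, invented for this illustration only.
LEXICON = {
    "good": 0.7, "bad": -0.7, "sunshine": 0.6, "beauty": 0.8,
    "virtuosity": 0.6, "horrifying": -1.0, "grotesque": -0.8,
}
NEGATIONS = {"not", "never", "no"}


def toy_polarity(text: str) -> float:
    """Mean weight of known words, in [-1, 1]; a preceding negation flips the sign."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = []
    for i, token in enumerate(tokens):
        if token in LEXICON:
            weight = LEXICON[token]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                weight = -weight  # "not bad" counts as positive
            scores.append(weight)
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("It's not bad today"))
print(toy_polarity("goodness aren't you a ray of sunshine today"))
print(toy_polarity("renders grotesque iconography with painterly virtuosity"))
```

In this sketch the negated sentence scores positive, the sarcastic one also scores positive, and the admiring sentence about "grotesque iconography" scores negative, because the scorer sees words rather than intent.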

Dataset

As of April 2021, this dataset comprises several groups of files.

The first group is the 250,241 individual articles which were scraped. The sources for these works were Artforum, Artnet, Hyperallergic, the New York Times Arts Section, and Frieze. From these, 4,541,401 mentions of names and 2,060,365 mentions of organizations were extracted. Each entity has 9 different datapoints, corresponding to the article in which it was mentioned. Among these mentions there are approximately 521,750 unique names and 296,895 unique organizations.

However, as this is an ongoing project, the databases at the time of reading are likely considerably larger. Additionally, while the source code to gather and process this data is available on GitHub, the dataset itself is derived from copyrighted material and cannot be shared.

Hypothesis

I hypothesized that the overall "arc" of an artist's or public figure's involvement in the cultural sector would be clearly discernible from the collected data. To test this, I limited myself to several artists and art-related figures whose careers I am generally familiar with. To demonstrate the relationship between the artists and the kinds of reviews they were receiving, interactive scatter plots were created against each variable, with each dot carrying its associated metadata.

For many of the plots below, the interactive scatter plot can be accessed by clicking the image to download the HTML file.

Experiment 1: The 2017 Whitney Biennial

As an event indicative of both the arc of an artist’s career and of positive or negative subjective opinion, I selected the 2017 Whitney Biennial. The biennial, a showcase of contemporary art, is well known for setting trends in the art world and art market, with outsize influence on the New York art scene. The 2017 biennial, however, was notable for a controversy over a single artist, with protests and petitions calling for the removal and destruction of the painting "Open Casket." Painted by the artist Dana Schutz, the work depicted the murdered body of Emmett Till in an abstracted figurative style. Schutz is white, and the painting was widely protested as insensitive to the politics of representing Black trauma, and as "transmut[ing] Black suffering into profit and fun."[9]

Given the heated commentary around this work, I used it, along with the other artists of the Whitney Biennial, as a test case for the ability of the dataset and model to demonstrate changes in taste and controversy in the art world.

I theorized that the Whitney Biennial would be discernible in the public discussion of all the artists, but that only the articles focusing on Schutz would have lower polarity scores. To evaluate this, Figures 1-4 were plotted.
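
Alongside the plots, the same two expectations (a "bump" in article volume and a drop in polarity) could be checked numerically by grouping an entity's articles by month. The sketch below assumes each article has been reduced to a (publication date, polarity) pair. The function monthly_summary is hypothetical, and the sample values are invented placeholders, not figures from the dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_summary(rows):
    """Map (year, month) -> (article count, mean polarity), in date order."""
    buckets = defaultdict(list)
    for published, polarity in rows:
        buckets[(published.year, published.month)].append(polarity)
    return {month: (len(scores), mean(scores))
            for month, scores in sorted(buckets.items())}


# Hypothetical (date, polarity) pairs; not values from the actual dataset.
sample = [
    (date(2017, 1, 12), 0.20),
    (date(2017, 3, 17), -0.10),
    (date(2017, 3, 21), 0.00),
    (date(2017, 3, 28), -0.20),
    (date(2017, 6, 2), 0.15),
]

summary = monthly_summary(sample)
for (year, month), (n_articles, avg_polarity) in summary.items():
    print(f"{year}-{month:02d}: {n_articles} articles, mean polarity {avg_polarity:+.2f}")
```

A month with an unusually high count marks a publicity bump, and comparing its mean polarity with neighboring months indicates whether the coverage was also more negative.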

Interactive HTML plot not available due to size constraints.

Figure 1 (Non-Interactive Version) FILE: sub_pol_2017 whitney biennial.html

Viewing all the artists together, this graph shows a steady increase in density as the artists’ careers build toward a successful mid-career appearance in a show like the Whitney, and also as the collected data itself becomes more complete.

This graph reflects the coverage of the data: while it shows history for some artists, mentions from sources like the New York Times only enter the dataset after 2000. Nonetheless, the dataset, relatively comprehensive by 2010, shows a constant level of praise and publicity since the exhibition for the cohort of participants in the show.

Figure 2 (Non-Interactive Version) FILE: sub_pol_dana schutz.html

In the graph of Schutz’s mentions, however, there are several surprising trends. The hypothesis that the sentiment analysis would follow the controversy at the biennial is borne out to a certain extent. Because the data is specific to this one artist, the "bump" of publicity the biennial provides is also noticeable, with a dense cluster of dots marking the opening of the show. The positivity (polarity) scores of this group of articles are all very low, reflecting the controversy and the heated language used to describe her work in this instance. Also noticeable is a general trend of gently rising reception in reviews since the controversy at the Whitney.

This plot, however, reveals some of the limitations of my model. The article about her with the lowest positivity score does not date from the period when accusations of racial insensitivity were being reported. Rather, it is an earlier article with the title "The Horrifying Beauty of Dana Schutz's Paintings."[10] This article, a glowing review of a show from 2016, uses language such as "Dana Schutz renders grotesque iconography with painterly virtuosity." Unfortunately, the sentiment analysis is not robust to this kind of figurative language: wording such as "horrifying" or "grotesque" is registered as negative regardless of context.

Figure 3 (Non-Interactive Version) FILE: sub_pol_aliza nisenbaum.html

As a control, I compared Schutz’s distribution to that of Aliza Nisenbaum, another female figurative painter in the show, of a similar age to Schutz.[11] While her career has been in the press for a shorter amount of time, her reviews and mentions show a relatively constant distribution, and she also displays a similar "bump" of articles relating to the biennial.

To further explore different outcomes for artists on a longer timeline, I plotted a later-career participant in the show, the male installation artist Larry Bell.[12]

Figure 4 (Non-Interactive Version) FILE: sub_pol_larry bell.html

Bell, whose career stretches back to the 1960s, shows an interesting set of trends. One is that his work has been spoken about in relatively neutral tones for his entire career. Another is that, as he is a late-career artist, the biennial had little effect on his distribution. Instead, shows in 2016 in Hong Kong and Los Angeles seem to play a much more prominent role in boosting the density of his striation.

Experiment 2: Non-Artists

To further explore such considerations, the tones detected for non-artist names were also plotted.

The Sackler Family:

The first was the Sackler family, who are well known as generous art patrons, yet who, as owners of Purdue Pharma, have increasingly been held accountable for the opioid crisis. Indeed, artists have been prominent among those protesting the influence of the Sacklers on museums, with the artist Nan Goldin leading a particularly salient effort.[13]

I theorized that the Sacklers would be positively regarded until around 2018, when the campaign against their presence on museum boards gained traction in media.

Figure 5 (Non-Interactive Version) FILE: sub_pol_Sackler Family in Culture Writing.html

This is borne out by the data: most historical mentions of the family are neutral or mostly positive, while later mentions reflect the increasing association of their name with the opioid crisis and OxyContin.

Trump Presidency:

Finally, as a gauge of the wider impact of art on politics and vice versa, I tracked all mentions of Donald Trump in culture media. The 2016 election, the 2020 election, and the Trump presidency in general have been widely held to have inspired a wave of protest art, as well as to have made art much more politically engaged. To measure the effect of this on art writing, I plotted these mentions, marking the salient historical moments.

As with the Sacklers, I expected the art world commentary on Trump to be sparse, yet uniformly neutral, until the campaign and his presidency.

While this is broadly true, an interesting trend is the surprise of the 2016 election: while articles about Trump increase, the bulk of the articles appear after November rather than before. The data also reflects the trends after the Capitol riot in January 2021: the tone of the articles on the former president falls more sharply than even during his presidency, reflecting the much more common written association of his name with negative wording like "coup", "domestic terrorism" and "white supremacy."

Conclusions

I believe that these metrics are only the first step in understanding this data. Natural language processing tools, when reading the text of these articles, reenact a human action; they are in effect automata. However, no one person could ever sustain such a level of reading. While the data itself is a framework of understanding, the human-readable elements (meanings, inferences, learning) are only a surface layer of which the machine is ignorant. This parallels the nature of art history. We can often only understand the visual culture of a certain time after the fact, when the present has been desiccated into factoids on a page. However, such a long view is more difficult in our present. The art world, with its financialized focus on taste and form, is no exception to this.

References:

[1] https://spacy.io/usage/facts-figures

[2] https://www-nlpir.nist.gov/related_projects/muc/proceedings/score_reports_index.html

[3] https://github.com/kensho-technologies/qwikidata

[4] https://query.wikidata.org/

[5] https://data.cityofnewyork.us/Recreation/New-York-City-Art-Galleries/tgyc-r5jh

[6] https://nonprofit-open-data-collective.github.io/overview/

[7] https://radimrehurek.com/gensim/auto_examples/index.html#documentation

[8] https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis

[9] https://news.artnet.com/art-world/dana-schutz-painting-emmett-till-whitney-biennial-protest-897929

[10] https://www.frieze.com/article/horrifying-beauty-dana-schutzs-paintings

[11] https://alizanisenbaum.com/

[12] https://larrybell.com/

[13] https://news.artnet.com/art-world/nan-goldin-sackler-met-2019-1460413