Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints § § ‡ Jieyu Zhao Tianlu Wang Mark Yatskar § § Vicente Ordonez Kai-Wei Chang University of Virginia {jz4fu, tw8cb, vicente, kc2wc}@virginia.edu University of Washington [email protected] Abstract tics from images and require large quantities of la- beled data, predominantly retrieved from the web. Language is increasingly being used to de- Methods often combine structured prediction and fine rich visual recognition problems with deep learning to model correlations between la- supporting image collections sourced from bels and images to make judgments that otherwise the web. Structured prediction models are would have weak visual support. For example, in used in these tasks to take advantage of the first image of Figure 1, it is possible to pre- correlations between co-occurring labels dict a spatula by considering that it is a com- and visual input but risk inadvertently en- mon tool used for the activity cooking. Yet such coding social biases found in web corpora. methods run the risk of discovering and exploiting In this work, we study data and models as- societal biases present in the underlying web cor- sociated with multilabel object classifica- pora. Without properly quantifying and reducing tion and visual semantic role labeling. We the reliance on such correlations, broad adoption find that (a) datasets for these tasks con- of these models can have the inadvertent effect of tain significant gender bias and (b) mod- magnifying stereotypes. els trained on these datasets further am- In this paper, we develop a general framework plify existing bias. For example, the ac- for quantifying bias and study two concrete tasks, tivity cooking is over 33% more likely visual semantic role labeling (vSRL) and multil- to involve females than males in a train- abel object classification (MLC). In vSRL, we use ing set, and a trained model further ampli- the imSitu formalism (Yatskar et al., 2016, 2017), fies the disparity to 68% at test time. We where the goal is to predict activities, objects and propose to inject corpus-level constraints the roles those objects play within an activity. For for calibrating existing structured predic- MLC, we use MS-COCO (Lin et al., 2014; Chen tion models and design an algorithm based et al., 2015), a recognition task covering 80 object on Lagrangian relaxation for collective in- classes. We use gender bias as a running example ference. Our method results in almost no and show that both supporting datasets for these performance loss for the underlying recog- 1 tasks are biased with respect to a gender binary . nition task but decreases the magnitude of Our analysis reveals that over 45% and 37% bias amplification by 47.5% and 40.5% for of verbs and objects, respectively, exhibit bias to- multilabel classification and visual seman- ward a gender greater than 2:1. For example, as tic role labeling, respectively. seen in Figure 1, the cooking activity in imSitu is a heavily biased verb. Furthermore, we show 1 Introduction that after training state-of-the-art structured pre- dictors, models amplify the existing bias, by 5.0% Visual recognition tasks involving language, such for vSRL, and 3.6% in MLC. as captioning (Vinyals et al., 2015), visual ques- tion answering (Antol et al., 2015), and visual se- To simplify our analysis, we only consider a gender bi- mantic role labeling (Yatskar et al., 2016), have nary as perceived by annotators in the datasets. We recog- nize that a more fine-grained analysis would be needed for emerged as avenues for expanding the diversity deployment in a production system. Also, note that the pro- of information that can be recovered from im- posed approach can be applied to other NLP tasks and other ages. These tasks aim at extracting rich seman- variables such as identification with a racial or ethnic group. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics COOKING COOKING COOKING COOKING COOKING ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT MAN ∅ ∅ FOOD PASTA FOOD FRUIT FOOD FOOD FOOD MEAT HEAT HEAT STOVE HEAT STOVE HEAT STOVE HEAT STOVE TOOL SPATULA TOOL KNIFE TOOL SPATULA TOOL SPATULA TOOL SPATULA PLACE KITCHEN PLACE KITCHEN PLACE OUTSIDE PLACE KITCHEN PLACE KITCHEN Figure 1: Five example images from the imSitu visual semantic role labeling (vSRL) dataset. Each im- age is paired with a table describing a situation: the verb, cooking, its semantic roles, i.e agent, and noun values filling that role, i.e.woman. In the imSitu training set, 33% of cooking images have man in the agent role while the rest have woman. After training a Conditional Random Field (CRF), bias is amplified:man fills 16% ofagent roles in cooking images. To reduce this bias amplification our cal- ibration method adjusts weights of CRF potentials associated with biased predictions. After applying our methods, man appears in the agent role of 20% of cooking images, reducing the bias amplification by 25%, while keeping the CRF vSRL performance unchanged. To mitigate the role of bias amplification when 2 Related Work training models on biased corpora, we propose As intelligence systems start playing important a novel constrained inference framework, called roles in our daily life, ethics in artificial in- RBA, for Reducing Bias Amplification in predic- telligence research has attracted significant in- tions. Our method introduces corpus-level con- terest. It is known that big-data technologies straints so that gender indicators co-occur no more sometimes inadvertently worsen discrimination often together with elements of the prediction task due to implicit biases in data (Podesta et al., than in the original training distribution. For ex- 2014). Such issues have been demonstrated in var- ample, as seen in Figure 1, we would like noun ious learning systems, including online advertise- man to occur in the agent role of the cooking ment systems (Sweeney, 2013), word embedding as often as it occurs in the imSitu training set when models (Bolukbasi et al., 2016; Caliskan et al., evaluating on a development set. We combine 2017), online news (Ross and Carter, 2011), web our calibration constraint with the original struc- search (Kay et al., 2015), and credit score (Hardt tured predictor and use Lagrangian relaxation (Ko- et al., 2016). Data collection biases have been rte and Vygen, 2008; Rush and Collins, 2012) to discussed in the context of creating image cor- reweigh bias creating factors in the original model. pus (Misra et al., 2016; van Miltenburg, 2016) We evaluate our calibration method on imSitu and text corpus (Gordon and Van Durme, 2013; vSRL and COCO MLC and find that in both in- Van Durme, 2010). In contrast, we show that given stances, our models substantially reduce bias am- a gender biased corpus, structured models such as plification. For vSRL, we reduce the average mag- conditional random fields, amplify the bias. nitude of bias amplification by 40.5%. For MLC, The effect of the data imbalance can be easily we are able to reduce the average magnitude of detected and fixed when the prediction task is sim- bias amplification by 47.5%. Overall, our calibra- ple. For example, when classifying binary data tion methods do not affect the performance of the with unbalanced labels (i.e., samples in the major- underlying visual system, while substantially re- ity class dominate the dataset), a classifier trained ducing the reliance of the system on socially bi- exclusively to optimize accuracy learns to always ased correlations . predict the majority label, as the cost of mak- ing mistakes on samples in the minority class can be neglected. Various approaches have been pro- 2 posed to make a “fair” binary classification (Baro- Code and data are available at https://github. com/uclanlp/reducingbias cas and Selbst, 2014; Dwork et al., 2012; Feldman 2980 et al., 2015; Zliobaite, 2015). For structured pre- variable, g, as: diction tasks the effect is harder to quantify and c(o, g) we are the first to propose methods to reduce bias b(o, g) = , c(o, g ) amplification in this context. g ∈G Lagrangian relaxation and dual decomposi- where c(o, g) is the number of occurrences of o tion techniques have been widely used in NLP and g in a corpus. For example, to analyze how tasks (e.g., (Sontag et al., 2011; Rush and Collins, genders of agents and activities are co-related in 2012; Chang and Collins, 2011; Peng et al., 2015)) vSRL, we define the gender bias towardman for for dealing with instance-level constraints. Simi- each verb b(verb, man) as: lar techniques (Chang et al., 2013; Dalvi, 2015) have been applied in handling corpus-level con- c(verb, man) . (1) straints for semi-supervised multilabel classifica- c(verb, man) + c(verb, woman) tion. In contrast to previous works aiming for If b(o, g) > 1/kGk, then o is positively correlated improving accuracy performance, we incorporate corpus-level constraints for reducing gender bias. with g and may exhibit bias. Evaluating bias amplification To evaluate the degree of bias amplification, we propose to com- 3 Visualizing and Quantifying Biases pare bias scores on the training set, b (o, g), with bias scores on an unlabeled evaluation set of im- Modern statistical learning approaches capture ages b(o, g) that has been annotated by a predic- correlations among output variables in order to tor. We assume that the evaluation set is iden- make coherent predictions. However, for real- tically distributed to the training set. There- world applications, some implicit correlations are fore, if o is positively correlated with g (i.e, not appropriate, especially if they are amplified. b (o, g) > 1/kGk) and b(o, g) is larger than In this section, we present a general framework to b (o, g), we say bias has been amplified. For analyze inherent biases learned and amplified by a example, if b (cooking, woman) = .66, and prediction model. b(cooking, woman) = .84, then the bias of woman toward cooking has been amplified. Fi- Identifying bias We consider that prediction nally, we define the mean bias amplification as: problems involve several inter-dependent output variables y , y , ...y , which can be represented 1 2 K X X as a structure y = {y , y , ...y } ∈ Y . This ∗ 1 2 K b(o, g)− b (o, g). |O| is a common setting in NLP applications, includ- o∈{o∈O|b (o,g)>1/kGk} ing tagging, and parsing. For example, in the vSRL task, the output can be represented as a This score estimates the average magnitude of bias structured table as shown in Fig 1. Modern tech- amplification for pairs ofo and g which exhibited niques often model the correlation between the bias. sub-components in y and make a joint prediction 4 Calibration Algorithm over them using a structured prediction model. More details will be provided in Section 4. In this section, we introduce Reducing Bias We assume there is a subset of output vari- Amplification, RBA, a debiasing technique for ables g ⊆ y, g ∈ G that reflects demographic at- calibrating the predictions from a structured pre- tributes such as gender or race (e.g. g ∈ G = diction model. The intuition behind the algorithm {man, woman} is the agent), and there is another is to inject constraints to ensure the model pre- subset of the output o ⊆ y, o ∈ O that are co- dictions follow the distribution observed from the related with g (e.g., o is the activity present in an training data. For example, the constraints added image, such as cooking). The goal is to identify to the vSRL system ensure the gender ratio of each the correlations that are potentially amplified by a verb in Eq. (1) are within a given margin based on learned model. the statistics of the training data. These constraints To achieve this, we define the bias score of a are applied at the corpus level, because comput- given output, o, with respect to a demographic ing gender ratio requires the predictions of all test 2981 instances. As a result, a joint inference over test represents the overall score of an assignment, and instances is required . Solving such a giant in- s (v, i) and s (v, r, i) are the potentials of the sub- θ θ ference problem with constraints is hard. There- assignments. The output space Y contains all fea- fore, we present an approximate inference algo- sible assignments of y and y , which can be rep- v v,r rithm based on Lagrangian relaxation. The advan- resented as instance-wise constraints. For exam- tages of this approach are: ple, the constraint, y = 1 ensures only one activity is assigned to one image. • Our algorithm is iterative, and at each it- eration, the joint inference problem is de- Corpus-level Constraints Our goal is to inject composed to a per-instance basis. This can constraints to ensure the output labels follow a be solved by the original inference algo- desired distribution. For example, we can set a rithm. That is, our approach works as a meta- constraint to ensure the gender ratio for each ac- algorithm and developers do not need to im- tivity in Eq. (1) is within a given margin. Let i i i plement a new inference algorithm. y = {y } ∪ {y } be the output assignment for v v,r 5 ∗ test instance i . For each activity v , the con- • The approach is general and can be applied in straints can be written as any structured model. i v=v ,r∈M • Lagrangian relaxation guarantees the solu- ∗ ∗ b −γ≤ P P ≤b + γ i i y + y tion is optimal if the algorithm converges and ∗ ∗ i v=v ,r∈W i v=v ,r∈M (2) all constraints are satisfied. ∗ ∗ ∗ where b ≡ b (v , man) is the desired gender ra- In practice, it is hard to obtain a solution where tio of an activity v , γ is a user-specified margin. all corpus-level constrains are satisfied. However, M and W are a set of semantic role-values rep- we show that the performance of the proposed ap- resenting the agent as a man or a woman, respec- proach is empirically strong. We use imSitu for tively. vSRL as a running example to explain our algo- Note that the constraints in (2) involve all the rithm. test instances. Therefore, it requires a joint in- Structured Output Prediction As we men- ference over the entire test corpus. In general, tioned in Sec. 3, we assume the structured output these corpus-level constraints can be represented y ∈ Y consists of several sub-components. Given in a form of A y − b ≤ 0, where each row l×K a test instance i as an input, the inference problem in the matrix A ∈ R is the coefficients of one is to find constraint, and b ∈ R . The constrained inference arg max f (y, i), problem can then be formulated as: y∈Y where f (y, i) is a scoring function based on a max f (y , i), i i {y }∈{Y } model θ learned from the training data. The struc- X (3) tured output y and the scoring function f (y, i) can s.t. A y − b ≤ 0, be decomposed into small components based on an independence assumption. For example, in the vSRL task, the output y consists of two types of where {Y } represents a space spanned by possi- binary output variables{y } and{y }. The vari- ble combinations of labels for all instances. With- v v,r able y = 1 if and only if the activity v is chosen. out the corpus-level constraints, Eq. (3) can be Similarly, y = 1 if and only if both the activity v optimized by maximizing each instance i v,r and the semantic role r are assigned . The scoring max f (y , i), function f (y, i) is decomposed accordingly such y ∈Y that: X X separately. f (y, i) = y s (v, i) + y s (v, r, i), θ v θ v,r θ v v,r Lagrangian Relaxation Eq. (3) can be solved by several combinatorial optimization methods. A sufficiently large sample of test instances must be used so that bias statistics can be estimated. In this work we use For example, one can represent the problem as an the entire test set for each respective problem. 4 5 We use r to refer to a combination of role and noun. For For the sake of simplicity, we abuse the notations and use example, one possible value indicates an agent is a woman. i to represent both input and data index. 2982 Dataset Task Images O-Type kOk role in vSRL, and any occurrence in text associ- imSitu vSRL 60,000 verb 212 ated with the images in MLC. Problem statistics MS-COCO MLC 25,000 object 66 are summarized in Table 1. We also provide setup details for our calibration method. Table 1: Statistics for the two recognition prob- 5.1 Visual Semantic Role Labeling lems. In vSRL, we consider gender bias relating to verbs, while in MLC we consider the gender Dataset We evaluate on imSitu (Yatskar et al., bias related to objects. 2016) where activity classes are drawn from verbs and roles in FrameNet (Baker et al., 1998) and noun categories are drawn from WordNet (Miller integer linear program and solve it using an off- et al., 1990). The original dataset includes about the-shelf solver (e.g., Gurobi (Gurobi Optimiza- 125,000 images with 75,702 for training, 25,200 tion, 2016)). However, Eq. (3) involves all test in- for developing, and 25,200 for test. However, the stances. Solving a constrained optimization prob- dataset covers many non-human oriented activities lem on such a scale is difficult. Therefore, we con- (e.g., rearing, retrieving, and wagging), sider relaxing the constraints and solve Eq. (3) us- so we filter out these verbs, resulting in 212 verbs, ing a Lagrangian relaxation technique (Rush and leaving roughly 60,000 of the original 125,000 im- Collins, 2012). We introduce a Lagrangian multi- ages in the dataset. plier λ ≥ 0 for each corpus-level constraint. The Lagrangian is Model We build on the baseline CRF released with the data, which has been shown effective L(λ,{y }) = compared to a non-structured prediction base- X X X (4) line (Yatskar et al., 2016). The model decomposes i i f (y )− λ A y − b , θ j j j the probability of a realized situation, y, the com- i j=1 i bination of activity, v, and realized frame, a set of where all the λ ≥ 0,∀j ∈ {1, . . . , l}. The solu- semantic (role,noun) pairs (e, n ), given an image j e tion of Eq. (3) can be obtained by the following i as : iterative procedure: p(y|i; θ) ∝ ψ(v, i; θ) ψ(v, e, n , i; θ) 1) At iteration t, get the output solution of each (e,n )∈R instance i where each potential value in the CRF for subpart i,(t) (t−1) y = argmax L(λ , y) (5) x, is computed using features f from the VGG y∈Y convolutional neural network (Simonyan and Zis- serman, 2014) on an input image, as follows: 2) update the Lagrangian multipliers. w f +b i x ψ(x, i; θ) = e , (t) (t−1) i,(t) λ =max 0, λ + η(Ay − b) , where w and b are the parameters of an affine transformation layer. The model explicitly cap- (0) where λ = 0. η is the learning rate for updat- tures the correlation between activities and nouns (t−1) ing λ. Note that with a fixedλ , Eq. (5) can in semantic roles, allowing it to learn common pri- be solved using the original inference algorithms. ors. We use a model pretrained on the original task The algorithm loops until all constraints are satis- with 504 verbs. fied (i.e. optimal solution achieved) or reach max- imal number of iterations. 5.2 Multilabel Classification Dataset We use MS-COCO (Lin et al., 2014), 5 Experimental Setup a common object detection benchmark, for multi- In this section, we provide details about the two vi- label object classification. The dataset contains 80 sual recognition tasks we evaluated for bias: visual object types but does not make gender distinctions semantic role labeling (vSRL), and multi-label between man and woman. We use the five asso- classification (MLC). We focus on gender, defin- ciated image captions available for each image in ing G = {man, woman} and focus on the agent this dataset to annotate the gender of people in the 2983 images. If any of the captions mention the word 6.1 Visual Semantic Role Labeling man or woman we mark it, removing any images imSitu is gender biased In Figure 2(a), along that mention both genders. Finally, we filter any the x-axis, we show the male favoring bias of im- object category not strongly associated with hu- Situ verbs. Overall, the dataset is heavily biased mans by removing objects that do not occur with toward male agents, with 64.6% of verbs favoring man or woman at least 100 times in the training a male agent by an average bias of 0.707 (roughly set, leaving a total of 66 objects. 3:1 male). Nearly half of verbs are extremely bi- ased in the male or female direction: 46.95% of Model For this multi-label setting, we adapt a verbs favor a gender with a bias of at least 0.7. similar model as the structured CRF we use for Figure 2(a) contains several activity labels reveal- vSRL. We decompose the joint probability of the ing problematic biases. For example, shopping, output y, consisting of all object categories, c, and microwaving and washing are biased toward gender of the person, g, given an image i as: a female agent. Furthermore, several verbs such as driving, shooting, and coaching are p(y|i; θ) ∝ ψ(g, i; θ) ψ(g, c, i; θ) heavily biased toward a male agent. c∈y Training on imSitu amplifies bias In Fig- where each potential value for x, is computed us- ure 2(a), along the y-axis, we show the ratio of ing features, f , from a pretrained ResNet-50 con- male agents (% of total people) in predictions on volutional neural network evaluated on the image, an unseen development set. The mean bias ampli- w f +b fication in the development set is high, 0.050 on i x ψ(x, i; θ) = e . average, with 45.75% of verbs exhibiting ampli- fication. Biased verbs tend to have stronger am- We trained a model using SGD with learning rate plification: verbs with training bias over 0.7 in −5 −4 10 , momentum 0.9 and weight-decay 10 , fine either the male or female direction have a mean tuning the initial visual network, for 50 epochs. amplification of0.072. Several already problem- atic biases have gotten much worse. For example, 5.3 Calibration serving, only had a small bias toward females The inference problems for both models are: in the training set, 0.402, is now heavily biased toward females, 0.122. The verb tuning, origi- arg max f (y, i) = log p(y|i; θ). nally heavily biased toward males, 0.878, now has y∈Y exclusively male agents. We use the algorithm in Sec. (4) to calibrate the 6.2 Multilabel Classification predictions using model θ. Our calibration tries to enforce gender statistics derived from the training MS-COCO is gender biased In Figure 2(b) set of corpus applicable for each recognition prob- along the x-axis, similarly to imSitu, we ana- lem. For all experiments, we try to match gen- lyze bias of objects in MS-COCO with respect der ratios on the test set within a margin of .05 of to males. MS-COCO is even more heavily bi- their value on the training set. While we do adjust ased toward men than imSitu, with 86.6% of ob- the output on the test set, we never use the ground jects biased toward men, but with smaller average truth on the test set and instead working from the magnitude, 0.65. One third of the nouns are ex- assumption that it should be similarly distributed tremely biased toward males, 37.9% of nouns fa- as the training set. When running the debiasing al- vor men with a bias of at least 0.7. Some prob- −1 gorithm, we set η = 10 and optimize for 100 lematic examples include kitchen objects such as iterations. knife, fork, or spoon being more biased to- ward woman. Outdoor recreation related objects 6 Bias Analysis such tennis racket, snowboard and boat tend to be more biased toward men. In this section, we use the approaches outlined in Section 3 to quantify the bias and bias amplifi- 6 In this gender binary, bias toward woman is 1− the bias cation in the vSRL and the MLC tasks. toward man 2984 1.0 1.0 tuning snowboard coaching motorcycle aiming shoveling tie 0.9 boat shooting 0.8 traffic light skis pumping keyboard 0.8 driving hot dog 0.6 0.7 tennis racket 0.6 wine glass 0.4 0.5 spoon shopping 0.2 cooking serving 0.4 handbag knife combing fork microwaving twisting washing 0.0 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL (b) Bias analysis on MS-COCO MLC Figure 2: Gender bias analysis of imSitu vSRL and MS-COCO MLC. (a) gender bias of verbs toward man in the training set versus bias on a predicted development set. (b) gender bias of nouns toward man in the training set versus bias on the predicted development set. Values near zero indicate bias toward woman while values near 0.5 indicate unbiased variables. Across both dataset, there is significant bias toward males, and significant bias amplification after training on biased training data. Training on MS-COCO amplifies bias In Fig- 7 Calibration Results ure 2(b), along the y-axis, we show the ratio of We test our methods for reducing bias amplifica- man (% of both gender) in predictions on an un- tion in two problem settings: visual semantic role seen development set. The mean bias amplifica- labeling in the imSitu dataset (vSRL) and multil- tion across all objects is 0.036, with 65.67% of abel image classification in MS-COCO (MLC). In nouns exhibiting amplification. Larger training all settings we derive corpus constraints using the bias again tended to indicate higher bias amplifi- training set and then run our calibration method in cation: biased objects with training bias over 0.7 batch on either the development or testing set. Our had mean amplification of0.081. Again, several results are summarized in Table 2 and Figure 3. problematic biases have now been amplified. For example, kitchen categories already biased toward 7.1 Visual Semantic Role Labeling females such as knife, fork and spoon have Our quantitative results are summarized in the first all been amplified. Technology oriented categories two sections of Table 2. On the development initially biased toward men such as keyboard set, the number of verbs whose bias exceed the and mouse have each increased their bias toward original bias by over 5% decreases 30.5% (Viol.). males by over 0.100. Overall, we are able to significantly reduce bias 6.3 Discussion amplification in vSRL by 52% on the develop- We confirmed our hypothesis that (a) both the im- ment set (Amp. bias). We evaluate the under- lying recognition performance using the standard Situ and MS-COCO datasets, gathered from the measure in vSRL: top-1 semantic role accuracy, web, are heavily gender biased and that (b) mod- which tests how often the correct verb was pre- els trained to perform prediction on these datasets dicted and the noun value was correctly assigned amplify the existing gender bias when evaluated on development data. Furthermore, across both to a semantic role. Our calibration method results in a negligible decrease in performance (Perf.). In datasets, we showed that the degree of bias am- plification was related to the size of the initial Figure 3(c) we can see that the overall distance to the training set distribution after applying RBA de- bias, with highly biased object and verb categories exhibiting more bias amplification. Our results creased significantly, over 39%. demonstrate that care needs be taken in deploying Figure 3(e) demonstrates that across all initial such uncalibrated systems otherwise they could training bias, RBA is able to reduce bias amplifi- not only reinforce existing social bias but actually cation. In general, RBA struggles to remove bias make them worse. amplification in areas of low initial training bias, predicted gender ratio predicted gender ratio 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL without RBA (b) Bias analysis on MS-COCO MLC without RBA 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (c) Bias analysis on imSitu vSRL with RBA (d) Bias analysis on MS-COCO MLC with RBA 0.10 0.08 0.07 0.08 0.06 0.05 0.06 0.04 0.04 0.03 0.02 0.02 0.01 0.00 0.00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (e) Bias in vSRL with (blue) / without (red) RBA (f) Bias in MLC with (blue) / without (red) RBA Figure 3: Results of reducing bias amplification using RBA on imSitu vSRL and MS-COCO MLC. Figures 3(a)-(d) show initial training set bias along the x-axis and development set bias along the y- axis. Dotted blue lines indicate the 0.05 margin used in RBA, with points violating the margin shown in red while points meeting the margin are shown in green. Across both settings adding RBA signifi- cantly reduces the number of violations, and reduces the bias amplification significantly. Figures 3(e)-(f) demonstrate bias amplification as a function of training bias, with and without RBA. Across all initial training biases, RBA is able to reduce the bias amplification. predicted gender ratio predicted gender ratio mean bias amplification mean bias amplification predicted gender ratio predicted gender ratio Method Viol. Amp. bias Perf. (%) progress with little or no loss in underlying recog- vSRL: Development Set nition performance. Across both problems, RBA CRF 154 0.050 24.07 was able to reduce bias amplification at all initial CRF + RBA 107 0.024 23.97 values of training bias. vSRL: Test Set 8 Conclusion CRF 149 0.042 24.14 CRF + RBA 102 0.025 24.01 Structured prediction models can leverage correla- MLC: Development Set tions that allow them to make correct predictions CRF 40 0.032 45.27 even with very little underlying evidence. Yet such CRF + RBA 24 0.022 45.19 models risk potentially leveraging social bias in MLC: Test Set their training data. In this paper, we presented a CRF 38 0.040 45.40 general framework for visualizing and quantify- CRF + RBA 16 0.021 45.38 ing biases in such models and proposed RBA to calibrate their predictions under two different set- Table 2: Number of violated constraints, mean tings. Taking gender bias as an example, our anal- amplified bias, and test performance before and af- ysis demonstrates that conditional random fields ter calibration using RBA. The test performances can amplify social bias from data while our ap- of vSRL and MLC are measured by top-1 seman- proach RBA can help to reduce the bias. tic role accuracy and top-1 mean average preci- Our work is the first to demonstrate structured sion, respectively. prediction models amplify bias and the first to propose methods for reducing this effect but sig- nificant avenues for future work remain. While likely because bias is encoded in image statistics RBA can be applied to any structured predic- and cannot be removed as effectively with an im- tor, it is unclear whether different predictors am- age agnostic adjustment. Results on the test set plify bias more or less. Furthermore, we pre- support our development set results: we decrease sented only one method for measuring bias. More bias amplification by 40.5% (Amp. bias). extensive analysis could explore the interaction among predictor, bias measurement, and bias de- 7.2 Multilabel Classification amplification method. Future work also includes Our quantitative results on MS-COCO RBA are applying bias reducing methods in other struc- summarized in the last two sections of Table 2. tured domains, such as pronoun reference resolu- Similarly to vSRL, we are able to reduce the num- tion (Mitkov, 2014). ber of objects whose bias exceeds the original training bias by 5%, by 40% (Viol.). Bias amplifi- Acknowledgement This work was supported in cation was reduced by 31.3% on the development part by National Science Foundation Grant IIS- set (Amp. bias). The underlying recognition sys- 1657193 and two NVIDIA Hardware Grants. tem was evaluated by the standard measure: top- 1 mean average precision, the precision averaged References across object categories. Our calibration method results in a negligible loss in performance. In Fig- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- ure 3(d), we demonstrate that we substantially re- garet Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question an- duce the distance between training bias and bias swering. In Proceedings of the IEEE International in the development set. Finally, in Figure 3(f) we Conference on Computer Vision, pages 2425–2433. demonstrate that we decrease bias amplification for all initial training bias settings. Results on the Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley framenet project. In Proceed- test set support our development results: we de- ings of the Annual Meeting of the Association for crease bias amplification by 47.5% (Amp. bias). Computational Linguistics (ACL), pages 86–90. 7.3 Discussion Solon Barocas and Andrew D Selbst. 2014. Big data’s disparate impact. Available at SSRN 2477899. We have demonstrated that RBA can significantly reduce bias amplification. While were not able to Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, remove all amplification, we have made significant Venkatesh Saligrama, and Adam T Kalai. 2016. 2987 Man is to computer programmer as woman is to Tsung-Yi Lin, Michael Maire, Serge Belongie, James homemaker? debiasing word embeddings. In The Hays, Pietro Perona, Deva Ramanan, Piotr Dollar ´ , Conference on Advances in Neural Information Pro- and C Lawrence Zitnick. 2014. Microsoft coco: cessing Systems (NIPS), pages 4349–4357. Common objects in context. In European Confer- ence on Computer Vision, pages 740–755. Springer. Aylin Caliskan, Joanna J Bryson, and Arvind G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and Narayanan. 2017. Semantics derived automatically K.J. Miller. 1990. Wordnet: An on-line lexical from language corpora contain human-like biases. database. International Journal of Lexicography, Science, 356(6334):183–186. 3(4):235–312. Kai-Wei Chang, S. Sundararajan, and S. Sathiya Emiel van Miltenburg. 2016. Stereotyping and bias in Keerthi. 2013. Tractable semi-supervised learning the flickr30k dataset. MMC. of complex structured prediction models. In Pro- ceedings of the European Conference on Machine Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, Learning (ECML), pages 176–191. and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classifiers from noisy human- Yin-Wen Chang and Michael Collins. 2011. Exact de- centric labels. In Conference on Computer Vision coding of phrase-based translation models through and Pattern Recognition (CVPR), pages 2930–2939. Lagrangian relaxation. In EMNLP, pages 26–37. Ruslan Mitkov. 2014. Anaphora resolution. Rout- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ledge. ishna Vedantam, Saurabh Gupta, Piotr Dollar ´ , and C Lawrence Zitnick. 2015. Microsoft coco captions: Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Data collection and evaluation server. arXiv preprint Dual decomposition inference for graphical models arXiv:1504.00325. over strings. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages Bhavana Bharat Dalvi. 2015. Constrained Semi- 917–927. supervised Learning in the Presence of Unantici- pated Classes. Ph.D. thesis, Google Research. John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdren, and Jefrey Zients. 2014. Big data: Seiz- Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer ing opportunities and preserving values. Executive Reingold, and Richard Zemel. 2012. Fairness Office of the President . through awareness. In Proceedings of the 3rd In- novations in Theoretical Computer Science Confer- Karen Ross and Cynthia Carter. 2011. Women and ence, pages 214–226. ACM. news: A long and winding road. Media, Culture & Society, 33(8):1148–1165. Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubrama- Alexander M Rush and Michael Collins. 2012. A Tuto- nian. 2015. Certifying and removing disparate im- rial on Dual Decomposition and Lagrangian Relax- pact. In Proceedings of International Conference ation for Inference in Natural Language Processing. on Knowledge Discovery and Data Mining (KDD), Journal of Artificial Intelligence Research , 45:305– pages 259–268. Jonathan Gordon and Benjamin Van Durme. 2013. Re- Karen Simonyan and Andrew Zisserman. 2014. Very porting bias and knowledge extraction. Automated deep convolutional networks for large-scale image Knowledge Base Construction (AKBC). recognition. arXiv preprint arXiv:1409.1556. Inc. Gurobi Optimization. 2016. Gurobi optimizer ref- David Sontag, Amir Globerson, and Tommi Jaakkola. erence manual. 2011. Introduction to dual decomposition for infer- ence. Optimization for Machine Learning, 1:219– Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Conference on Neural Information Processing Sys- Latanya Sweeney. 2013. Discrimination in online ad tems (NIPS), pages 3315–3323. delivery. Queue, 11(3):10. Matthew Kay, Cynthia Matuszek, and Sean A Munson. Benjamin D Van Durme. 2010. Extracting implicit 2015. Unequal representation and gender stereo- knowledge from text. Ph.D. thesis, University of types in image search results for occupations. In Rochester. Human Factors in Computing Systems, pages 3819– 3828. ACM. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural im- Bernhard Korte and Jens Vygen. 2008. Combinatorial age caption generator. In Proceedings of the IEEE Optimization: Theory and Application. Springer Conference on Computer Vision and Pattern Recog- Verlag. nition, pages 3156–3164. 2988 Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, and Ali Farhadi. 2017. Commonly uncommon: Seman- tic sparsity in situation recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 5534–5542. Indre Zliobaite. 2015. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing Unpaywall

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Proceedings of the 2017 Conference on Empirical Methods in Natural Language ProcessingJan 1, 2017

Loading next page...
 
/lp/unpaywall/men-also-like-shopping-reducing-gender-bias-amplification-using-corpus-NqOLbLCdTa

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
DOI
10.18653/v1/d17-1323
Publisher site
See Article on Publisher Site

Abstract

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints § § ‡ Jieyu Zhao Tianlu Wang Mark Yatskar § § Vicente Ordonez Kai-Wei Chang University of Virginia {jz4fu, tw8cb, vicente, kc2wc}@virginia.edu University of Washington [email protected] Abstract tics from images and require large quantities of la- beled data, predominantly retrieved from the web. Language is increasingly being used to de- Methods often combine structured prediction and fine rich visual recognition problems with deep learning to model correlations between la- supporting image collections sourced from bels and images to make judgments that otherwise the web. Structured prediction models are would have weak visual support. For example, in used in these tasks to take advantage of the first image of Figure 1, it is possible to pre- correlations between co-occurring labels dict a spatula by considering that it is a com- and visual input but risk inadvertently en- mon tool used for the activity cooking. Yet such coding social biases found in web corpora. methods run the risk of discovering and exploiting In this work, we study data and models as- societal biases present in the underlying web cor- sociated with multilabel object classifica- pora. Without properly quantifying and reducing tion and visual semantic role labeling. We the reliance on such correlations, broad adoption find that (a) datasets for these tasks con- of these models can have the inadvertent effect of tain significant gender bias and (b) mod- magnifying stereotypes. els trained on these datasets further am- In this paper, we develop a general framework plify existing bias. For example, the ac- for quantifying bias and study two concrete tasks, tivity cooking is over 33% more likely visual semantic role labeling (vSRL) and multil- to involve females than males in a train- abel object classification (MLC). In vSRL, we use ing set, and a trained model further ampli- the imSitu formalism (Yatskar et al., 2016, 2017), fies the disparity to 68% at test time. We where the goal is to predict activities, objects and propose to inject corpus-level constraints the roles those objects play within an activity. For for calibrating existing structured predic- MLC, we use MS-COCO (Lin et al., 2014; Chen tion models and design an algorithm based et al., 2015), a recognition task covering 80 object on Lagrangian relaxation for collective in- classes. We use gender bias as a running example ference. Our method results in almost no and show that both supporting datasets for these performance loss for the underlying recog- 1 tasks are biased with respect to a gender binary . nition task but decreases the magnitude of Our analysis reveals that over 45% and 37% bias amplification by 47.5% and 40.5% for of verbs and objects, respectively, exhibit bias to- multilabel classification and visual seman- ward a gender greater than 2:1. For example, as tic role labeling, respectively. seen in Figure 1, the cooking activity in imSitu is a heavily biased verb. Furthermore, we show 1 Introduction that after training state-of-the-art structured pre- dictors, models amplify the existing bias, by 5.0% Visual recognition tasks involving language, such for vSRL, and 3.6% in MLC. as captioning (Vinyals et al., 2015), visual ques- tion answering (Antol et al., 2015), and visual se- To simplify our analysis, we only consider a gender bi- mantic role labeling (Yatskar et al., 2016), have nary as perceived by annotators in the datasets. We recog- nize that a more fine-grained analysis would be needed for emerged as avenues for expanding the diversity deployment in a production system. Also, note that the pro- of information that can be recovered from im- posed approach can be applied to other NLP tasks and other ages. These tasks aim at extracting rich seman- variables such as identification with a racial or ethnic group. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics COOKING COOKING COOKING COOKING COOKING ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT MAN ∅ ∅ FOOD PASTA FOOD FRUIT FOOD FOOD FOOD MEAT HEAT HEAT STOVE HEAT STOVE HEAT STOVE HEAT STOVE TOOL SPATULA TOOL KNIFE TOOL SPATULA TOOL SPATULA TOOL SPATULA PLACE KITCHEN PLACE KITCHEN PLACE OUTSIDE PLACE KITCHEN PLACE KITCHEN Figure 1: Five example images from the imSitu visual semantic role labeling (vSRL) dataset. Each im- age is paired with a table describing a situation: the verb, cooking, its semantic roles, i.e agent, and noun values filling that role, i.e.woman. In the imSitu training set, 33% of cooking images have man in the agent role while the rest have woman. After training a Conditional Random Field (CRF), bias is amplified:man fills 16% ofagent roles in cooking images. To reduce this bias amplification our cal- ibration method adjusts weights of CRF potentials associated with biased predictions. After applying our methods, man appears in the agent role of 20% of cooking images, reducing the bias amplification by 25%, while keeping the CRF vSRL performance unchanged. To mitigate the role of bias amplification when 2 Related Work training models on biased corpora, we propose As intelligence systems start playing important a novel constrained inference framework, called roles in our daily life, ethics in artificial in- RBA, for Reducing Bias Amplification in predic- telligence research has attracted significant in- tions. Our method introduces corpus-level con- terest. It is known that big-data technologies straints so that gender indicators co-occur no more sometimes inadvertently worsen discrimination often together with elements of the prediction task due to implicit biases in data (Podesta et al., than in the original training distribution. For ex- 2014). Such issues have been demonstrated in var- ample, as seen in Figure 1, we would like noun ious learning systems, including online advertise- man to occur in the agent role of the cooking ment systems (Sweeney, 2013), word embedding as often as it occurs in the imSitu training set when models (Bolukbasi et al., 2016; Caliskan et al., evaluating on a development set. We combine 2017), online news (Ross and Carter, 2011), web our calibration constraint with the original struc- search (Kay et al., 2015), and credit score (Hardt tured predictor and use Lagrangian relaxation (Ko- et al., 2016). Data collection biases have been rte and Vygen, 2008; Rush and Collins, 2012) to discussed in the context of creating image cor- reweigh bias creating factors in the original model. pus (Misra et al., 2016; van Miltenburg, 2016) We evaluate our calibration method on imSitu and text corpus (Gordon and Van Durme, 2013; vSRL and COCO MLC and find that in both in- Van Durme, 2010). In contrast, we show that given stances, our models substantially reduce bias am- a gender biased corpus, structured models such as plification. For vSRL, we reduce the average mag- conditional random fields, amplify the bias. nitude of bias amplification by 40.5%. For MLC, The effect of the data imbalance can be easily we are able to reduce the average magnitude of detected and fixed when the prediction task is sim- bias amplification by 47.5%. Overall, our calibra- ple. For example, when classifying binary data tion methods do not affect the performance of the with unbalanced labels (i.e., samples in the major- underlying visual system, while substantially re- ity class dominate the dataset), a classifier trained ducing the reliance of the system on socially bi- exclusively to optimize accuracy learns to always ased correlations . predict the majority label, as the cost of mak- ing mistakes on samples in the minority class can be neglected. Various approaches have been pro- 2 posed to make a “fair” binary classification (Baro- Code and data are available at https://github. com/uclanlp/reducingbias cas and Selbst, 2014; Dwork et al., 2012; Feldman 2980 et al., 2015; Zliobaite, 2015). For structured pre- variable, g, as: diction tasks the effect is harder to quantify and c(o, g) we are the first to propose methods to reduce bias b(o, g) = , c(o, g ) amplification in this context. g ∈G Lagrangian relaxation and dual decomposi- where c(o, g) is the number of occurrences of o tion techniques have been widely used in NLP and g in a corpus. For example, to analyze how tasks (e.g., (Sontag et al., 2011; Rush and Collins, genders of agents and activities are co-related in 2012; Chang and Collins, 2011; Peng et al., 2015)) vSRL, we define the gender bias towardman for for dealing with instance-level constraints. Simi- each verb b(verb, man) as: lar techniques (Chang et al., 2013; Dalvi, 2015) have been applied in handling corpus-level con- c(verb, man) . (1) straints for semi-supervised multilabel classifica- c(verb, man) + c(verb, woman) tion. In contrast to previous works aiming for If b(o, g) > 1/kGk, then o is positively correlated improving accuracy performance, we incorporate corpus-level constraints for reducing gender bias. with g and may exhibit bias. Evaluating bias amplification To evaluate the degree of bias amplification, we propose to com- 3 Visualizing and Quantifying Biases pare bias scores on the training set, b (o, g), with bias scores on an unlabeled evaluation set of im- Modern statistical learning approaches capture ages b(o, g) that has been annotated by a predic- correlations among output variables in order to tor. We assume that the evaluation set is iden- make coherent predictions. However, for real- tically distributed to the training set. There- world applications, some implicit correlations are fore, if o is positively correlated with g (i.e, not appropriate, especially if they are amplified. b (o, g) > 1/kGk) and b(o, g) is larger than In this section, we present a general framework to b (o, g), we say bias has been amplified. For analyze inherent biases learned and amplified by a example, if b (cooking, woman) = .66, and prediction model. b(cooking, woman) = .84, then the bias of woman toward cooking has been amplified. Fi- Identifying bias We consider that prediction nally, we define the mean bias amplification as: problems involve several inter-dependent output variables y , y , ...y , which can be represented 1 2 K X X as a structure y = {y , y , ...y } ∈ Y . This ∗ 1 2 K b(o, g)− b (o, g). |O| is a common setting in NLP applications, includ- o∈{o∈O|b (o,g)>1/kGk} ing tagging, and parsing. For example, in the vSRL task, the output can be represented as a This score estimates the average magnitude of bias structured table as shown in Fig 1. Modern tech- amplification for pairs ofo and g which exhibited niques often model the correlation between the bias. sub-components in y and make a joint prediction 4 Calibration Algorithm over them using a structured prediction model. More details will be provided in Section 4. In this section, we introduce Reducing Bias We assume there is a subset of output vari- Amplification, RBA, a debiasing technique for ables g ⊆ y, g ∈ G that reflects demographic at- calibrating the predictions from a structured pre- tributes such as gender or race (e.g. g ∈ G = diction model. The intuition behind the algorithm {man, woman} is the agent), and there is another is to inject constraints to ensure the model pre- subset of the output o ⊆ y, o ∈ O that are co- dictions follow the distribution observed from the related with g (e.g., o is the activity present in an training data. For example, the constraints added image, such as cooking). The goal is to identify to the vSRL system ensure the gender ratio of each the correlations that are potentially amplified by a verb in Eq. (1) are within a given margin based on learned model. the statistics of the training data. These constraints To achieve this, we define the bias score of a are applied at the corpus level, because comput- given output, o, with respect to a demographic ing gender ratio requires the predictions of all test 2981 instances. As a result, a joint inference over test represents the overall score of an assignment, and instances is required . Solving such a giant in- s (v, i) and s (v, r, i) are the potentials of the sub- θ θ ference problem with constraints is hard. There- assignments. The output space Y contains all fea- fore, we present an approximate inference algo- sible assignments of y and y , which can be rep- v v,r rithm based on Lagrangian relaxation. The advan- resented as instance-wise constraints. For exam- tages of this approach are: ple, the constraint, y = 1 ensures only one activity is assigned to one image. • Our algorithm is iterative, and at each it- eration, the joint inference problem is de- Corpus-level Constraints Our goal is to inject composed to a per-instance basis. This can constraints to ensure the output labels follow a be solved by the original inference algo- desired distribution. For example, we can set a rithm. That is, our approach works as a meta- constraint to ensure the gender ratio for each ac- algorithm and developers do not need to im- tivity in Eq. (1) is within a given margin. Let i i i plement a new inference algorithm. y = {y } ∪ {y } be the output assignment for v v,r 5 ∗ test instance i . For each activity v , the con- • The approach is general and can be applied in straints can be written as any structured model. i v=v ,r∈M • Lagrangian relaxation guarantees the solu- ∗ ∗ b −γ≤ P P ≤b + γ i i y + y tion is optimal if the algorithm converges and ∗ ∗ i v=v ,r∈W i v=v ,r∈M (2) all constraints are satisfied. ∗ ∗ ∗ where b ≡ b (v , man) is the desired gender ra- In practice, it is hard to obtain a solution where tio of an activity v , γ is a user-specified margin. all corpus-level constrains are satisfied. However, M and W are a set of semantic role-values rep- we show that the performance of the proposed ap- resenting the agent as a man or a woman, respec- proach is empirically strong. We use imSitu for tively. vSRL as a running example to explain our algo- Note that the constraints in (2) involve all the rithm. test instances. Therefore, it requires a joint in- Structured Output Prediction As we men- ference over the entire test corpus. In general, tioned in Sec. 3, we assume the structured output these corpus-level constraints can be represented y ∈ Y consists of several sub-components. Given in a form of A y − b ≤ 0, where each row l×K a test instance i as an input, the inference problem in the matrix A ∈ R is the coefficients of one is to find constraint, and b ∈ R . The constrained inference arg max f (y, i), problem can then be formulated as: y∈Y where f (y, i) is a scoring function based on a max f (y , i), i i {y }∈{Y } model θ learned from the training data. The struc- X (3) tured output y and the scoring function f (y, i) can s.t. A y − b ≤ 0, be decomposed into small components based on an independence assumption. For example, in the vSRL task, the output y consists of two types of where {Y } represents a space spanned by possi- binary output variables{y } and{y }. The vari- ble combinations of labels for all instances. With- v v,r able y = 1 if and only if the activity v is chosen. out the corpus-level constraints, Eq. (3) can be Similarly, y = 1 if and only if both the activity v optimized by maximizing each instance i v,r and the semantic role r are assigned . The scoring max f (y , i), function f (y, i) is decomposed accordingly such y ∈Y that: X X separately. f (y, i) = y s (v, i) + y s (v, r, i), θ v θ v,r θ v v,r Lagrangian Relaxation Eq. (3) can be solved by several combinatorial optimization methods. A sufficiently large sample of test instances must be used so that bias statistics can be estimated. In this work we use For example, one can represent the problem as an the entire test set for each respective problem. 4 5 We use r to refer to a combination of role and noun. For For the sake of simplicity, we abuse the notations and use example, one possible value indicates an agent is a woman. i to represent both input and data index. 2982 Dataset Task Images O-Type kOk role in vSRL, and any occurrence in text associ- imSitu vSRL 60,000 verb 212 ated with the images in MLC. Problem statistics MS-COCO MLC 25,000 object 66 are summarized in Table 1. We also provide setup details for our calibration method. Table 1: Statistics for the two recognition prob- 5.1 Visual Semantic Role Labeling lems. In vSRL, we consider gender bias relating to verbs, while in MLC we consider the gender Dataset We evaluate on imSitu (Yatskar et al., bias related to objects. 2016) where activity classes are drawn from verbs and roles in FrameNet (Baker et al., 1998) and noun categories are drawn from WordNet (Miller integer linear program and solve it using an off- et al., 1990). The original dataset includes about the-shelf solver (e.g., Gurobi (Gurobi Optimiza- 125,000 images with 75,702 for training, 25,200 tion, 2016)). However, Eq. (3) involves all test in- for developing, and 25,200 for test. However, the stances. Solving a constrained optimization prob- dataset covers many non-human oriented activities lem on such a scale is difficult. Therefore, we con- (e.g., rearing, retrieving, and wagging), sider relaxing the constraints and solve Eq. (3) us- so we filter out these verbs, resulting in 212 verbs, ing a Lagrangian relaxation technique (Rush and leaving roughly 60,000 of the original 125,000 im- Collins, 2012). We introduce a Lagrangian multi- ages in the dataset. plier λ ≥ 0 for each corpus-level constraint. The Lagrangian is Model We build on the baseline CRF released with the data, which has been shown effective L(λ,{y }) = compared to a non-structured prediction base- X X X (4) line (Yatskar et al., 2016). The model decomposes i i f (y )− λ A y − b , θ j j j the probability of a realized situation, y, the com- i j=1 i bination of activity, v, and realized frame, a set of where all the λ ≥ 0,∀j ∈ {1, . . . , l}. The solu- semantic (role,noun) pairs (e, n ), given an image j e tion of Eq. (3) can be obtained by the following i as : iterative procedure: p(y|i; θ) ∝ ψ(v, i; θ) ψ(v, e, n , i; θ) 1) At iteration t, get the output solution of each (e,n )∈R instance i where each potential value in the CRF for subpart i,(t) (t−1) y = argmax L(λ , y) (5) x, is computed using features f from the VGG y∈Y convolutional neural network (Simonyan and Zis- serman, 2014) on an input image, as follows: 2) update the Lagrangian multipliers. w f +b i x ψ(x, i; θ) = e , (t) (t−1) i,(t) λ =max 0, λ + η(Ay − b) , where w and b are the parameters of an affine transformation layer. The model explicitly cap- (0) where λ = 0. η is the learning rate for updat- tures the correlation between activities and nouns (t−1) ing λ. Note that with a fixedλ , Eq. (5) can in semantic roles, allowing it to learn common pri- be solved using the original inference algorithms. ors. We use a model pretrained on the original task The algorithm loops until all constraints are satis- with 504 verbs. fied (i.e. optimal solution achieved) or reach max- imal number of iterations. 5.2 Multilabel Classification Dataset We use MS-COCO (Lin et al., 2014), 5 Experimental Setup a common object detection benchmark, for multi- In this section, we provide details about the two vi- label object classification. The dataset contains 80 sual recognition tasks we evaluated for bias: visual object types but does not make gender distinctions semantic role labeling (vSRL), and multi-label between man and woman. We use the five asso- classification (MLC). We focus on gender, defin- ciated image captions available for each image in ing G = {man, woman} and focus on the agent this dataset to annotate the gender of people in the 2983 images. If any of the captions mention the word 6.1 Visual Semantic Role Labeling man or woman we mark it, removing any images imSitu is gender biased In Figure 2(a), along that mention both genders. Finally, we filter any the x-axis, we show the male favoring bias of im- object category not strongly associated with hu- Situ verbs. Overall, the dataset is heavily biased mans by removing objects that do not occur with toward male agents, with 64.6% of verbs favoring man or woman at least 100 times in the training a male agent by an average bias of 0.707 (roughly set, leaving a total of 66 objects. 3:1 male). Nearly half of verbs are extremely bi- ased in the male or female direction: 46.95% of Model For this multi-label setting, we adapt a verbs favor a gender with a bias of at least 0.7. similar model as the structured CRF we use for Figure 2(a) contains several activity labels reveal- vSRL. We decompose the joint probability of the ing problematic biases. For example, shopping, output y, consisting of all object categories, c, and microwaving and washing are biased toward gender of the person, g, given an image i as: a female agent. Furthermore, several verbs such as driving, shooting, and coaching are p(y|i; θ) ∝ ψ(g, i; θ) ψ(g, c, i; θ) heavily biased toward a male agent. c∈y Training on imSitu amplifies bias In Fig- where each potential value for x, is computed us- ure 2(a), along the y-axis, we show the ratio of ing features, f , from a pretrained ResNet-50 con- male agents (% of total people) in predictions on volutional neural network evaluated on the image, an unseen development set. The mean bias ampli- w f +b fication in the development set is high, 0.050 on i x ψ(x, i; θ) = e . average, with 45.75% of verbs exhibiting ampli- fication. Biased verbs tend to have stronger am- We trained a model using SGD with learning rate plification: verbs with training bias over 0.7 in −5 −4 10 , momentum 0.9 and weight-decay 10 , fine either the male or female direction have a mean tuning the initial visual network, for 50 epochs. amplification of0.072. Several already problem- atic biases have gotten much worse. For example, 5.3 Calibration serving, only had a small bias toward females The inference problems for both models are: in the training set, 0.402, is now heavily biased toward females, 0.122. The verb tuning, origi- arg max f (y, i) = log p(y|i; θ). nally heavily biased toward males, 0.878, now has y∈Y exclusively male agents. We use the algorithm in Sec. (4) to calibrate the 6.2 Multilabel Classification predictions using model θ. Our calibration tries to enforce gender statistics derived from the training MS-COCO is gender biased In Figure 2(b) set of corpus applicable for each recognition prob- along the x-axis, similarly to imSitu, we ana- lem. For all experiments, we try to match gen- lyze bias of objects in MS-COCO with respect der ratios on the test set within a margin of .05 of to males. MS-COCO is even more heavily bi- their value on the training set. While we do adjust ased toward men than imSitu, with 86.6% of ob- the output on the test set, we never use the ground jects biased toward men, but with smaller average truth on the test set and instead working from the magnitude, 0.65. One third of the nouns are ex- assumption that it should be similarly distributed tremely biased toward males, 37.9% of nouns fa- as the training set. When running the debiasing al- vor men with a bias of at least 0.7. Some prob- −1 gorithm, we set η = 10 and optimize for 100 lematic examples include kitchen objects such as iterations. knife, fork, or spoon being more biased to- ward woman. Outdoor recreation related objects 6 Bias Analysis such tennis racket, snowboard and boat tend to be more biased toward men. In this section, we use the approaches outlined in Section 3 to quantify the bias and bias amplifi- 6 In this gender binary, bias toward woman is 1− the bias cation in the vSRL and the MLC tasks. toward man 2984 1.0 1.0 tuning snowboard coaching motorcycle aiming shoveling tie 0.9 boat shooting 0.8 traffic light skis pumping keyboard 0.8 driving hot dog 0.6 0.7 tennis racket 0.6 wine glass 0.4 0.5 spoon shopping 0.2 cooking serving 0.4 handbag knife combing fork microwaving twisting washing 0.0 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL (b) Bias analysis on MS-COCO MLC Figure 2: Gender bias analysis of imSitu vSRL and MS-COCO MLC. (a) gender bias of verbs toward man in the training set versus bias on a predicted development set. (b) gender bias of nouns toward man in the training set versus bias on the predicted development set. Values near zero indicate bias toward woman while values near 0.5 indicate unbiased variables. Across both dataset, there is significant bias toward males, and significant bias amplification after training on biased training data. Training on MS-COCO amplifies bias In Fig- 7 Calibration Results ure 2(b), along the y-axis, we show the ratio of We test our methods for reducing bias amplifica- man (% of both gender) in predictions on an un- tion in two problem settings: visual semantic role seen development set. The mean bias amplifica- labeling in the imSitu dataset (vSRL) and multil- tion across all objects is 0.036, with 65.67% of abel image classification in MS-COCO (MLC). In nouns exhibiting amplification. Larger training all settings we derive corpus constraints using the bias again tended to indicate higher bias amplifi- training set and then run our calibration method in cation: biased objects with training bias over 0.7 batch on either the development or testing set. Our had mean amplification of0.081. Again, several results are summarized in Table 2 and Figure 3. problematic biases have now been amplified. For example, kitchen categories already biased toward 7.1 Visual Semantic Role Labeling females such as knife, fork and spoon have Our quantitative results are summarized in the first all been amplified. Technology oriented categories two sections of Table 2. On the development initially biased toward men such as keyboard set, the number of verbs whose bias exceed the and mouse have each increased their bias toward original bias by over 5% decreases 30.5% (Viol.). males by over 0.100. Overall, we are able to significantly reduce bias 6.3 Discussion amplification in vSRL by 52% on the develop- We confirmed our hypothesis that (a) both the im- ment set (Amp. bias). We evaluate the under- lying recognition performance using the standard Situ and MS-COCO datasets, gathered from the measure in vSRL: top-1 semantic role accuracy, web, are heavily gender biased and that (b) mod- which tests how often the correct verb was pre- els trained to perform prediction on these datasets dicted and the noun value was correctly assigned amplify the existing gender bias when evaluated on development data. Furthermore, across both to a semantic role. Our calibration method results in a negligible decrease in performance (Perf.). In datasets, we showed that the degree of bias am- plification was related to the size of the initial Figure 3(c) we can see that the overall distance to the training set distribution after applying RBA de- bias, with highly biased object and verb categories exhibiting more bias amplification. Our results creased significantly, over 39%. demonstrate that care needs be taken in deploying Figure 3(e) demonstrates that across all initial such uncalibrated systems otherwise they could training bias, RBA is able to reduce bias amplifi- not only reinforce existing social bias but actually cation. In general, RBA struggles to remove bias make them worse. amplification in areas of low initial training bias, predicted gender ratio predicted gender ratio 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL without RBA (b) Bias analysis on MS-COCO MLC without RBA 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (c) Bias analysis on imSitu vSRL with RBA (d) Bias analysis on MS-COCO MLC with RBA 0.10 0.08 0.07 0.08 0.06 0.05 0.06 0.04 0.04 0.03 0.02 0.02 0.01 0.00 0.00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (e) Bias in vSRL with (blue) / without (red) RBA (f) Bias in MLC with (blue) / without (red) RBA Figure 3: Results of reducing bias amplification using RBA on imSitu vSRL and MS-COCO MLC. Figures 3(a)-(d) show initial training set bias along the x-axis and development set bias along the y- axis. Dotted blue lines indicate the 0.05 margin used in RBA, with points violating the margin shown in red while points meeting the margin are shown in green. Across both settings adding RBA signifi- cantly reduces the number of violations, and reduces the bias amplification significantly. Figures 3(e)-(f) demonstrate bias amplification as a function of training bias, with and without RBA. Across all initial training biases, RBA is able to reduce the bias amplification. predicted gender ratio predicted gender ratio mean bias amplification mean bias amplification predicted gender ratio predicted gender ratio Method Viol. Amp. bias Perf. (%) progress with little or no loss in underlying recog- vSRL: Development Set nition performance. Across both problems, RBA CRF 154 0.050 24.07 was able to reduce bias amplification at all initial CRF + RBA 107 0.024 23.97 values of training bias. vSRL: Test Set 8 Conclusion CRF 149 0.042 24.14 CRF + RBA 102 0.025 24.01 Structured prediction models can leverage correla- MLC: Development Set tions that allow them to make correct predictions CRF 40 0.032 45.27 even with very little underlying evidence. Yet such CRF + RBA 24 0.022 45.19 models risk potentially leveraging social bias in MLC: Test Set their training data. In this paper, we presented a CRF 38 0.040 45.40 general framework for visualizing and quantify- CRF + RBA 16 0.021 45.38 ing biases in such models and proposed RBA to calibrate their predictions under two different set- Table 2: Number of violated constraints, mean tings. Taking gender bias as an example, our anal- amplified bias, and test performance before and af- ysis demonstrates that conditional random fields ter calibration using RBA. The test performances can amplify social bias from data while our ap- of vSRL and MLC are measured by top-1 seman- proach RBA can help to reduce the bias. tic role accuracy and top-1 mean average preci- Our work is the first to demonstrate structured sion, respectively. prediction models amplify bias and the first to propose methods for reducing this effect but sig- nificant avenues for future work remain. While likely because bias is encoded in image statistics RBA can be applied to any structured predic- and cannot be removed as effectively with an im- tor, it is unclear whether different predictors am- age agnostic adjustment. Results on the test set plify bias more or less. Furthermore, we pre- support our development set results: we decrease sented only one method for measuring bias. More bias amplification by 40.5% (Amp. bias). extensive analysis could explore the interaction among predictor, bias measurement, and bias de- 7.2 Multilabel Classification amplification method. Future work also includes Our quantitative results on MS-COCO RBA are applying bias reducing methods in other struc- summarized in the last two sections of Table 2. tured domains, such as pronoun reference resolu- Similarly to vSRL, we are able to reduce the num- tion (Mitkov, 2014). ber of objects whose bias exceeds the original training bias by 5%, by 40% (Viol.). Bias amplifi- Acknowledgement This work was supported in cation was reduced by 31.3% on the development part by National Science Foundation Grant IIS- set (Amp. bias). The underlying recognition sys- 1657193 and two NVIDIA Hardware Grants. tem was evaluated by the standard measure: top- 1 mean average precision, the precision averaged References across object categories. Our calibration method results in a negligible loss in performance. In Fig- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- ure 3(d), we demonstrate that we substantially re- garet Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question an- duce the distance between training bias and bias swering. In Proceedings of the IEEE International in the development set. Finally, in Figure 3(f) we Conference on Computer Vision, pages 2425–2433. demonstrate that we decrease bias amplification for all initial training bias settings. Results on the Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley framenet project. In Proceed- test set support our development results: we de- ings of the Annual Meeting of the Association for crease bias amplification by 47.5% (Amp. bias). Computational Linguistics (ACL), pages 86–90. 7.3 Discussion Solon Barocas and Andrew D Selbst. 2014. Big data’s disparate impact. Available at SSRN 2477899. We have demonstrated that RBA can significantly reduce bias amplification. While were not able to Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, remove all amplification, we have made significant Venkatesh Saligrama, and Adam T Kalai. 2016. 2987 Man is to computer programmer as woman is to Tsung-Yi Lin, Michael Maire, Serge Belongie, James homemaker? debiasing word embeddings. In The Hays, Pietro Perona, Deva Ramanan, Piotr Dollar ´ , Conference on Advances in Neural Information Pro- and C Lawrence Zitnick. 2014. Microsoft coco: cessing Systems (NIPS), pages 4349–4357. Common objects in context. In European Confer- ence on Computer Vision, pages 740–755. Springer. Aylin Caliskan, Joanna J Bryson, and Arvind G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and Narayanan. 2017. Semantics derived automatically K.J. Miller. 1990. Wordnet: An on-line lexical from language corpora contain human-like biases. database. International Journal of Lexicography, Science, 356(6334):183–186. 3(4):235–312. Kai-Wei Chang, S. Sundararajan, and S. Sathiya Emiel van Miltenburg. 2016. Stereotyping and bias in Keerthi. 2013. Tractable semi-supervised learning the flickr30k dataset. MMC. of complex structured prediction models. In Pro- ceedings of the European Conference on Machine Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, Learning (ECML), pages 176–191. and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classifiers from noisy human- Yin-Wen Chang and Michael Collins. 2011. Exact de- centric labels. In Conference on Computer Vision coding of phrase-based translation models through and Pattern Recognition (CVPR), pages 2930–2939. Lagrangian relaxation. In EMNLP, pages 26–37. Ruslan Mitkov. 2014. Anaphora resolution. Rout- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ledge. ishna Vedantam, Saurabh Gupta, Piotr Dollar ´ , and C Lawrence Zitnick. 2015. Microsoft coco captions: Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Data collection and evaluation server. arXiv preprint Dual decomposition inference for graphical models arXiv:1504.00325. over strings. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages Bhavana Bharat Dalvi. 2015. Constrained Semi- 917–927. supervised Learning in the Presence of Unantici- pated Classes. Ph.D. thesis, Google Research. John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdren, and Jefrey Zients. 2014. Big data: Seiz- Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer ing opportunities and preserving values. Executive Reingold, and Richard Zemel. 2012. Fairness Office of the President . through awareness. In Proceedings of the 3rd In- novations in Theoretical Computer Science Confer- Karen Ross and Cynthia Carter. 2011. Women and ence, pages 214–226. ACM. news: A long and winding road. Media, Culture & Society, 33(8):1148–1165. Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubrama- Alexander M Rush and Michael Collins. 2012. A Tuto- nian. 2015. Certifying and removing disparate im- rial on Dual Decomposition and Lagrangian Relax- pact. In Proceedings of International Conference ation for Inference in Natural Language Processing. on Knowledge Discovery and Data Mining (KDD), Journal of Artificial Intelligence Research , 45:305– pages 259–268. Jonathan Gordon and Benjamin Van Durme. 2013. Re- Karen Simonyan and Andrew Zisserman. 2014. Very porting bias and knowledge extraction. Automated deep convolutional networks for large-scale image Knowledge Base Construction (AKBC). recognition. arXiv preprint arXiv:1409.1556. Inc. Gurobi Optimization. 2016. Gurobi optimizer ref- David Sontag, Amir Globerson, and Tommi Jaakkola. erence manual. 2011. Introduction to dual decomposition for infer- ence. Optimization for Machine Learning, 1:219– Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Conference on Neural Information Processing Sys- Latanya Sweeney. 2013. Discrimination in online ad tems (NIPS), pages 3315–3323. delivery. Queue, 11(3):10. Matthew Kay, Cynthia Matuszek, and Sean A Munson. Benjamin D Van Durme. 2010. Extracting implicit 2015. Unequal representation and gender stereo- knowledge from text. Ph.D. thesis, University of types in image search results for occupations. In Rochester. Human Factors in Computing Systems, pages 3819– 3828. ACM. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural im- Bernhard Korte and Jens Vygen. 2008. Combinatorial age caption generator. In Proceedings of the IEEE Optimization: Theory and Application. Springer Conference on Computer Vision and Pattern Recog- Verlag. nition, pages 3156–3164. 2988 Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, and Ali Farhadi. 2017. Commonly uncommon: Seman- tic sparsity in situation recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 5534–5542. Indre Zliobaite. 2015. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148.

Journal

Proceedings of the 2017 Conference on Empirical Methods in Natural Language ProcessingUnpaywall

Published: Jan 1, 2017

There are no references for this article.