Men Also Like Shopping: Reducing Gender Bias Amplification using            Corpus-level Constraints

Jieyu Zhao; Tianlu Wang; Mark Yatskar; Vicente Ordonez; Kai-Wei Chang

doi:10.18653/v1/d17-1323

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Zhao, Jieyu;Wang, Tianlu;Yatskar, Mark;Ordonez, Vicente;Chang, Kai-Wei

2017-01-01 00:00:00

Men Also Like Shopping: Reducing Gender Bias Ampliﬁcation using Corpus-level Constraints § § ‡ Jieyu Zhao Tianlu Wang Mark Yatskar § § Vicente Ordonez Kai-Wei Chang University of Virginia {jz4fu, tw8cb, vicente, kc2wc}@virginia.edu University of Washington [email protected] Abstract tics from images and require large quantities of la- beled data, predominantly retrieved from the web. Language is increasingly being used to de- Methods often combine structured prediction and ﬁne rich visual recognition problems with deep learning to model correlations between la- supporting image collections sourced from bels and images to make judgments that otherwise the web. Structured prediction models are would have weak visual support. For example, in used in these tasks to take advantage of the ﬁrst image of Figure 1, it is possible to pre- correlations between co-occurring labels dict a spatula by considering that it is a com- and visual input but risk inadvertently en- mon tool used for the activity cooking. Yet such coding social biases found in web corpora. methods run the risk of discovering and exploiting In this work, we study data and models as- societal biases present in the underlying web cor- sociated with multilabel object classiﬁca- pora. Without properly quantifying and reducing tion and visual semantic role labeling. We the reliance on such correlations, broad adoption ﬁnd that (a) datasets for these tasks con- of these models can have the inadvertent effect of tain signiﬁcant gender bias and (b) mod- magnifying stereotypes. els trained on these datasets further am- In this paper, we develop a general framework plify existing bias. For example, the ac- for quantifying bias and study two concrete tasks, tivity cooking is over 33% more likely visual semantic role labeling (vSRL) and multil- to involve females than males in a train- abel object classiﬁcation (MLC). In vSRL, we use ing set, and a trained model further ampli- the imSitu formalism (Yatskar et al., 2016, 2017), ﬁes the disparity to 68% at test time. We where the goal is to predict activities, objects and propose to inject corpus-level constraints the roles those objects play within an activity. For for calibrating existing structured predic- MLC, we use MS-COCO (Lin et al., 2014; Chen tion models and design an algorithm based et al., 2015), a recognition task covering 80 object on Lagrangian relaxation for collective in- classes. We use gender bias as a running example ference. Our method results in almost no and show that both supporting datasets for these performance loss for the underlying recog- 1 tasks are biased with respect to a gender binary . nition task but decreases the magnitude of Our analysis reveals that over 45% and 37% bias ampliﬁcation by 47.5% and 40.5% for of verbs and objects, respectively, exhibit bias to- multilabel classiﬁcation and visual seman- ward a gender greater than 2:1. For example, as tic role labeling, respectively. seen in Figure 1, the cooking activity in imSitu is a heavily biased verb. Furthermore, we show 1 Introduction that after training state-of-the-art structured pre- dictors, models amplify the existing bias, by 5.0% Visual recognition tasks involving language, such for vSRL, and 3.6% in MLC. as captioning (Vinyals et al., 2015), visual ques- tion answering (Antol et al., 2015), and visual se- To simplify our analysis, we only consider a gender bi- mantic role labeling (Yatskar et al., 2016), have nary as perceived by annotators in the datasets. We recog- nize that a more ﬁne-grained analysis would be needed for emerged as avenues for expanding the diversity deployment in a production system. Also, note that the pro- of information that can be recovered from im- posed approach can be applied to other NLP tasks and other ages. These tasks aim at extracting rich seman- variables such as identiﬁcation with a racial or ethnic group. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989 Copenhagen, Denmark, September 7–11, 2017. c 2017 Association for Computational Linguistics COOKING COOKING COOKING COOKING COOKING ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE ROLE VALUE AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT WOMAN AGENT MAN ∅ ∅ FOOD PASTA FOOD FRUIT FOOD FOOD FOOD MEAT HEAT HEAT STOVE HEAT STOVE HEAT STOVE HEAT STOVE TOOL SPATULA TOOL KNIFE TOOL SPATULA TOOL SPATULA TOOL SPATULA PLACE KITCHEN PLACE KITCHEN PLACE OUTSIDE PLACE KITCHEN PLACE KITCHEN Figure 1: Five example images from the imSitu visual semantic role labeling (vSRL) dataset. Each im- age is paired with a table describing a situation: the verb, cooking, its semantic roles, i.e agent, and noun values ﬁlling that role, i.e.woman. In the imSitu training set, 33% of cooking images have man in the agent role while the rest have woman. After training a Conditional Random Field (CRF), bias is ampliﬁed:man ﬁlls 16% ofagent roles in cooking images. To reduce this bias ampliﬁcation our cal- ibration method adjusts weights of CRF potentials associated with biased predictions. After applying our methods, man appears in the agent role of 20% of cooking images, reducing the bias ampliﬁcation by 25%, while keeping the CRF vSRL performance unchanged. To mitigate the role of bias ampliﬁcation when 2 Related Work training models on biased corpora, we propose As intelligence systems start playing important a novel constrained inference framework, called roles in our daily life, ethics in artiﬁcial in- RBA, for Reducing Bias Ampliﬁcation in predic- telligence research has attracted signiﬁcant in- tions. Our method introduces corpus-level con- terest. It is known that big-data technologies straints so that gender indicators co-occur no more sometimes inadvertently worsen discrimination often together with elements of the prediction task due to implicit biases in data (Podesta et al., than in the original training distribution. For ex- 2014). Such issues have been demonstrated in var- ample, as seen in Figure 1, we would like noun ious learning systems, including online advertise- man to occur in the agent role of the cooking ment systems (Sweeney, 2013), word embedding as often as it occurs in the imSitu training set when models (Bolukbasi et al., 2016; Caliskan et al., evaluating on a development set. We combine 2017), online news (Ross and Carter, 2011), web our calibration constraint with the original struc- search (Kay et al., 2015), and credit score (Hardt tured predictor and use Lagrangian relaxation (Ko- et al., 2016). Data collection biases have been rte and Vygen, 2008; Rush and Collins, 2012) to discussed in the context of creating image cor- reweigh bias creating factors in the original model. pus (Misra et al., 2016; van Miltenburg, 2016) We evaluate our calibration method on imSitu and text corpus (Gordon and Van Durme, 2013; vSRL and COCO MLC and ﬁnd that in both in- Van Durme, 2010). In contrast, we show that given stances, our models substantially reduce bias am- a gender biased corpus, structured models such as pliﬁcation. For vSRL, we reduce the average mag- conditional random ﬁelds, amplify the bias. nitude of bias ampliﬁcation by 40.5%. For MLC, The effect of the data imbalance can be easily we are able to reduce the average magnitude of detected and ﬁxed when the prediction task is sim- bias ampliﬁcation by 47.5%. Overall, our calibra- ple. For example, when classifying binary data tion methods do not affect the performance of the with unbalanced labels (i.e., samples in the major- underlying visual system, while substantially re- ity class dominate the dataset), a classiﬁer trained ducing the reliance of the system on socially bi- exclusively to optimize accuracy learns to always ased correlations . predict the majority label, as the cost of mak- ing mistakes on samples in the minority class can be neglected. Various approaches have been pro- 2 posed to make a “fair” binary classiﬁcation (Baro- Code and data are available at https://github. com/uclanlp/reducingbias cas and Selbst, 2014; Dwork et al., 2012; Feldman 2980 et al., 2015; Zliobaite, 2015). For structured pre- variable, g, as: diction tasks the effect is harder to quantify and c(o, g) we are the ﬁrst to propose methods to reduce bias b(o, g) = , c(o, g ) ampliﬁcation in this context. g ∈G Lagrangian relaxation and dual decomposi- where c(o, g) is the number of occurrences of o tion techniques have been widely used in NLP and g in a corpus. For example, to analyze how tasks (e.g., (Sontag et al., 2011; Rush and Collins, genders of agents and activities are co-related in 2012; Chang and Collins, 2011; Peng et al., 2015)) vSRL, we deﬁne the gender bias towardman for for dealing with instance-level constraints. Simi- each verb b(verb, man) as: lar techniques (Chang et al., 2013; Dalvi, 2015) have been applied in handling corpus-level con- c(verb, man) . (1) straints for semi-supervised multilabel classiﬁca- c(verb, man) + c(verb, woman) tion. In contrast to previous works aiming for If b(o, g) > 1/kGk, then o is positively correlated improving accuracy performance, we incorporate corpus-level constraints for reducing gender bias. with g and may exhibit bias. Evaluating bias ampliﬁcation To evaluate the degree of bias ampliﬁcation, we propose to com- 3 Visualizing and Quantifying Biases pare bias scores on the training set, b (o, g), with bias scores on an unlabeled evaluation set of im- Modern statistical learning approaches capture ages b(o, g) that has been annotated by a predic- correlations among output variables in order to tor. We assume that the evaluation set is iden- make coherent predictions. However, for real- tically distributed to the training set. There- world applications, some implicit correlations are fore, if o is positively correlated with g (i.e, not appropriate, especially if they are ampliﬁed. b (o, g) > 1/kGk) and b(o, g) is larger than In this section, we present a general framework to b (o, g), we say bias has been ampliﬁed. For analyze inherent biases learned and ampliﬁed by a example, if b (cooking, woman) = .66, and prediction model. b(cooking, woman) = .84, then the bias of woman toward cooking has been ampliﬁed. Fi- Identifying bias We consider that prediction nally, we deﬁne the mean bias ampliﬁcation as: problems involve several inter-dependent output variables y , y , ...y , which can be represented 1 2 K X X as a structure y = {y , y , ...y } ∈ Y . This ∗ 1 2 K b(o, g)− b (o, g). |O| is a common setting in NLP applications, includ- o∈{o∈O|b (o,g)>1/kGk} ing tagging, and parsing. For example, in the vSRL task, the output can be represented as a This score estimates the average magnitude of bias structured table as shown in Fig 1. Modern tech- ampliﬁcation for pairs ofo and g which exhibited niques often model the correlation between the bias. sub-components in y and make a joint prediction 4 Calibration Algorithm over them using a structured prediction model. More details will be provided in Section 4. In this section, we introduce Reducing Bias We assume there is a subset of output vari- Ampliﬁcation, RBA, a debiasing technique for ables g ⊆ y, g ∈ G that reﬂects demographic at- calibrating the predictions from a structured pre- tributes such as gender or race (e.g. g ∈ G = diction model. The intuition behind the algorithm {man, woman} is the agent), and there is another is to inject constraints to ensure the model pre- subset of the output o ⊆ y, o ∈ O that are co- dictions follow the distribution observed from the related with g (e.g., o is the activity present in an training data. For example, the constraints added image, such as cooking). The goal is to identify to the vSRL system ensure the gender ratio of each the correlations that are potentially ampliﬁed by a verb in Eq. (1) are within a given margin based on learned model. the statistics of the training data. These constraints To achieve this, we deﬁne the bias score of a are applied at the corpus level, because comput- given output, o, with respect to a demographic ing gender ratio requires the predictions of all test 2981 instances. As a result, a joint inference over test represents the overall score of an assignment, and instances is required . Solving such a giant in- s (v, i) and s (v, r, i) are the potentials of the sub- θ θ ference problem with constraints is hard. There- assignments. The output space Y contains all fea- fore, we present an approximate inference algo- sible assignments of y and y , which can be rep- v v,r rithm based on Lagrangian relaxation. The advan- resented as instance-wise constraints. For exam- tages of this approach are: ple, the constraint, y = 1 ensures only one activity is assigned to one image. • Our algorithm is iterative, and at each it- eration, the joint inference problem is de- Corpus-level Constraints Our goal is to inject composed to a per-instance basis. This can constraints to ensure the output labels follow a be solved by the original inference algo- desired distribution. For example, we can set a rithm. That is, our approach works as a meta- constraint to ensure the gender ratio for each ac- algorithm and developers do not need to im- tivity in Eq. (1) is within a given margin. Let i i i plement a new inference algorithm. y = {y } ∪ {y } be the output assignment for v v,r 5 ∗ test instance i . For each activity v , the con- • The approach is general and can be applied in straints can be written as any structured model. i v=v ,r∈M • Lagrangian relaxation guarantees the solu- ∗ ∗ b −γ≤ P P ≤b + γ i i y + y tion is optimal if the algorithm converges and ∗ ∗ i v=v ,r∈W i v=v ,r∈M (2) all constraints are satisﬁed. ∗ ∗ ∗ where b ≡ b (v , man) is the desired gender ra- In practice, it is hard to obtain a solution where tio of an activity v , γ is a user-speciﬁed margin. all corpus-level constrains are satisﬁed. However, M and W are a set of semantic role-values rep- we show that the performance of the proposed ap- resenting the agent as a man or a woman, respec- proach is empirically strong. We use imSitu for tively. vSRL as a running example to explain our algo- Note that the constraints in (2) involve all the rithm. test instances. Therefore, it requires a joint in- Structured Output Prediction As we men- ference over the entire test corpus. In general, tioned in Sec. 3, we assume the structured output these corpus-level constraints can be represented y ∈ Y consists of several sub-components. Given in a form of A y − b ≤ 0, where each row l×K a test instance i as an input, the inference problem in the matrix A ∈ R is the coefﬁcients of one is to ﬁnd constraint, and b ∈ R . The constrained inference arg max f (y, i), problem can then be formulated as: y∈Y where f (y, i) is a scoring function based on a max f (y , i), i i {y }∈{Y } model θ learned from the training data. The struc- X (3) tured output y and the scoring function f (y, i) can s.t. A y − b ≤ 0, be decomposed into small components based on an independence assumption. For example, in the vSRL task, the output y consists of two types of where {Y } represents a space spanned by possi- binary output variables{y } and{y }. The vari- ble combinations of labels for all instances. With- v v,r able y = 1 if and only if the activity v is chosen. out the corpus-level constraints, Eq. (3) can be Similarly, y = 1 if and only if both the activity v optimized by maximizing each instance i v,r and the semantic role r are assigned . The scoring max f (y , i), function f (y, i) is decomposed accordingly such y ∈Y that: X X separately. f (y, i) = y s (v, i) + y s (v, r, i), θ v θ v,r θ v v,r Lagrangian Relaxation Eq. (3) can be solved by several combinatorial optimization methods. A sufﬁciently large sample of test instances must be used so that bias statistics can be estimated. In this work we use For example, one can represent the problem as an the entire test set for each respective problem. 4 5 We use r to refer to a combination of role and noun. For For the sake of simplicity, we abuse the notations and use example, one possible value indicates an agent is a woman. i to represent both input and data index. 2982 Dataset Task Images O-Type kOk role in vSRL, and any occurrence in text associ- imSitu vSRL 60,000 verb 212 ated with the images in MLC. Problem statistics MS-COCO MLC 25,000 object 66 are summarized in Table 1. We also provide setup details for our calibration method. Table 1: Statistics for the two recognition prob- 5.1 Visual Semantic Role Labeling lems. In vSRL, we consider gender bias relating to verbs, while in MLC we consider the gender Dataset We evaluate on imSitu (Yatskar et al., bias related to objects. 2016) where activity classes are drawn from verbs and roles in FrameNet (Baker et al., 1998) and noun categories are drawn from WordNet (Miller integer linear program and solve it using an off- et al., 1990). The original dataset includes about the-shelf solver (e.g., Gurobi (Gurobi Optimiza- 125,000 images with 75,702 for training, 25,200 tion, 2016)). However, Eq. (3) involves all test in- for developing, and 25,200 for test. However, the stances. Solving a constrained optimization prob- dataset covers many non-human oriented activities lem on such a scale is difﬁcult. Therefore, we con- (e.g., rearing, retrieving, and wagging), sider relaxing the constraints and solve Eq. (3) us- so we ﬁlter out these verbs, resulting in 212 verbs, ing a Lagrangian relaxation technique (Rush and leaving roughly 60,000 of the original 125,000 im- Collins, 2012). We introduce a Lagrangian multi- ages in the dataset. plier λ ≥ 0 for each corpus-level constraint. The Lagrangian is Model We build on the baseline CRF released with the data, which has been shown effective L(λ,{y }) = compared to a non-structured prediction base- X X X (4) line (Yatskar et al., 2016). The model decomposes i i f (y )− λ A y − b , θ j j j the probability of a realized situation, y, the com- i j=1 i bination of activity, v, and realized frame, a set of where all the λ ≥ 0,∀j ∈ {1, . . . , l}. The solu- semantic (role,noun) pairs (e, n ), given an image j e tion of Eq. (3) can be obtained by the following i as : iterative procedure: p(y|i; θ) ∝ ψ(v, i; θ) ψ(v, e, n , i; θ) 1) At iteration t, get the output solution of each (e,n )∈R instance i where each potential value in the CRF for subpart i,(t) (t−1) y = argmax L(λ , y) (5) x, is computed using features f from the VGG y∈Y convolutional neural network (Simonyan and Zis- serman, 2014) on an input image, as follows: 2) update the Lagrangian multipliers. w f +b i x ψ(x, i; θ) = e , (t) (t−1) i,(t) λ =max 0, λ + η(Ay − b) , where w and b are the parameters of an afﬁne transformation layer. The model explicitly cap- (0) where λ = 0. η is the learning rate for updat- tures the correlation between activities and nouns (t−1) ing λ. Note that with a ﬁxedλ , Eq. (5) can in semantic roles, allowing it to learn common pri- be solved using the original inference algorithms. ors. We use a model pretrained on the original task The algorithm loops until all constraints are satis- with 504 verbs. ﬁed (i.e. optimal solution achieved) or reach max- imal number of iterations. 5.2 Multilabel Classiﬁcation Dataset We use MS-COCO (Lin et al., 2014), 5 Experimental Setup a common object detection benchmark, for multi- In this section, we provide details about the two vi- label object classiﬁcation. The dataset contains 80 sual recognition tasks we evaluated for bias: visual object types but does not make gender distinctions semantic role labeling (vSRL), and multi-label between man and woman. We use the ﬁve asso- classiﬁcation (MLC). We focus on gender, deﬁn- ciated image captions available for each image in ing G = {man, woman} and focus on the agent this dataset to annotate the gender of people in the 2983 images. If any of the captions mention the word 6.1 Visual Semantic Role Labeling man or woman we mark it, removing any images imSitu is gender biased In Figure 2(a), along that mention both genders. Finally, we ﬁlter any the x-axis, we show the male favoring bias of im- object category not strongly associated with hu- Situ verbs. Overall, the dataset is heavily biased mans by removing objects that do not occur with toward male agents, with 64.6% of verbs favoring man or woman at least 100 times in the training a male agent by an average bias of 0.707 (roughly set, leaving a total of 66 objects. 3:1 male). Nearly half of verbs are extremely bi- ased in the male or female direction: 46.95% of Model For this multi-label setting, we adapt a verbs favor a gender with a bias of at least 0.7. similar model as the structured CRF we use for Figure 2(a) contains several activity labels reveal- vSRL. We decompose the joint probability of the ing problematic biases. For example, shopping, output y, consisting of all object categories, c, and microwaving and washing are biased toward gender of the person, g, given an image i as: a female agent. Furthermore, several verbs such as driving, shooting, and coaching are p(y|i; θ) ∝ ψ(g, i; θ) ψ(g, c, i; θ) heavily biased toward a male agent. c∈y Training on imSitu ampliﬁes bias In Fig- where each potential value for x, is computed us- ure 2(a), along the y-axis, we show the ratio of ing features, f , from a pretrained ResNet-50 con- male agents (% of total people) in predictions on volutional neural network evaluated on the image, an unseen development set. The mean bias ampli- w f +b ﬁcation in the development set is high, 0.050 on i x ψ(x, i; θ) = e . average, with 45.75% of verbs exhibiting ampli- ﬁcation. Biased verbs tend to have stronger am- We trained a model using SGD with learning rate pliﬁcation: verbs with training bias over 0.7 in −5 −4 10 , momentum 0.9 and weight-decay 10 , ﬁne either the male or female direction have a mean tuning the initial visual network, for 50 epochs. ampliﬁcation of0.072. Several already problem- atic biases have gotten much worse. For example, 5.3 Calibration serving, only had a small bias toward females The inference problems for both models are: in the training set, 0.402, is now heavily biased toward females, 0.122. The verb tuning, origi- arg max f (y, i) = log p(y|i; θ). nally heavily biased toward males, 0.878, now has y∈Y exclusively male agents. We use the algorithm in Sec. (4) to calibrate the 6.2 Multilabel Classiﬁcation predictions using model θ. Our calibration tries to enforce gender statistics derived from the training MS-COCO is gender biased In Figure 2(b) set of corpus applicable for each recognition prob- along the x-axis, similarly to imSitu, we ana- lem. For all experiments, we try to match gen- lyze bias of objects in MS-COCO with respect der ratios on the test set within a margin of .05 of to males. MS-COCO is even more heavily bi- their value on the training set. While we do adjust ased toward men than imSitu, with 86.6% of ob- the output on the test set, we never use the ground jects biased toward men, but with smaller average truth on the test set and instead working from the magnitude, 0.65. One third of the nouns are ex- assumption that it should be similarly distributed tremely biased toward males, 37.9% of nouns fa- as the training set. When running the debiasing al- vor men with a bias of at least 0.7. Some prob- −1 gorithm, we set η = 10 and optimize for 100 lematic examples include kitchen objects such as iterations. knife, fork, or spoon being more biased to- ward woman. Outdoor recreation related objects 6 Bias Analysis such tennis racket, snowboard and boat tend to be more biased toward men. In this section, we use the approaches outlined in Section 3 to quantify the bias and bias ampliﬁ- 6 In this gender binary, bias toward woman is 1− the bias cation in the vSRL and the MLC tasks. toward man 2984 1.0 1.0 tuning snowboard coaching motorcycle aiming shoveling tie 0.9 boat shooting 0.8 traffic light skis pumping keyboard 0.8 driving hot dog 0.6 0.7 tennis racket 0.6 wine glass 0.4 0.5 spoon shopping 0.2 cooking serving 0.4 handbag knife combing fork microwaving twisting washing 0.0 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL (b) Bias analysis on MS-COCO MLC Figure 2: Gender bias analysis of imSitu vSRL and MS-COCO MLC. (a) gender bias of verbs toward man in the training set versus bias on a predicted development set. (b) gender bias of nouns toward man in the training set versus bias on the predicted development set. Values near zero indicate bias toward woman while values near 0.5 indicate unbiased variables. Across both dataset, there is signiﬁcant bias toward males, and signiﬁcant bias ampliﬁcation after training on biased training data. Training on MS-COCO ampliﬁes bias In Fig- 7 Calibration Results ure 2(b), along the y-axis, we show the ratio of We test our methods for reducing bias ampliﬁca- man (% of both gender) in predictions on an un- tion in two problem settings: visual semantic role seen development set. The mean bias ampliﬁca- labeling in the imSitu dataset (vSRL) and multil- tion across all objects is 0.036, with 65.67% of abel image classiﬁcation in MS-COCO (MLC). In nouns exhibiting ampliﬁcation. Larger training all settings we derive corpus constraints using the bias again tended to indicate higher bias ampliﬁ- training set and then run our calibration method in cation: biased objects with training bias over 0.7 batch on either the development or testing set. Our had mean ampliﬁcation of0.081. Again, several results are summarized in Table 2 and Figure 3. problematic biases have now been ampliﬁed. For example, kitchen categories already biased toward 7.1 Visual Semantic Role Labeling females such as knife, fork and spoon have Our quantitative results are summarized in the ﬁrst all been ampliﬁed. Technology oriented categories two sections of Table 2. On the development initially biased toward men such as keyboard set, the number of verbs whose bias exceed the and mouse have each increased their bias toward original bias by over 5% decreases 30.5% (Viol.). males by over 0.100. Overall, we are able to signiﬁcantly reduce bias 6.3 Discussion ampliﬁcation in vSRL by 52% on the develop- We conﬁrmed our hypothesis that (a) both the im- ment set (Amp. bias). We evaluate the under- lying recognition performance using the standard Situ and MS-COCO datasets, gathered from the measure in vSRL: top-1 semantic role accuracy, web, are heavily gender biased and that (b) mod- which tests how often the correct verb was pre- els trained to perform prediction on these datasets dicted and the noun value was correctly assigned amplify the existing gender bias when evaluated on development data. Furthermore, across both to a semantic role. Our calibration method results in a negligible decrease in performance (Perf.). In datasets, we showed that the degree of bias am- pliﬁcation was related to the size of the initial Figure 3(c) we can see that the overall distance to the training set distribution after applying RBA de- bias, with highly biased object and verb categories exhibiting more bias ampliﬁcation. Our results creased signiﬁcantly, over 39%. demonstrate that care needs be taken in deploying Figure 3(e) demonstrates that across all initial such uncalibrated systems otherwise they could training bias, RBA is able to reduce bias ampliﬁ- not only reinforce existing social bias but actually cation. In general, RBA struggles to remove bias make them worse. ampliﬁcation in areas of low initial training bias, predicted gender ratio predicted gender ratio 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (a) Bias analysis on imSitu vSRL without RBA (b) Bias analysis on MS-COCO MLC without RBA 1.2 1.1 1.0 1.0 0.9 0.8 0.8 0.6 0.7 0.4 0.6 0.2 0.5 0.0 0.4 0.2 0.3 0.0 0.2 0.4 0.6 0.8 1.0 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (c) Bias analysis on imSitu vSRL with RBA (d) Bias analysis on MS-COCO MLC with RBA 0.10 0.08 0.07 0.08 0.06 0.05 0.06 0.04 0.04 0.03 0.02 0.02 0.01 0.00 0.00 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 0.4 0.5 0.6 0.7 0.8 0.9 1.0 training gender ratio training gender ratio (e) Bias in vSRL with (blue) / without (red) RBA (f) Bias in MLC with (blue) / without (red) RBA Figure 3: Results of reducing bias ampliﬁcation using RBA on imSitu vSRL and MS-COCO MLC. Figures 3(a)-(d) show initial training set bias along the x-axis and development set bias along the y- axis. Dotted blue lines indicate the 0.05 margin used in RBA, with points violating the margin shown in red while points meeting the margin are shown in green. Across both settings adding RBA signiﬁ- cantly reduces the number of violations, and reduces the bias ampliﬁcation signiﬁcantly. Figures 3(e)-(f) demonstrate bias ampliﬁcation as a function of training bias, with and without RBA. Across all initial training biases, RBA is able to reduce the bias ampliﬁcation. predicted gender ratio predicted gender ratio mean bias amplification mean bias amplification predicted gender ratio predicted gender ratio Method Viol. Amp. bias Perf. (%) progress with little or no loss in underlying recog- vSRL: Development Set nition performance. Across both problems, RBA CRF 154 0.050 24.07 was able to reduce bias ampliﬁcation at all initial CRF + RBA 107 0.024 23.97 values of training bias. vSRL: Test Set 8 Conclusion CRF 149 0.042 24.14 CRF + RBA 102 0.025 24.01 Structured prediction models can leverage correla- MLC: Development Set tions that allow them to make correct predictions CRF 40 0.032 45.27 even with very little underlying evidence. Yet such CRF + RBA 24 0.022 45.19 models risk potentially leveraging social bias in MLC: Test Set their training data. In this paper, we presented a CRF 38 0.040 45.40 general framework for visualizing and quantify- CRF + RBA 16 0.021 45.38 ing biases in such models and proposed RBA to calibrate their predictions under two different set- Table 2: Number of violated constraints, mean tings. Taking gender bias as an example, our anal- ampliﬁed bias, and test performance before and af- ysis demonstrates that conditional random ﬁelds ter calibration using RBA. The test performances can amplify social bias from data while our ap- of vSRL and MLC are measured by top-1 seman- proach RBA can help to reduce the bias. tic role accuracy and top-1 mean average preci- Our work is the ﬁrst to demonstrate structured sion, respectively. prediction models amplify bias and the ﬁrst to propose methods for reducing this effect but sig- niﬁcant avenues for future work remain. While likely because bias is encoded in image statistics RBA can be applied to any structured predic- and cannot be removed as effectively with an im- tor, it is unclear whether different predictors am- age agnostic adjustment. Results on the test set plify bias more or less. Furthermore, we pre- support our development set results: we decrease sented only one method for measuring bias. More bias ampliﬁcation by 40.5% (Amp. bias). extensive analysis could explore the interaction among predictor, bias measurement, and bias de- 7.2 Multilabel Classiﬁcation ampliﬁcation method. Future work also includes Our quantitative results on MS-COCO RBA are applying bias reducing methods in other struc- summarized in the last two sections of Table 2. tured domains, such as pronoun reference resolu- Similarly to vSRL, we are able to reduce the num- tion (Mitkov, 2014). ber of objects whose bias exceeds the original training bias by 5%, by 40% (Viol.). Bias ampliﬁ- Acknowledgement This work was supported in cation was reduced by 31.3% on the development part by National Science Foundation Grant IIS- set (Amp. bias). The underlying recognition sys- 1657193 and two NVIDIA Hardware Grants. tem was evaluated by the standard measure: top- 1 mean average precision, the precision averaged References across object categories. Our calibration method results in a negligible loss in performance. In Fig- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar- ure 3(d), we demonstrate that we substantially re- garet Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question an- duce the distance between training bias and bias swering. In Proceedings of the IEEE International in the development set. Finally, in Figure 3(f) we Conference on Computer Vision, pages 2425–2433. demonstrate that we decrease bias ampliﬁcation for all initial training bias settings. Results on the Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The Berkeley framenet project. In Proceed- test set support our development results: we de- ings of the Annual Meeting of the Association for crease bias ampliﬁcation by 47.5% (Amp. bias). Computational Linguistics (ACL), pages 86–90. 7.3 Discussion Solon Barocas and Andrew D Selbst. 2014. Big data’s disparate impact. Available at SSRN 2477899. We have demonstrated that RBA can signiﬁcantly reduce bias ampliﬁcation. While were not able to Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, remove all ampliﬁcation, we have made signiﬁcant Venkatesh Saligrama, and Adam T Kalai. 2016. 2987 Man is to computer programmer as woman is to Tsung-Yi Lin, Michael Maire, Serge Belongie, James homemaker? debiasing word embeddings. In The Hays, Pietro Perona, Deva Ramanan, Piotr Dollar ´ , Conference on Advances in Neural Information Pro- and C Lawrence Zitnick. 2014. Microsoft coco: cessing Systems (NIPS), pages 4349–4357. Common objects in context. In European Confer- ence on Computer Vision, pages 740–755. Springer. Aylin Caliskan, Joanna J Bryson, and Arvind G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and Narayanan. 2017. Semantics derived automatically K.J. Miller. 1990. Wordnet: An on-line lexical from language corpora contain human-like biases. database. International Journal of Lexicography, Science, 356(6334):183–186. 3(4):235–312. Kai-Wei Chang, S. Sundararajan, and S. Sathiya Emiel van Miltenburg. 2016. Stereotyping and bias in Keerthi. 2013. Tractable semi-supervised learning the ﬂickr30k dataset. MMC. of complex structured prediction models. In Pro- ceedings of the European Conference on Machine Ishan Misra, C Lawrence Zitnick, Margaret Mitchell, Learning (ECML), pages 176–191. and Ross Girshick. 2016. Seeing through the human reporting bias: Visual classiﬁers from noisy human- Yin-Wen Chang and Michael Collins. 2011. Exact de- centric labels. In Conference on Computer Vision coding of phrase-based translation models through and Pattern Recognition (CVPR), pages 2930–2939. Lagrangian relaxation. In EMNLP, pages 26–37. Ruslan Mitkov. 2014. Anaphora resolution. Rout- Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr- ledge. ishna Vedantam, Saurabh Gupta, Piotr Dollar ´ , and C Lawrence Zitnick. 2015. Microsoft coco captions: Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015. Data collection and evaluation server. arXiv preprint Dual decomposition inference for graphical models arXiv:1504.00325. over strings. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages Bhavana Bharat Dalvi. 2015. Constrained Semi- 917–927. supervised Learning in the Presence of Unantici- pated Classes. Ph.D. thesis, Google Research. John Podesta, Penny Pritzker, Ernest J. Moniz, John Holdren, and Jefrey Zients. 2014. Big data: Seiz- Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer ing opportunities and preserving values. Executive Reingold, and Richard Zemel. 2012. Fairness Ofﬁce of the President . through awareness. In Proceedings of the 3rd In- novations in Theoretical Computer Science Confer- Karen Ross and Cynthia Carter. 2011. Women and ence, pages 214–226. ACM. news: A long and winding road. Media, Culture & Society, 33(8):1148–1165. Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubrama- Alexander M Rush and Michael Collins. 2012. A Tuto- nian. 2015. Certifying and removing disparate im- rial on Dual Decomposition and Lagrangian Relax- pact. In Proceedings of International Conference ation for Inference in Natural Language Processing. on Knowledge Discovery and Data Mining (KDD), Journal of Artiﬁcial Intelligence Research , 45:305– pages 259–268. Jonathan Gordon and Benjamin Van Durme. 2013. Re- Karen Simonyan and Andrew Zisserman. 2014. Very porting bias and knowledge extraction. Automated deep convolutional networks for large-scale image Knowledge Base Construction (AKBC). recognition. arXiv preprint arXiv:1409.1556. Inc. Gurobi Optimization. 2016. Gurobi optimizer ref- David Sontag, Amir Globerson, and Tommi Jaakkola. erence manual. 2011. Introduction to dual decomposition for infer- ence. Optimization for Machine Learning, 1:219– Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Conference on Neural Information Processing Sys- Latanya Sweeney. 2013. Discrimination in online ad tems (NIPS), pages 3315–3323. delivery. Queue, 11(3):10. Matthew Kay, Cynthia Matuszek, and Sean A Munson. Benjamin D Van Durme. 2010. Extracting implicit 2015. Unequal representation and gender stereo- knowledge from text. Ph.D. thesis, University of types in image search results for occupations. In Rochester. Human Factors in Computing Systems, pages 3819– 3828. ACM. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural im- Bernhard Korte and Jens Vygen. 2008. Combinatorial age caption generator. In Proceedings of the IEEE Optimization: Theory and Application. Springer Conference on Computer Vision and Pattern Recog- Verlag. nition, pages 3156–3164. 2988 Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, and Ali Farhadi. 2017. Commonly uncommon: Seman- tic sparsity in situation recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi. 2016. Situation recognition: Visual semantic role labeling for image understanding. In Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 5534–5542. Indre Zliobaite. 2015. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148.

http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing Unpaywall

http://www.deepdyve.com/lp/unpaywall/men-also-like-shopping-reducing-gender-bias-amplification-using-corpus-NqOLbLCdTa

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Zhao, Jieyu; Wang, Tianlu; Yatskar, Mark; Ordonez, Vicente; Chang, Kai-Wei

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing – Jan 1, 2017

Loading next page...

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher: Unpaywall
DOI: 10.18653/v1/d17-1323
Publisher site: See Article on Publisher Site

Abstract

Journal

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing – Unpaywall

Published: Jan 1, 2017

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

References

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies