| Title | Cited by | Year |
|---|---|---|
| GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding A Wang, A Singh, J Michael, F Hill, O Levy, SR Bowman Proceedings of ICLR, 2019 | 9417 | 2019 |
| A large annotated corpus for learning natural language inference SR Bowman, G Angeli, C Potts, CD Manning Proceedings of EMNLP, 2015 | 5614 | 2015 |
| A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference A Williams, N Nangia, SR Bowman Proceedings of NAACL-HLT, 2018 | 5502 | 2018 |
| Generating sentences from a continuous space SR Bowman, L Vilnis, O Vinyals, AM Dai, R Jozefowicz, S Bengio Proceedings of CoNLL, 2016 | 3131 | 2016 |
| SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems A Wang, Y Pruksachatkun, N Nangia, A Singh, J Michael, F Hill, O Levy, ... Proceedings of NeurIPS, 2019 | 2945 | 2019 |
| Constitutional AI: Harmlessness from AI feedback Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022 | 2223 | 2022 |
| Beyond the imitation game: Quantifying and extrapolating the capabilities of language models A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ... TMLR, 2023 | 2011 | 2023 |
| Neural network acceptability judgments A Warstadt, A Singh, SR Bowman TACL 7, 625-641, 2019 | 1703 | 2019 |
| XNLI: Evaluating Cross-lingual Sentence Representations A Conneau, G Lample, R Rinott, A Williams, SR Bowman, H Schwenk, ... Proceedings of EMNLP, 2018 | 1680 | 2018 |
| Annotation artifacts in natural language inference data S Gururangan, S Swayamdipta, O Levy, R Schwartz, SR Bowman, ... Proceedings of NAACL, 2018 | 1393 | 2018 |
| GPQA: A graduate-level google-proof Q&A benchmark D Rein, BL Hou, AC Stickland, J Petty, RY Pang, J Dirani, J Michael, ... arXiv preprint arXiv:2311.12022, 2023 | 1272 | 2023 |
| What do you learn from context? Probing for sentence structure in contextualized word representations I Tenney, P Xia, B Chen, A Wang, A Poliak, RT McCoy, N Kim, ... Proceedings of ICLR, 2019 | 1081 | 2019 |
| CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models N Nangia, C Vania, R Bhalerao, SR Bowman Proceedings of EMNLP, 2020 | 917 | 2020 |
| Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022 | 831 | 2022 |
| On Measuring Social Biases in Sentence Encoders C May, A Wang, S Bordia, SR Bowman, R Rudinger Proceedings of NAACL-HLT, 2019 | 813 | 2019 |
| Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting M Turpin, J Michael, E Perez, S Bowman Advances in Neural Information Processing Systems 36, 2023 | 661 | 2023 |
| BLiMP: A benchmark of linguistic minimal pairs for English A Warstadt, A Parrish, H Liu, A Mohananey, W Peng, SF Wang, ... TACL, 2020 | 628 | 2020 |
| Language models (mostly) know what they know S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ... arXiv preprint arXiv:2207.05221, 2022 | 621 | 2022 |
| BBQ: A Hand-Built Bias Benchmark for Question Answering A Parrish, A Chen, N Nangia, V Padmakumar, J Phang, J Thompson, ... Findings of ACL, 2022 | 603 | 2022 |
| Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks J Phang, T Févry, SR Bowman arXiv preprint arXiv:1811.01088, 2018 | 528 | 2018 |