class: center, middle, inverse, title-slide

# Big Data, Big Problems?
## Insights from Teaching Web Scraping
### Fabian Gülzau
### HU Berlin
### 30 August, 2021

---

# Content

1. Teaching Web Scraping
    + Experience
    + Curriculum
    + Issues
    + Approach
2. Developments in Web Scraping
3. Suggestions
4. Conclusion

---

# Experience

- Teaching primers and one-day workshops on web scraping
  - Postgraduate level and faculty
- Workshops based on R
- Participants with basic/intermediate programming skills

---

# Curriculum

![](data:image/png;base64,#index_files/figure-html/curriculum-1.png)<!-- -->

---

# Issues

.pull-left[
- Time constraints (~1h time slot)
- Multiple third parties
  - Website operators
  - Users
  - ...
- Many cases & tools
  - Social media
  - Corporate data
  - APIs, personal websites...
]

.pull-right[
]

---

# Approach

> "Big data? Cheap. Lawyers? Not so much."

.right[<font size="2">— Pete Warden (cit. in Mitchell 2015. Web Scraping with Python)</font>]

- Inform participants (ToS, non-reactive data, ["friendly scraping"](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01))
- Ethics commission
- Introduce tools (e.g. `robotstxt`, `polite`)
- Discuss participants' research

<img src="data:image/png;base64,#./img/CommonCrawl.webp" width="65%" />

<font size="2">Image source: [Rudis (2017)](https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/)</font>

---

# Developments in Web Scraping

.pull-left[
#### First stage

- Abundance of data
- No professional standards

#### Second stage

- Access limited by companies
- Professional deliberation

#### Third stage

- Common standards & rules

.left[.footnote[<font size="3">(Bruns 2019, Puschmann 2019, Salganik 2018)</font>]]
]

.pull-right[
]

---

# Suggestions

.pull-left[
- Case studies
- Guidelines
- History of web scraping <font size="3">(APIcalypse)</font>
- Data scandals (Cambridge Analytica, emotional contagion)
]

.pull-right[
#### Nowcasting migrant stocks

<img src="data:image/png;base64,#./img/Zagheni_et_al_2017_LeveragingFacebooksAdvertisingPlatform.PNG" width="923" />

- Rewards & risks
- ToS (Facebook)
- Users (i.e. migrants)
]

---

# Conclusion

.pull-left[
#### Issues

- Limited time
- (Few) guidelines <font size="3">(but see: Salganik 2018)</font>
- Multiple third parties

#### How to move forward?

- Discuss teaching materials
- Collection of case studies
]

.pull-right[
<br><br><br><br><br><br><br><br><br><br><br><br>

[Slides: https://bit.ly/3CUuU4m](https://fabianfox.github.io/BigDataBigProblems/)

[https://fguelzau.rbind.io](https://fguelzau.rbind.io)

[fabianguelzau@hu-berlin.de](mailto:fabianguelzau@hu-berlin.de)
]

---

# Bibliography

**Bruns, A.** (2019) After the ‘APIcalypse’: social media platforms and their fight against critical scholarly research, Information, Communication & Society, 22:11, 1544-1566, DOI: 10.1080/1369118X.2019.1637447

**Mitchell, R.** (2015) Web scraping with Python. Collecting data from the modern web, Sebastopol: O'Reilly.

**Puschmann, C.** (2019) An end to the wild west of social media research: a response to Axel Bruns, Information, Communication & Society, 22:11, 1582-1589, DOI: 10.1080/1369118X.2019.1646300

**Rudis, B.** (2017) Analyzing “crawl-delay” settings in Common Crawl robots.txt data with R, https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/.

**Salganik, M.J.** (2018) Bit by bit. Social research in the digital age, Princeton: Princeton University Press.

**Zagheni, E., Weber, I. and K. Gummadi** (2017) Leveraging Facebook's advertising platform to monitor stocks of migrants, Population and Development Review, 43:4, 721-732, DOI: 10.1111/padr.12102