Big Data, Big Problems?

class: center, middle, inverse, title-slide

# Big Data, Big Problems?
## Insights from Teaching Web Scraping
### Fabian Gülzau
### HU Berlin
### 30 August, 2021

---

# Content

1. Teaching Web Scraping
  + Experience
  + Curriculum
  + Issues
  + Approach
2. Developments in Web Scraping
3. Suggestions
4. Conclusion

---

# Experience

- Teaching primers and one-day workshops on web scraping
  - postgraduate level and faculty
- Workshops based on R
- Participants with basic/intermediate programming skills

---

# Curriculum

![Curriculum](index_files/figure-html/curriculum-1.png)

---

# Issues

.pull-left[
- Time constraints (~1h time slot)
- Multiple third parties
  - Website operators
  - Users
  - ...
- Many cases & tools
  - Social media
  - Corporate data
  - APIs, personal websites...
]

.pull-right[
<div id="htmlwidget-dd31b0c945a1828be265" style="width:80%;height:504px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-dd31b0c945a1828be265">{"x":{"diagram":"\ndigraph stakeholders {\n\ngraph [overlap = true, fontsize = 10, fontname = Montserrat]\n\nnode [shape = box]\nnodeA [label = \"Researcher\"];\nnodeB [label = \"Web\noperator\"];\nnodeC [label = \"User\"]\n\nnodeA->nodeB [label=\"Legalities\", fontsize = 7];\nnodeA->nodeC [label=\" Ethics\", fontsize = 7]\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>
]

---

# Approach

> "Big data? Cheap. Lawyers? Not so much."

.right[<font size="2">Pete Warden (cited in Mitchell 2015, Web Scraping with Python)</font>]

- Inform participants (ToS, non-reactive data, ["friendly scraping"](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01))
- Ethics commission
- Introduce tools (e.g. `robotstxt`, `polite`)
- Discuss participants' research

<font size="2">Image source: [Rudis (2017)](https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/)</font>
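
The kind of check that tools such as `robotstxt` and `polite` automate can be sketched as follows. The workshops use R; this is only a stand-in using Python's standard library, not the workshop material. The robots.txt content, the `workshop-bot` user agent, the `example.org` URLs and the `check_access` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def check_access(robots_txt: str, user_agent: str, url: str):
    """Return (allowed, crawl_delay) for a URL under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url), parser.crawl_delay(user_agent)


# A path that is not disallowed may be fetched; the requested delay is reported.
allowed, delay = check_access(
    ROBOTS_TXT, "workshop-bot", "https://example.org/public/page.html"
)

# A path under /private/ is disallowed for every user agent.
blocked, _ = check_access(
    ROBOTS_TXT, "workshop-bot", "https://example.org/private/data.html"
)
```

A "friendly" scraper would skip any URL for which the first value is `False` and wait at least `delay` seconds between requests.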

---

# Developments in Web Scraping

.pull-left[

#### First stage
- Abundance of data
- No professional standards

#### Second stage
- Access limited by companies
- Professional deliberation

#### Third stage
- Common standards & rules

.left[.footnote[<font size="3">(Bruns 2019, Puschmann 2019, Salganik 2018)</font>]]

]

.pull-right[
<div id="htmlwidget-a737b1ab2c390ef557e5" style="width:50%;height:504px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-a737b1ab2c390ef557e5">{"x":{"diagram":"\ndigraph developments {\n\ngraph [overlap = true, fontsize = 10, fontname = Montserrat]\n\nnode [shape = box]\nnodeA [label = \"Wild west\"];\nnodeB [label = \"APIcalypse\"];\nnodeC [label = \"Legal & ethical\nreflections\"]\n\nnodeA->nodeB\nnodeB->nodeC\n}","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>
]

---

# Suggestions

.pull-left[
- Case studies
- Guidelines
- History of web scraping <font size="3">(APIcalypse)</font>
- Data scandals (Cambridge Analytica, emotional contagion experiment)
]

.pull-right[

#### Nowcasting migrant stocks

- Rewards & risks
- ToS (Facebook)
- Users (i.e. migrants)
]

---

# Conclusion

.pull-left[
#### Issues
- Limited time
- (Few) guidelines <font size="3">(but see: Salganik 2018)</font>
- Multiple third parties

#### How to move forward?
- Discuss teaching materials
- Collection of case studies
]

.pull-right[
<br><br><br><br><br><br><br><br><br><br><br><br>
[<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M326.612 185.391c59.747 59.809 58.927 155.698.36 214.59-.11.12-.24.25-.36.37l-67.2 67.2c-59.27 59.27-155.699 59.262-214.96 0-59.27-59.26-59.27-155.7 0-214.96l37.106-37.106c9.84-9.84 26.786-3.3 27.294 10.606.648 17.722 3.826 35.527 9.69 52.721 1.986 5.822.567 12.262-3.783 16.612l-13.087 13.087c-28.026 28.026-28.905 73.66-1.155 101.96 28.024 28.579 74.086 28.749 102.325.51l67.2-67.19c28.191-28.191 28.073-73.757 0-101.83-3.701-3.694-7.429-6.564-10.341-8.569a16.037 16.037 0 0 1-6.947-12.606c-.396-10.567 3.348-21.456 11.698-29.806l21.054-21.055c5.521-5.521 14.182-6.199 20.584-1.731a152.482 152.482 0 0 1 20.522 17.197zM467.547 44.449c-59.261-59.262-155.69-59.27-214.96 0l-67.2 67.2c-.12.12-.25.25-.36.37-58.566 58.892-59.387 154.781.36 214.59a152.454 152.454 0 0 0 20.521 17.196c6.402 4.468 15.064 3.789 20.584-1.731l21.054-21.055c8.35-8.35 12.094-19.239 11.698-29.806a16.037 16.037 0 0 0-6.947-12.606c-2.912-2.005-6.64-4.875-10.341-8.569-28.073-28.073-28.191-73.639 0-101.83l67.2-67.19c28.239-28.239 74.3-28.069 102.325.51 27.75 28.3 26.872 73.934-1.155 101.96l-13.087 13.087c-4.35 4.35-5.769 10.79-3.783 16.612 5.864 17.194 9.042 34.999 9.69 52.721.509 13.906 17.454 20.446 27.294 10.606l37.106-37.106c59.271-59.259 59.271-155.699.001-214.959z"/></svg> Slides:  https://bit.ly/3CUuU4m](https://fabianfox.github.io/BigDataBigProblems/)

[<svg aria-hidden="true" role="img" viewBox="0 0 576 512" style="height:1em;width:1.12em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M280.37 148.26L96 300.11V464a16 16 0 0 0 16 16l112.06-.29a16 16 0 0 0 15.92-16V368a16 16 0 0 1 16-16h64a16 16 0 0 1 16 16v95.64a16 16 0 0 0 16 16.05L464 480a16 16 0 0 0 16-16V300L295.67 148.26a12.19 12.19 0 0 0-15.3 0zM571.6 251.47L488 182.56V44.05a12 12 0 0 0-12-12h-56a12 12 0 0 0-12 12v72.61L318.47 43a48 48 0 0 0-61 0L4.34 251.47a12 12 0 0 0-1.6 16.9l25.5 31A12 12 0 0 0 45.15 301l235.22-193.74a12.19 12.19 0 0 1 15.3 0L530.9 301a12 12 0 0 0 16.9-1.6l25.5-31a12 12 0 0 0-1.7-16.93z"/></svg> https://fguelzau.rbind.io](https://fguelzau.rbind.io)

[<svg aria-hidden="true" role="img" viewBox="0 0 512 512" style="height:1em;width:1em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"/></svg> fabianguelzau@hu-berlin.de](mailto:fabianguelzau@hu-berlin.de)]

---

# Bibliography

**Bruns, A.** (2019) After the ‘APIcalypse’: social media platforms and their fight against critical scholarly research, Information, Communication & Society, 22:11, 1544-1566, DOI: 10.1080/1369118X.2019.1637447

**Mitchell, R.** (2015) Web scraping with Python. Collecting data from the modern web, Sebastopol: O'Reilly.

**Puschmann, C.** (2019) An end to the wild west of social media research: a response to Axel Bruns, Information, Communication & Society, 22:11, 1582-1589, DOI: 10.1080/1369118X.2019.1646300

**Rudis, B.** (2017) Analyzing “crawl-delay” settings in common crawl robots.txt data with R, https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/.

**Salganik, M.J.** (2018) Bit by bit. Social research in the digital age, Princeton: Princeton University Press.

**Zagheni, E., Weber, I. and Gummadi, K.** (2017) Leveraging Facebook's advertising platform to monitor stocks of migrants, Population and Development Review, 43:4, 721-732, DOI: 10.1111/padr.12102