
Big Data, Big Problems?

Insights from Teaching Web Scraping

Fabian Gülzau

HU Berlin

30 August, 2021


Content

  1. Teaching Web Scraping
    • Experience
    • Curriculum
    • Issues
    • Approach
  2. Developments in Web Scraping
  3. Suggestions
  4. Conclusion

Experience

  • Teaching primers and one-day workshops on web scraping
    • postgraduate level and faculty
  • Workshops based on R
  • Participants with basic/intermediate programming skills

Curriculum


Issues

  • Time constraints (~1h time slot)
  • Multiple third parties
    • Website operators
    • Users
    • ...
  • Many cases & tools
    • Social media
    • Corporate data
    • APIs, personal websites...
[Diagram: stakeholders. Researcher → Web operator (legalities); Researcher → User (ethics)]

Approach

"Big data? Cheap. Lawyers? Not so much."

— Pete Warden (cit. in Mitchell 2015. Web Scraping with Python)

  • Inform participants (ToS, non-reactive data, "friendly scraping")
  • Ethics commission
  • Introduce tools (e.g. robotstxt, polite)
  • Discuss participants' research

Image source: Rudis (2017)
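
The "friendly scraping" idea above can be made concrete with a small sketch: read a site's robots.txt, keep only the paths it permits, and honour its crawl delay. The workshops themselves are based on R and the robotstxt and polite packages; the version below is only an illustrative analogue that uses Python's standard library so it runs offline. The robots.txt content, the user agent name, the paths, and the helper names `build_parser` and `polite_plan` are all assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

USER_AGENT = "workshop-scraper"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, user_agent, paths, default_delay=1.0):
    """Return the paths the site permits and the delay (in seconds) to wait between requests."""
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        # No Crawl-delay directive: fall back to a conservative default (an assumption).
        delay = default_delay
    allowed = [p for p in paths if parser.can_fetch(user_agent, p)]
    return allowed, float(delay)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    allowed, delay = polite_plan(
        parser, USER_AGENT, ["/public/index.html", "/private/data.html"]
    )
    print(allowed, delay)
```

A real scraper would sleep for `delay` seconds between requests and identify itself with a descriptive user agent. Note that robots.txt only covers the operator's stated crawling preferences; terms of service and the interests of users still need separate consideration.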


Developments in Web Scraping

First stage

  • Abundance of data
  • No professional standards

Second stage

  • Access limited by companies
  • Professional deliberation

Third stage

  • Common standards & rules

(Bruns 2019, Puschmann 2019, Salganik 2018)

[Diagram: developments. Wild west → APIcalypse → Legal & ethical reflections]

Suggestions

  • Case studies
  • Guidelines
  • History of web scraping (APIcalypse)
  • Data scandals (Cambridge Analytica, emotional contagion study)

Nowcasting migrant stocks (Zagheni et al. 2017)

  • Rewards & risks
  • ToS (Facebook)
  • Users (i.e. migrants)

Conclusion

Issues

  • Limited time
  • (Few) guidelines (but see: Salganik 2018)
  • Multiple third parties

How to move forward?

  • Discuss teaching materials
  • Collection of case studies

Bibliography

Bruns, A. (2019) After the ‘APIcalypse’: social media platforms and their fight against critical scholarly research, Information, Communication & Society, 22:11, 1544-1566, DOI: 10.1080/1369118X.2019.1637447

Mitchell, R. (2015) Web scraping with Python: collecting data from the modern web, Sebastopol: O'Reilly.

Puschmann, C. (2019) An end to the wild west of social media research: a response to Axel Bruns, Information, Communication & Society, 22:11, 1582-1589, DOI: 10.1080/1369118X.2019.1646300

Rudis, B. (2017) Analyzing “crawl-delay” settings in common crawl robots.txt data with R, https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/.

Salganik, M.J. (2018) Bit by bit: social research in the digital age, Princeton: Princeton University Press.

Zagheni, E., Weber, I. and Gummadi, K. (2017) Leveraging Facebook's advertising platform to monitor stocks of migrants, Population and Development Review, 43:4, 721-732, DOI: 10.1111/padr.12102
