A great initiative! We need a better URL parser in Scrapy, for similar reasons. ... | Hacker News


kmike84 on March 16, 2024 | on: Parsing URLs in Python

A great initiative! We need a better URL parser in Scrapy, for similar reasons. The main things are speed and WHATWG standard compliance (i.e. behaving the same way web browsers do). It's possible to get closer to WHATWG behavior by using urllib and some hacks; this is what https://github.com/scrapy/w3lib, which Scrapy currently uses, does. But it's still not quite compliant. Also, surprisingly, on some crawls URL parsing can use about as much CPU time as HTML parsing. Ada / can_ada look very promising!
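To illustrate the "urllib and some hacks" point, here is a minimal sketch of the kind of patching involved. It is not w3lib's actual code: `remove_dot_segments` and `browser_like` are hypothetical helpers written for this example, and they cover only two of the many places where `urllib.parse` and the WHATWG URL standard differ (host case and `.`/`..` path segments). Userinfo, percent-encoding, IDNA, and default ports are deliberately ignored.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments of an absolute path (hypothetical helper)."""
    segments = path.split("/")
    out = []
    for i, seg in enumerate(segments):
        is_last = i == len(segments) - 1
        if seg == "..":
            # Never pop the leading empty segment that stands for the root "/".
            if len(out) > 1:
                out.pop()
            if is_last:
                out.append("")  # keep the trailing slash, as browsers do
        elif seg == ".":
            if is_last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def browser_like(url: str) -> str:
    """Patch urllib's result toward browser behavior (hypothetical helper).

    Assumption: the URL is an http(s) URL without userinfo; anything else
    is out of scope for this sketch.
    """
    parts = urlsplit(url)
    netloc = parts.hostname or ""  # .hostname is lowercased, .netloc is not
    if parts.port is not None:
        netloc += f":{parts.port}"
    path = remove_dot_segments(parts.path or "/")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


raw = "HTTP://Example.COM/a/../b?x=1"
print(urlsplit(raw).geturl())  # urllib alone keeps the host case and the '..'
print(browser_like(raw))       # host lowercased, dot segments resolved
```

Every extra rule like these adds Python-level work per URL, which is one reason a dedicated WHATWG parser is attractive for large crawls.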

TkTech on March 16, 2024

can_ada dev here. Scrapy is a fantastic project; we used it extensively at 360pi (now Numerator), making trillions of requests. Let me know if I can help :)
