  1. #1

    Question Looking for help with building a corpus for my thesis

    Hey guys!


    4th year English studies student here.

    I'm about to start working on my thesis and I've decided to focus on spelling and punctuation on the "interwebz". It would have been easier to focus on instant-messaging apps, but I would have a very difficult time getting my hands on actual conversations, so this option is pretty much discarded.


    Instead, I'm thinking of focusing on something that provides conversation-style material, such as a forum or a comments system similar to YouTube's. The point is to have something as close to a conversation as possible.


    I have some ideas already, but nevertheless it would be extremely helpful to get some opinions on sources that could be good material for this kind of corpus and analysis.


    Thanks in advance!

  2. #2
    That depends, how are you pulling the data (e.g. copy/paste--I'm assuming this as you're an English studies student)?

    Also, do you have a stylistic preference? More academic forums will yield more grammatically and syntactically correct data while something like YouTube will likely be drastically different.

  3. #3
    Yes, the plan is to copy-paste the data into AntConc. Once I have built the corpus, I can start analysing it.

  4. #4
    Quote Originally Posted by Kdewd View Post
    4th year English studies student here.
    My condolences?
    It is by caffeine alone I set my mind in motion. It is by the beans of Java that thoughts acquire speed, the hands acquire shakes, the shakes become a warning.

    -Kujako-

  5. #5
    Quote Originally Posted by Kujako View Post
    My condolences?
    I'm still alive and healthy, thanks.

  6. #6
    So it sounds like you want to observe and study the bastardization of the English language via technology, yes?

    If so then you have to go right to the source: Teenagers.

    Find forums centered around shit teenagers like.

    Maybe go dig up the archives of an old Twilight fan forum.

    Probably lots of OH EM GE AATK FCOL RPAT ISH CS&F HEHEH CUL8R!
    MAGA
    When all you do is WIN WIN WIN

  7. #7
    Quote Originally Posted by Kdewd View Post
    I'm still alive and healthy, thanks.
    If you're studying anything that isn't STEM-related, you're a filthy sjw libtard future starbucks barista on this forum

    - - - Updated - - -

    Quote Originally Posted by TrumpIsPresident View Post
    So it sounds like you want to observe and study the bastardization of the English language via technology, yes?

    If so then you have to go right to the source: Teenagers.

    Find forums centered around shit teenagers like.

    Maybe go dig up the archives of an old Twilight fan forum.

    Probably lots of OH EM GE AATK FCOL RPAT ISH CS&F HEHEH CUL8R!
    https://www.facebook.com/Teens4Trump/

    That one's better imo

  8. #8
    Quote Originally Posted by TrumpIsPresident View Post
    So it sounds like you want to observe and study the bastardization of the English language via technology, yes?

    If so then you have to go right to the source: Teenagers.

    Find forums centered around shit teenagers like.

    Maybe go dig up the archives of an old Twilight fan forum.

    Probably lots of OH EM GE AATK FCOL RPAT ISH CS&F HEHEH CUL8R!
    Yes, that's basically what I'll be doing. And, thankfully, my tutor likes the idea too.
    Thanks for your suggestions, much appreciated!

    Quote Originally Posted by Raskayz View Post
    If you're studying anything that isn't STEM-related, you're a filthy sjw libtard future starbucks barista on this forum
    Eh, I don't mind. At least it's not as bad here in Europe. Yet.

    Quote Originally Posted by Raskayz View Post
    https://www.facebook.com/Teens4Trump/

    That one's better imo
    Thanks, I'll have a look!
    Last edited by Kdewd; 2017-01-16 at 04:39 PM.

  9. #9
    Quote Originally Posted by TrumpIsPresident View Post
    So it sounds like you want to observe and study the bastardization of the English language via technology, yes?

    If so then you have to go right to the source: Teenagers.

    Find forums centered around shit teenagers like.

    Maybe go dig up the archives of an old Twilight fan forum.

    Probably lots of OH EM GE AATK FCOL RPAT ISH CS&F HEHEH CUL8R!
    It's a thesis so I'm assuming he needs multiple sources to avoid confirmation bias on the target of the thesis topic and for control groups.

    ...And I just realized that might be applying more sociological study techniques than... hang on, OP, what exactly are you asking for help here on? Your initial statement seems to imply you're actually studying the effects of the Internet medium of conversation on the English language as a whole, but my second read over makes it look more vague. What's the actual study, or are you just asking for random suggestions of 'people typing' on the internet somewhere?

  10. #10
    Deleted
    Why don't you use Trump's Twitter?

  11. #11
    Quote Originally Posted by Myrryr View Post
    It's a thesis so I'm assuming he needs multiple sources to avoid confirmation bias on the target of the thesis topic and for control groups.

    ...And I just realized that might be applying more sociological study techniques than... hang on, OP, what exactly are you asking for help here on? Your initial statement seems to imply you're actually studying the effects of the Internet medium of conversation on the English language as a whole, but my second read over makes it look more vague. What's the actual study, or are you just asking for random suggestions of 'people typing' on the internet somewhere?
    As a first step, yes, I'm basically looking for suggestions of "people typing" in a conversation-like setting (as I said, it would be almost impossible to obtain IM data in English unless random people are willing to give up their personal conversations). Once I have the data, I want to analyze the specific vocabulary, abbreviations, (lack of) punctuation, perception and so on, in comparison to what is considered "grammatical".
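As a rough illustration of the kind of analysis described in the post above, here is a minimal Python sketch that counts a few surface features per message: known abbreviations, missing sentence-final punctuation, and lowercase sentence starts. The abbreviation list and the feature names are assumptions made up for this example, not anything the poster specified, and a real study would likely need a much richer inventory.

```python
import re
from collections import Counter

# Hypothetical abbreviation list; extend it for the actual corpus.
ABBREVIATIONS = {"omg", "lol", "imo", "cul8r", "brb", "idk"}


def tokenize(text):
    """Return lowercase word-like tokens, keeping digits so 'cul8r' survives."""
    return re.findall(r"[a-z0-9']+", text.lower())


def message_features(message):
    """Compute simple spelling and punctuation features for one message."""
    tokens = tokenize(message)
    stripped = message.strip()
    return {
        "tokens": len(tokens),
        "abbreviations": sum(1 for t in tokens if t in ABBREVIATIONS),
        "ends_with_punctuation": stripped.endswith((".", "!", "?")),
        "starts_with_capital": stripped[:1].isupper(),
        # Standalone lowercase "i" used as the pronoun.
        "lowercase_i": len(re.findall(r"\bi\b", message)),
    }


def corpus_summary(messages):
    """Aggregate the per-message features over a list of messages."""
    totals = Counter()
    for message in messages:
        features = message_features(message)
        totals["messages"] += 1
        totals["tokens"] += features["tokens"]
        totals["abbreviations"] += features["abbreviations"]
        totals["missing_final_punctuation"] += not features["ends_with_punctuation"]
        totals["lowercase_start"] += not features["starts_with_capital"]
    return dict(totals)
```

Calling `corpus_summary` on a list of messages from one source and again on a list from another would give comparable counts, which could then be normalised by the number of tokens or messages.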

  12. #12
    Quote Originally Posted by Kdewd View Post
    Yes, the plan is to copy-paste the data into AntConc. Once I have built the corpus, I can start analysing it.
    That sucks ass. Regardless, you could probably use a scraper to make life a little bit easier, assuming you need a large data set for your corpus.

    Regarding material, what were your ideas/reasoning and preferences? Something like Kaggle's forums, for instance, tends to be very formal, if you're looking for formal English.
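To give an idea of what the scraper suggestion above could look like, here is a minimal sketch that pulls post text out of a saved forum page using only Python's standard library. It assumes, purely for illustration, that each post body sits in a `<div class="post-body">` element; the class name, `PostExtractor`, and `extract_posts` are hypothetical, and a real forum will use different markup.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text inside <div class="post-body"> elements.

    The class name is a hypothetical placeholder; inspect the real page
    to find the element that actually wraps each post.
    """

    def __init__(self, post_class="post-body"):
        super().__init__()
        self.post_class = post_class
        self.depth = 0      # nesting depth of <div> inside the current post
        self.current = []   # text fragments of the post being read
        self.posts = []     # finished posts, one string each

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # nested div inside a post
        elif self.post_class in classes:
            self.depth = 1           # a new post starts here
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace left over from the markup.
                self.posts.append(" ".join("".join(self.current).split()))

    def handle_data(self, data):
        if self.depth > 0:
            self.current.append(data)


def extract_posts(html, post_class="post-body"):
    """Return the text of every post found in an HTML string."""
    parser = PostExtractor(post_class)
    parser.feed(html)
    parser.close()
    return parser.posts
```

The resulting strings could be written to plain-text files, one per thread, which is a format AntConc can load.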

  13. #13
    Quote Originally Posted by Kdewd View Post
    As a first step, yes, I'm basically looking for suggestions of "people typing" in a conversation-like setting (as I said, it would be almost impossible to obtain IM data in English unless random people are willing to give up their personal conversations). Once I have the data, I want to analyze the specific vocabulary, abbreviations, (lack of) punctuation, perception and so on, in comparison to what is considered "grammatical".
    Well, if you're looking for snippets of IM conversations, there's always those wonderful screenshots of amusing auto-corrects on phones, but beyond that you'd have to go to a place like /b/ in order to find good IM stories I'd imagine.

    Facebook screenshots would be good. Comment sections of news sites contain people using good and bad English all the time. Even image boards like 4chan have their own conversation threads. Q&As on Reddit are another possible source of different kinds of conversation.

  14. #14
    Quote Originally Posted by bolly View Post
    That sucks ass. Regardless, you could probably use a scraper to make life a little bit easier, assuming you need a large data set for your corpus.

    Regarding material, what were your ideas/reasoning and preferences? Something like Kaggle's forums, for instance, tends to be very formal, if you're looking for formal English.
    My bet would be on informal English as this, combined with internet slang, would show more differences when compared to formal/standard language.

  15. #15
    Quote Originally Posted by Kdewd View Post
    informal English
    Facebook.

    /10char

  16. #16
    Quote Originally Posted by Myrryr View Post
    Facebook.

    /10char
    Hmm, this was one of my first thoughts as well, though I wasn't (and still am not) too sure about it, because FB users tend to use their mobiles (not saying they all do, it's just my impression), which autocorrect your sentences and punctuation.

    Maybe I just need to dig deeper until I can find more... "obscure" pages.

  17. #17
    If you're trying to avoid any form of auto-correct, your best bet will actually be to ask for things like guild chat logs from games, as very few games, if any, have auto-correct in their chat. Most other chat programs will have some form of it today, even if it's just the wonderful red squiggly line flagging the word to the user.

  18. #18
    4chan has interesting conversations from time to time.

  19. #19
    Quote Originally Posted by Kdewd View Post
    My bet would be on informal English as this, combined with internet slang, would show more differences when compared to formal/standard language.
    You're likely going to have to filter by demographic if you want that kind of contrast, but doing so would introduce notable biases of its own. Some niches like the hip-hop channels/sites will commonly bring internet slang as well as hood slang, while gaming channels/sites will probably make Harambe your most popular word.

  20. #20
    Quote Originally Posted by bolly View Post
    You're likely going to have to filter by demographic if you want that kind of contrast, but doing so would introduce notable biases of its own. Some niches like the hip-hop channels/sites will commonly bring internet slang as well as hood slang, while gaming channels/sites will probably make Harambe your most popular word.
    I see where you're coming from, though things aren't set in stone yet. I will also have to talk to my tutor again soon. For now, though, having this discussion here helps a lot.

    Quote Originally Posted by Myrryr View Post
    If you're trying to avoid any form of auto-correct, your best bet will actually be to ask for things like guild chat logs from games, as very few games, if any, have auto-correct in their chat. Most other chat programs will have some form of it today, even if it's just the wonderful red squiggly line flagging the word to the user.

    Yup, I agree about the guild chat logs thing. Another similar idea would be to contact a site like "Chatroulette" and the like and ask whether they're willing to provide some of their logs.
